Posts
Comments
To the people disagreeing, what part do you disagree with? My main point, or my example? Or something else?
I think this is especially important for me/us to remember. On this site we often have a complex way of thinking and a high computational budget (because we like exercising our brains to failure), and if we speak freely to the average person, they may be annoyed at how hard it is to parse what we are saying.
We've all probably had this experience when genuinely trying to understand someone from a very different background. Perhaps they are trying to describe their inner experience when meditating, or Japanese poetry, or are simply from a different discipline. Or perhaps we were just very tired that day, meaning we had a low computational budget.
On the other hand, we are often a "tell" culture, which has a lower computational load compared to ask or guess culture. As long as we don't tell too much.
I would add:
- Must also generalise better than capabilities!
- out of distribution
- to smarter models
Currently, we do not know how to make sure machine learning generalises well out of sample. This is an open problem that is critical to alignment. I find that it's left out of evals frustratingly often, probably because it's hard, and most methods miserably fail to generalise OOD.
For example, you don't want your ASI to become unaligned, have value drift, or extrapolate human values poorly when 1) it meets aliens, 2) 1000 years pass, or 3) cultural drift happens. If your descendants think it's admirable and funny to take hostages as a form of artistic practical joke, you would hope that your AIs would handle that in a principled and adaptable manner. At the very least, you want its capability to fail before its morality.
- An overlooked benchmark of text OOD: GENIES
- mention of how OOD is an open problem in "Towards out of distribution generalization for problems in mechanics"
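To make the "left out of evals" point concrete, here is a toy sketch with purely synthetic, made-up data (not taken from either reference above): a model that scores near-perfectly on an in-distribution test set can drop to chance on a shifted test set governed by the same underlying rule, which is why an explicit OOD split belongs in the eval.

```python
# Toy sketch: in-distribution accuracy can look great while OOD accuracy
# collapses, even though the underlying rule never changes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_data(n, low, high):
    x = rng.uniform(low, high, size=(n, 2))
    y = (x[:, 0] > x[:, 1]).astype(int)  # the same underlying rule everywhere
    return x, y

X_train, y_train = make_data(2000, 0.0, 1.0)  # training distribution
X_id, y_id = make_data(500, 0.0, 1.0)         # in-distribution test
X_ood, y_ood = make_data(500, 5.0, 6.0)       # shifted (OOD) test

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("ID accuracy: ", clf.score(X_id, y_id))    # typically ~0.99
print("OOD accuracy:", clf.score(X_ood, y_ood))  # typically near chance (~0.5)
```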
One blind spot we rationalists sometimes have is that charismatic people actually treat the game as:
"Can I think of an association that will make the other person feel good and/or further my goal?". You need people to feel good, or they won't participate. And if you want some complicated/favour/uncomftorble_truth then you better mix in some good feels to balance it out and keep the other person participating.
To put it another way: If you hurt people's brain or ego, rush them, or make them feel unsure, or contradict them, then most untrained humans will feel a little bad. Why would they want to keep feeling bad? Do you like it when people don't listen, contradict you, insult you, rush you, disagree with you? Probably not, probobly no one does.
But if someone listens to you, smiles at you, likes you, has a good opinion of you, agrees with you, make sense to you. Then it feels good!
This might sound dangerously sycophantic, and that's because it is - if people overdo it! But if it's mixed with some healthy understanding, learning, informing then It's a great conversational lubricant, and you should apply as needed. It just ensures that everyone enjoys themselves and comes back for more, counteracting the normal frictions of socialising.
There are books about this. "How to Win Friends and Influence People" recommends talking about the other person's interests (including themselves) and listening to them, which they will enjoy.
So I'd say, don't just free associate. Make sure it's fun for both parties, make room to listen to the other person, and to let them steer. (And ideally your conversational partner reciprocates, but that is not guaranteed).
Is machine learning in a period of multiple discovery?
Anecdotally, it feels as though we have entered a period of multiple discovery in machine learning, with numerous individuals coming up with very similar ideas.
Logically, this can be expected when more people pursue the same low-hanging fruit. Imagine orchards in full bloom with a crowd of hungry gatherers. Initially, everyone targets the nearest fruit. Exploring a new scientific frontier can feel somewhat similar. When reading the history books on the Enlightenment, I get a similar impression.
If we are indeed in a period of multiple discovery, we should not simply go after the nearest prize; it will soon be claimed. Your time is better spent looking further afield or exploring broader horizons.
Is any of this backed by empirical evidence? No! I have simply plotted Wikipedia's list of multiple discoveries. It shows multiple discoveries increasing with population, but I don't see any distinct periods, so it's inconclusive.
I made up the made-up numbers in this table of made-up numbers; therefore, the numbers in this table of made-up numbers are made-up numbers.
These hallucinated outputs are really getting out of hand
Some reports of people who have tried it https://old.reddit.com/r/financialindependence/comments/a9h20a/has_anyone_fired_to_a_boat_full_time_how_did_it/
In particular, I'd be keen to know what @Stag and @technicalities think, as this was in large part inspired by the desire to further simplify and categorise the "one sentence summaries" from their excellent Shallow review of live agendas in alignment & safety
If anyone finds this useful, please let me know. I've abandoned it because none of my test audience found it interesting or useful. That's OK, it just means it's better to focus on other things.
Draft: A cartoonishly simplified overview of some technical AI alignment efforts
Epistemic status: excessive lossy compression applied
How are people actually trying to make friendly AI? Here are a few simplified examples.
LessWrong has some great technical and critical overviews of alignment agendas, but for many readers they take too long to read.
But first, what is the problem we are trying to solve? I would say that we want to program AIs that have our values (or better). And we want those values to persist, even when the AI gets smarter, even after long spans of subjective time, and even when encountering situations humans have never encountered before.
This is hard because we don't know how to do any of those things! So how do people propose to solve this?
Here's my attempt at cartoonishly simplified explanations of technical alignment efforts:
-
- Just make a friendly dumb AI that will make a smarter friendly AI, and so on - Iterated Amplification
- Just make AIs debate each other, so that the truth comes out when we look at both sides of the argument - Debate
- Just have the AI follow a constitution - Constitutional AI
- Just make a friendly automatic alignment researcher - Superalignment
-
Let's have lots of AIs interacting:
- Just have multiple focused AIs competing in various roles, kind of like a corporation - Drexler's The Open Agency Model
- Just use dumb AIs to build an understandable world simulation, then train a smart AI in that simulation so that we can verify that it's aligned - davidad's plan
- Just have the AI learn and respect other beings' boundaries - Boundaries/Membranes
-
Let's make sure the goal is good:
- Just design the AI to want to help humans but be maximally uncertain about how to do that, so it constantly seeks guidance - CIRL
- Just make a lazy AI that is satisfied after doing "enough" - Mild Optimization
-
Let's build tools that will let us control smarter AIs:
- Just read their minds - ELK
- Just edit their minds - Activation Steering
-
Let's understand more:
- What it means to be an agent - agent foundations
Some historic proposals sounded promising but seem to have been abandoned for now. I include these to show how hard the problem is:
- Just keep the AI in a sandbox environment where it can't cause any real-world harm while we continue to align and train it - AI Boxing
- Just make an oracle AI that only talks, not acts
- Just simulate a human thinking for a very long time about the alignment problem, and use whatever solution they write down - Christiano's (2012) indirect normativity proposal
- Do what we would wish for if we knew more, thought faster, were more the people we wished we were - coherent extrapolated volition (CEV)
I've left out the many debates over the proposals. I'm afraid that you need to dig much deeper to judge which methods will work. If you want to know more, just follow the links below.
If you dislike this: please help me make it better by contributing better summaries, and I'll be pleased to include them.
If you would like to know more, I recommend these overviews:
- 2023 - Shallow review of live agendas in alignment & safety - I've drawn heavily from this post, which has one-sentence summaries as well as much, much more
- 2022 - A newcomer’s guide to the technical AI safety field
- 2023 - A Brief Overview of AI Safety/Alignment Orgs, Fields, Researchers, and Resources for ML Researchers
- 2023 - The Genie in the Bottle: An Introduction to AI Alignment and Risk
- 2022 - On how various plans miss the hard bits of the alignment challenge
- 2022 - (My understanding of) What Everyone in Technical Alignment is Doing and Why
If we regularise human values, what would happen?
We want AI to take up human values, in a way that generalizes out of distribution better than our capabilities. In other words, if our AI meets friendly aliens, we want its gun skills to fail before its morals.
Human values - as expressed by actions - seem messy, inconsistent, and hypocritical. If we forced an AI to regularise them, would it be wiser, more consistent, and simply good? Or would it oversimplify and make a brittle set of values?
I would predict that a simplified and more consistent form of human values would extrapolate better to a new situation. But if they are too simplified, they will be brittle and fail to generalize or perform how we want. I expect there to be quite a lot of simplification that can be performed initially, and I expect it to be interesting and helpful.
Perhaps we could even remove the parts of human values that are to an individual's advantage. Removing self-preservation and self-advantage, then ensuring consistency, would lead to an interesting set of values.
It would be an interesting experimental direction. Perhaps an early experiment could make LLM answers conditional on a latent space which encodes values, much like the image generators that are conditional on style. As a bonus, you would have a nice regular space of values which you could examine and play with.
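A minimal sketch of what that experiment might look like, assuming a soft-prompt style setup. The model name, the ValueEncoder module, and the training objective are placeholders I made up for illustration, not a tested method.

```python
# Sketch: condition a causal LM on a learned "values" latent z by mapping z to
# a few soft-prompt embeddings prepended to the input.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)

class ValueEncoder(nn.Module):
    """Maps a low-dimensional 'values' latent z to n_prefix soft-prompt embeddings."""
    def __init__(self, z_dim, n_prefix, d_model):
        super().__init__()
        self.n_prefix, self.d_model = n_prefix, d_model
        self.proj = nn.Linear(z_dim, n_prefix * d_model)

    def forward(self, z):
        return self.proj(z).view(-1, self.n_prefix, self.d_model)

enc = ValueEncoder(z_dim=8, n_prefix=4, d_model=lm.config.hidden_size)

z = torch.randn(1, 8)                       # a point in the value latent space
prefix = enc(z)                             # (1, n_prefix, d_model)

ids = tok("Should I return the lost wallet?", return_tensors="pt").input_ids
tok_emb = lm.get_input_embeddings()(ids)    # (1, seq_len, d_model)
inputs_embeds = torch.cat([prefix, tok_emb], dim=1)

out = lm(inputs_embeds=inputs_embeds)       # train enc (and optionally lm) so
logits = out.logits                         # that z controls the values expressed
```

The payoff would be the regular latent space itself: once trained, you could interpolate between points in z, put a prior on it, and inspect how the model's answers change.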
Interesting, that changes my mind somewhat.
I wonder why this happens?! I can't find any examples of this in the Pile, and their filtering doesn't seem to add it. It's hard to imagine humans generating this particular refusal response. Perhaps it's a result of filtering.
Are these really pure base models? I've also noticed this kind of behaviour in so-called base models. My conclusion is that they are not base models in the sense of having only been trained to predict the next word on the internet; they seem to have undergone some safety fine-tuning before release. We don't actually know how they were trained, and I am suspicious. It might be best to test on Pythia or some model where we actually know.
I wonder where the best places to write are. I'd say Reddit and GitHub are good bets, but you would have to get through their filtering, for karma, stars, language, subreddit etc.
CMIIW, you are looking at information content according to the LLM, but that's not enough. It has to be learnable information content to avoid the noisy TV problem. E.g. a random sequence of tokens will be unpredictable and have high perplexity. But if it's learnable, then it has potential.
I had a go at a few different approaches here https://github.com/wassname/detect_bs_text
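One rough way to operationalise "learnable information" (this is just a sketch of the idea, not what the repo above does): train briefly on the first half of a snippet and see whether loss on the held-out second half drops. Random token noise is high-perplexity but not learnable; structured-but-novel text is both.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

def surprise_and_learnability(text, steps=5, lr=1e-4):
    """Return (held-out loss before, reduction in held-out loss after a few
    gradient steps on the first half). A crude proxy, for illustration only."""
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    ids = tok(text, return_tensors="pt").input_ids
    half = ids.shape[1] // 2
    train_ids, held_ids = ids[:, :half], ids[:, half:]

    def held_loss():
        model.eval()
        with torch.no_grad():
            return model(held_ids, labels=held_ids).loss.item()

    before = held_loss()
    model.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = model(train_ids, labels=train_ids).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return before, before - held_loss()
```

On random token strings the second number should stay near zero; on coherent but novel text it should tend to be positive, though the effect may be small with so few steps.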
But if it's a poor split, wouldn't it also favour the baseline (your software)? And they did beat the baseline. So if your concern is correct, they did outperform the baseline, but the split just didn't realistically measure generalisation to radically different structures.
So it's not fair to say 'it's only memorisation'. It seems fairer to say 'it doesn't generalise enough to be docking software, and this is not obvious at first due to a poor choice of train test split'.
I thought this too, until someone in finance told me to google "Theranos Board of Directors", so I did, and it looked a lot like OpenAI's new board.
This provides an alternate hypothesis: that it signals nothing substantial. Perhaps it's empty credentialism, or empty PR, or a cheap attempt to win military contracts.
It could indicate the importance of security, which is safe. Or of escalation in a military arms race, which is unsafe.
Also DEIR needs to implicitly distinguish between things it caused, and things it didn't https://arxiv.org/abs/2304.10770
Your "given a lab environment where we strongly expect an AI to be roughly human level" seems to assume the thing to be proven.
But if we are being Bayesians here, then it seems to become clearer. Let's assume that the law says we can safely and effectively evaluate an AGI that's truly 0-200% as smart as the smartest human alive (to be revisited as our alignment tools improve). Now, as you say, in the real world there is dangerous uncertainty about this. So how much probability do we put on, say, GPT-5 exceeding that limit? That's our risk, and either it fits within our risk budget or it doesn't. If there were a regulatory body overseeing this, you would need to submit the evidence to back this up, and they would hire smart people or AI to make the judgments (the FDA does this now, to mixed effect). The applicant can put in work to derisk the prospect.
With a brand-new architecture, you are right, we shouldn't scale it up and test the big version; that would be too risky! We evaluate small versions first and use that to establish priors about how capabilities scale. This is generally how architecture improvements work anyway, so it wouldn't require extra work.
Does that make sense? To me, this seems like a more practical approach than once-and-done alignment, or pausing everything. And most importantly of all, it seems to set up the incentives correctly - commercial success depends on convincing us you have narrower AI, better alignment tools, or increased human intelligence.
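As a toy version of the risk-budget arithmetic (all numbers invented): summarise the applicant's evidence as a distribution over the new model's capability relative to the smartest human, and compare the probability mass above the legal limit to the risk budget.

```python
import numpy as np

rng = np.random.default_rng(0)
limit = 2.0          # "0-200% as smart as the smartest human"
risk_budget = 0.01   # maximum acceptable probability of exceeding the limit

# Illustrative only: evidence (e.g. scaling-law extrapolation from small
# versions) summarised as a lognormal over relative capability.
samples = rng.lognormal(mean=np.log(0.8), sigma=0.4, size=100_000)
p_exceed = (samples > limit).mean()

print(f"P(capability > limit) ~ {p_exceed:.4f}")
print("approve" if p_exceed <= risk_budget else "reject / require more derisking")
```

Under this framing, "derisking" just means shrinking the tail above the limit: tighter scaling evidence, smaller models, or better evals.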
If you have an approach I think is promising, I'll drop everything and lobby funders to give you a big salary to work on your approach.
I wish I did, mate ;p Even better than a big salary, it would give me and my kids a bigger chance to survive.
I'm not so sure. Given a lab environment where we strongly expect an AI to be roughly human level, and where we have the ability to surgically edit its brain (e.g. with mech interp), revert to a previous checkpoint, and put it in simulated environments, I would say it's pretty easy. This is the situation we have right now.
If we limit AI development to those that we reasonably expect to be in this range, we might stay in this situation.
I do agree that a secret ASI would be much more difficult, but that's why we don't even attempt to risk a large expected intelligence gap/ratio.
We should limit the intelligence gap between machines and humans to, say, 150%. I think control will always fail for a big enough gap. The size of the gap we can manage will depend on how good our alignment and control tools turn out to be.
The strategy here is to limit machine intelligence to at most C% of the smartest human's intelligence (I_H), e.g. C = 150%. That way, the ceiling on the smartest AI (I_AI) is a function of 1) how good our alignment/control tools are and 2) how much we can increase human intelligence.
I_AI ≤ I_H × C
Definitely A, and while it's clear MIRI means well, I'm suggesting a focus on preventing military and spy arms races in AI. Because it seems like a likely failure mode, which no one is focusing on. It seems like a place where a bunch of blunt people can expand the Overton window to everyone's advantage.
MIRI has used nuclear non-proliferation as an example (getting lots of pushback). But non-proliferation did not stop new countries from getting the bomb, and it certainly did not stop existing countries from scaling up their nuclear arsenals. Global de-escalation after the end of the Cold War is what caused that. For example, look at this graph: it doesn't go down after the 1968 treaty, it goes down after the Cold War (>1985).
We would not want to see a similar situation with AI, where existing countries race to scale up their efforts and research.
This is in no way a criticism, MIRI is probably already doing the most here, and facing criticism for it. I'm just suggesting the idea.
Have you considered emphasizing this part of your position:
"We want to shut down AGI research including governments, military, and spies in all countries".
I think this is an important point that is missed in current regulation, which focuses on slowing down only the private sector. It's hard to achieve because policymakers often favor their own institutions, but it's absolutely needed, so it needs to be said early and often. This will actually win you points with the many people who are cynical of the institutions, who are not just libertarians, but a growing portion of the public.
I don't think anyone is saying this, but it fits your honest and confronting communication strategy.
Just build the good Johns, but not the bad Johns.
This argument is empirical, while the orthogonality hypothesis is merely philosophical, which means this is a stronger argument imo.
But this argument does not imply alignment is easy. It implies that acute goals are easy, while orthogonal goals are hard. Therefore, a player of games agent will be easy to align with power-seeking, but hard to align with banter. A chat agent will be easy to align with banter and hard to align with power-seeking.
We are currently in the chat phase, which seems to imply easier alignment to chatty huggy human values, but we might soon enter the player of long-term games phase. So this argument implies that alignment is currently easier, and if we enter the era of RL long-term planning agents, it will get harder.
Feel free to suggest improvements; this is just what worked for me, and it's limited in format.
If you are using llama you can use https://github.com/wassname/prob_jsonformer, or snippets of the code, to get probabilities over a selection of tokens.
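For anyone who just wants the core trick without the library, here is a hedged sketch. The model name and candidate strings are placeholders, and it assumes each candidate is a single token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"  # or any causal LM you have access to
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

prompt = "Answer yes or no: is water wet? Answer:"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0, -1]          # logits for the next token
probs = logits.softmax(-1)

candidates = [" yes", " no"]                   # note the leading space
cand_ids = [tok(c, add_special_tokens=False).input_ids[0] for c in candidates]
cand_probs = probs[torch.tensor(cand_ids)]
cand_probs = cand_probs / cand_probs.sum()     # renormalise over the selection

for c, p in zip(candidates, cand_probs):
    print(f"{c!r}: {p.item():.3f}")
```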
That's true, they are different. But search still provides the closest historical analogue (maybe employees/suppliers provide another). Historical analogues have the benefit of being empirical and grounded, so I prefer them over (or with) pure reasoning or judgement.
When you rephrase this to be about search engines:
I think the main reason why we won't censor search to some abstract conception of "community values" is because users won't want to rent or purchase search services that are censored to such a broad target
It doesn't describe reality. Most of us consume search and recommendations that have been censored (e.g. removing porn, piracy, toxicity, racism, taboo politics) in a way that puts cultural values over our preferences or interests.
So perhaps it won't be true for AI either. At least in the near term, the line between AI and search is blurred, and the same pressures exist on consumers and providers.
A before and after would be even better!
Thanks, but this doesn't really give insight into whether this is normal or enforceable. So I wanted to point out that we don't know if it's enforceable, and have not seen a single legal opinion.
Thanks, I hadn't seen that, I find it convincing.
He might have returned to work, but agreed to no external comms.
Interesting! For most of us, this is outside our area of competence, so I appreciate your input.
Are you familiar with USA NDAs? I'm sure there are lots of clauses that have been ruled invalid by case law. In many cases, non-lawyers have no idea about these, so you might be able to make a difference with very little effort. There is also the possibility that valuable OpenAI shares could be rescued?
If you haven't seen it, check out this thread where one of the OpenAI leavers did not sign the gag order.
It could just be because it reaches a strong conclusion on anecdotal/clustered evidence (e.g. it might say more about her friend group than anything else), along with claims to being better calibrated for weak reasons - which could be true, but seems not very epistemically humble.
Full disclosure: I downvoted karma because I don't think it should be the top reply, but I did not cast an agree or disagree vote.
But Jen seems cool, I like weird takes, and downvotes are not a big deal - just a part of a healthy contentious discussion.
Notably, there are some lawyers here on LessWrong who might help (possibly even for the lols, you never know). And you can look at case law and guidance to see if clauses are actually enforceable or not (many are not). To anyone reading, here's habryka doing just that
One is the change to the charter to allow the company to work with the military.
https://news.ycombinator.com/item?id=39020778
I think the board must be thinking about how to get some independence from Microsoft, and there are not many entities who can counterbalance one of the biggest companies in the world. The government's intelligence and defence industries are some of them (as are Google, Meta, Apple, etc). But that move would require secrecy: to avoid stoking a nationalistic race, because of contractual obligations, and to avoid a backlash.
EDIT: I'm getting a few disagrees, would someone mind explaining why they disagree with these wild speculations?
Here's something I've been pondering.
Hypothesis: if transformers have internal concepts, and those concepts are represented in the residual stream, then because we have access to 100% of that information, it should be possible for a non-linear probe to get 100% out-of-distribution accuracy. 100% is important because we care about how a thing like value learning will generalise OOD.
And yet we don't get 100% (in fact most metrics are measured on much easier settings than what we care about: in-distribution, or on careful setups). Which of the hypothesis's assumptions is wrong, do you think?
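For concreteness, the experiment I have in mind looks roughly like this. The activation arrays are placeholders (random data just so the snippet runs); in practice you would extract residual-stream activations with something like TransformerLens and label them for the concept of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholders: replace with real residual-stream activations (n, d_model)
# and concept labels (n,) for an in-distribution and a shifted distribution.
rng = np.random.default_rng(0)
acts_in, labels_in = rng.normal(size=(2000, 512)), rng.integers(0, 2, 2000)
acts_ood, labels_ood = rng.normal(loc=0.5, size=(500, 512)), rng.integers(0, 2, 500)

X_tr, X_id, y_tr, y_id = train_test_split(acts_in, labels_in,
                                          test_size=0.2, random_state=0)

probe = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(256,), max_iter=500))
probe.fit(X_tr, y_tr)

print("in-distribution acc:", probe.score(X_id, y_id))
print("OOD acc:            ", probe.score(acts_ood, labels_ood))
```

If the concept really were fully recoverable from the residual stream, the OOD accuracy "should" approach 100%; the interesting question is which assumption breaks when it doesn't.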
better calibrated than any of these opinions, because most of them don't seem to focus very much on "hedging" or "thoughtful doubting"
new observations > new thoughts when it comes to calibrating yourself.
The best calibrated people are people who get lots of interaction with the real world, not those who think a lot or have a complicated inner model. Tetlock's super forecasters were gamblers and weathermen.
I think this only holds if fine tunes are composable, which as far as I can tell they aren't
Anecdotally, a lot of people are using mergekit to combine fine tunes
it feels less surgical than a single direction everywhere
Agreed, it seems less elegant. But one guy on Hugging Face did a rough plot of the cross-correlation, and it seems to show that the direction changes with layer https://huggingface.co/posts/Undi95/318385306588047#663744f79522541bd971c919. Although perhaps we are missing something.
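The check itself is cheap to reproduce, something like this (the directions array is a placeholder; you would substitute one candidate refusal direction per layer, e.g. from a diff-of-means):

```python
import numpy as np

# Placeholder so the snippet runs; replace with your per-layer directions,
# shape (n_layers, d_model).
rng = np.random.default_rng(0)
directions = rng.normal(size=(32, 4096))

unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
cos_sim = unit @ unit.T              # (n_layers, n_layers) cosine similarities

print(np.round(cos_sim[:6, :6], 2))  # off-diagonals well below 1 suggest the
                                     # direction drifts across layers
```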
Note that you can just do torch.save(model.state_dict(), FILE_PATH) as with any PyTorch model.
omg, I totally missed that, thanks. Let me know if I missed anything else, I just want to learn.
The older versions of the gist are in TransformerLens, if anyone wants those versions. In those, the interventions work better since you can target resid_pre, resid_mid, etc.
If anyone wants to try this on llama-3 8b, I converted the colab to baukit, and it's available here.
So I ran a quick test (running llama.cpp perplexity command on wiki.test.raw )
- base_model (Meta-Llama-3-8B-Instruct-Q6_K.gguf): PPL = 9.7548 +/- 0.07674
- steered_model (llama-3-8b-instruct_activation_steering_q8.gguf): PPL = 9.2166 +/- 0.07023
So perplexity actually went down, but that might be because the base model I used was more heavily quantized. However, it is moderate evidence that the output-quality cost of activation steering is smaller than that of Q8->Q6 quantisation.
I must say, I am a little surprised by what seems to be the low cost of activation editing. For context, many of the Llama-3 finetunes right now come with a measurable hit to output quality, mainly because they use worse fine-tuning data than the data Llama-3 was originally fine-tuned on.
maintaining model coherence
To determine this, I believe we would need to demonstrate that the score on some evaluations remains the same. A few examples don't seem sufficient to establish this, as it is too easy to fool ourselves by not being quantitative.
I don't think DanH's paper did this either. So I'm curious, in general, whether these models maintain performance, especially on measures of coherence.
In the open-source community, they show that modifications retain, for example, the MMLU and HumanEval score.
This is pretty good. It has a lot in it, being a grab bag of things. I particularly enjoyed the scalable oversight sections which succinctly explained debate, recursive reward modelling etc. There were also some gems I hadn't encountered before, like the concept of training out agentic behavior by punishing side-effects.
If anyone wants the HTML version of the paper, it is here.
Maybe our culture fits our status-seeking surprisingly well because our culture was designed around it.
We design institutions to channel and utilize our status-seeking instincts. We put people in status conscious groups like schools, platoons, or companies. There we have ceremonies and titles that draw our attention to status.
And this works! Ask yourself: is it more effective to educate a child individually or in a group of peers? The latter. Is it easier to lead a solitary soldier or a whole squad? The latter. Do people seek a promotion or a pay rise? Both, probably. The fact is that people are easier to guide when in large groups, and easier to motivate with status symbols.
From this perspective, our culture and inclination for seeking status have developed in tandem, making it challenging to determine which influences the other more. However, it appears that culture progresses more rapidly than genes, suggesting that culture conforms to our genes, rather than the reverse.
Another perspective: sometimes our status-seeking is nonfunctional and therefore nonaligned. For example, we also waste a lot of effort on status, which seems like a nonfunctional drive. People will compete for high-status professions like musician, streamer, or celebrity, and most will fail, which makes it seem like an unwise investment of time. This seems misaligned, as it's not adaptive.
would probably eventually stop treating you as a source of new information once it had learned a lot from you, at which point it would stop being deferential.
It seems that 1) when extrapolating to new situations, 2) if you add a term to decay the relevance of old information (pretty standard in RL), or 3) if you add a minimum bound on uncertainty, then it would remain deferential.
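A toy sketch of why points 2) and 3) keep the agent deferential (numbers invented; this is a Gaussian stand-in for a CIRL-style posterior over the human's reward parameter, not anyone's actual algorithm):

```python
import numpy as np

def posterior_var(n_obs, obs_noise=1.0, prior_var=1.0, decay=0.0, var_floor=0.0):
    """Posterior variance over the human's reward parameter after n_obs noisy
    observations. `decay` down-weights old observations (effective sample size
    saturates); `var_floor` is a hard lower bound on uncertainty."""
    if decay > 0:
        eff_n = (1 - (1 - decay) ** n_obs) / decay   # geometric forgetting
    else:
        eff_n = n_obs
    var = 1.0 / (1.0 / prior_var + eff_n / obs_noise)
    return max(var, var_floor)

for n in [1, 10, 100, 10_000]:
    print(n,
          round(posterior_var(n), 5),                  # -> 0: eventually stops deferring
          round(posterior_var(n, decay=0.01), 5),      # saturates: keeps listening
          round(posterior_var(n, var_floor=0.05), 5))  # bounded below: always defers a bit
```

As long as the variance stays bounded away from zero, the agent keeps treating the human as a source of new information.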
In other words, it doesn't seem like an unsolvable problem, just an open question. But every other alignment agenda also has numerous open questions. So why the hostility?
Academia and LessWrong are two different groups, which have different cultures and jargon. I think they may be overly skeptical towards each other's work at times.
It's worth noting though that many of the nice deferential properties may appear in other value modelling techniques (like recursive reward modelling at OpenAI).
A nice introductory essay, seems valuable for entrants.
There are quite a few approaches to alignment beyond CIRL and Value Learning. https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/shallow-review-of-live-agendas-in-alignment-and-safety