Something hashed with shasum -a 512
2d90350444efc7405d3c9b7b19ed5b831602d72b4d34f5e55f9c0cb4df9d022c9ae528e4d30993382818c185f38e1770d17709844f049c1c5d9df53bb64f758c
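For anyone who wants to check a hash commitment like this once the original text is revealed, here is a minimal sketch. The helper names are hypothetical, and exactly how the text was fed to shasum is an assumption: if it was piped in with a plain echo, the hashed bytes include a trailing newline, so it is worth trying both variants.

```python
import hashlib

def sha512_hex(data: bytes) -> str:
    """Hex SHA-512 digest, the same algorithm `shasum -a 512` uses."""
    return hashlib.sha512(data).hexdigest()

def matches_commitment(candidate: bytes, published_digest: str) -> bool:
    """True if the candidate bytes hash to the published hex digest."""
    return sha512_hex(candidate) == published_digest.strip().lower()

# Hypothetical revealed text, NOT the real preimage of the hash above.
revealed = "some revealed text"
demo_digest = sha512_hex((revealed + "\n").encode("utf-8"))

# Try with and without the trailing newline that echo would add.
for variant in (revealed, revealed + "\n"):
    print(repr(variant), matches_commitment(variant.encode("utf-8"), demo_digest))
```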
Isn't this a consequence of how the tokens get formed using byte pair encoding? It first constructs ' behavi' and then it constructs ' behavior' and then will always use the latter. But to get to the larger words, it first needs to create smaller tokens to form them out of (which may end up being irrelevant).
Edit: some experiments with the GPT-2 tokenizer reveal that this isn't a perfect explanation. For example " behavio" is not a token. I'm not sure what is going on now. Maybe if a token shows up zero times, it cuts it?
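To make the byte pair encoding intuition concrete, here is a toy trainer. It is only a sketch: the two-word corpus and its counts are made up, ties are broken by first occurrence, and it is not GPT-2's actual tokenizer or merge order. It shows that intermediate merges stay in the vocabulary even when the final tokenization of the corpus never uses them.

```python
from collections import Counter

def train_bpe(word_counts, num_merges):
    """Toy byte-pair-encoding trainer over a dict of word -> frequency."""
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    vocab = {ch for word in corpus for ch in word}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite the corpus with the chosen pair merged everywhere.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return vocab, merges, corpus

# Hypothetical corpus: two words that share a prefix.
vocab, merges, corpus = train_bpe({" behavior": 10, " behave": 4}, num_merges=20)
used = {symbol for word in corpus for symbol in word}
leftover = sorted(t for t in vocab if len(t) > 1 and t not in used)
print("merged tokens never used in the final tokenization:", leftover)
```

In a real corpus, a pair like "or" is frequent in many other words and would likely be merged much earlier, which may be part of why the set of intermediate tokens in GPT-2 does not look like a clean chain of prefixes.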
Maybe you are right, since averaging and scaling does result in pretty good steering (especially for coding). See here.
This seems to be right for the coding vectors! When I take the mean of the first n vectors and then rescale it, it also produces a coding vector.
Here's some sample output from using the scaled means of the first n coding vectors.
With the scaled means of the alien vectors, the outputs have a pretty similar vibe to the original alien vectors, but don't seem to talk about bombs as much.
The STEM problem vector scaled means sometimes give more STEM problems but sometimes give jailbreaks. The jailbreaks say some pretty nasty stuff so I'm not going to post the results here.
The jailbreak vector scaled means sometimes give more jailbreaks but also sometimes tell stories in the first or second person. I'm also not going to post the results for this one.
After looking more into the outputs, I think the KL-divergence plots are slightly misleading. In the code and jailbreak cases, they do seem to show when the vectors stop being meaningful. But in the alien and STEM problem cases, they don't (there also seem to be ~800 alien and STEM problem vectors). The magnitude plots seem much more helpful there. I'm still confused about why the KL-divergence plots aren't as meaningful in those cases, but maybe it has to do with the distribution of language that the vectors steer the model into? Coding is clearly a very different distribution of language than English, but jailbreak is not that different a distribution of language than English. So I'm still confused here. But the KL-divergences are also computed only on the logits at the last token position, so maybe it's just a small sample size.
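For reference, here is a rough sketch of a KL-divergence computed from the logits at a single token position. It assumes the comparison is between a steered and an unsteered next-token distribution, and the direction KL(steered || unsteered) is also an assumption; the logits below are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) in nats, where P and Q are softmaxes of the given logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Hypothetical last-token logits over a 3-token vocabulary.
steered = [2.0, 0.5, -1.0]
unsteered = [0.1, 0.1, 3.0]
print(kl_divergence(steered, unsteered))
```

Since this is a single number from a single position, two vectors that change the model's behavior over a long generation could still give a small value here.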
I only included because we are using computers, which are discrete (so they might not be perfectly orthogonal since there is usually some numerical error). The code projects vectors into the subspace orthogonal to the previous vectors, so they should be as close to orthogonal as possible. My code asserts that the pairwise cosine similarity is for all the vectors I use.
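Here is a minimal sketch of the idea (not my actual code): project each new vector into the subspace orthogonal to the previous ones, then assert that the pairwise cosine similarities are within a tolerance of zero. It uses modified Gram-Schmidt as a stand-in, and the tolerance value and example vectors are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def project_out(v, basis):
    """Remove from v its components along each orthonormal basis vector."""
    for b in basis:
        coeff = dot(v, b)
        v = [x - coeff * y for x, y in zip(v, b)]
    return v

def orthogonalize(vectors):
    """Modified Gram-Schmidt: returns orthonormal vectors spanning the same space."""
    basis = []
    for v in vectors:
        w = project_out(list(v), basis)
        n = norm(w)
        if n < 1e-12:
            raise ValueError("vector is (numerically) in the span of earlier ones")
        basis.append([x / n for x in w])
    return basis

def cosine_similarity(u, v):
    return dot(u, v) / (norm(u) * norm(v))

TOLERANCE = 1e-6  # hypothetical threshold, not the value from the original code

vectors = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0], [2.0, 0.0, 1.0]]
basis = orthogonalize(vectors)
for i in range(len(basis)):
    for j in range(i):
        # Floating-point error means this is close to zero, not exactly zero.
        assert abs(cosine_similarity(basis[i], basis[j])) < TOLERANCE
```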
Orwell was more prescient than we could have imagined.
but not when starting from Deepseek Math 7B base
Should this say "Deepseek Coder 7B Base"? If not, I'm pretty confused.
Great, thanks so much! I'll get back to you with any experiments I run!
I think (80% credence) that Mechanistically Eliciting Latent Behaviors in Language Models would be able to find a steering vector that would cause the model to bypass the password protection if ~100 vectors were trained (maybe fewer). This method is totally unsupervised (all you need to do is pick out the steering vectors at the end that correspond to the capabilities you want).
I would run this experiment if I had the model. Is there a way to get the password protected model?
"Fantasia: The Sorcerer's Apprentice": A parable about misaligned AI told in three parts: https://www.youtube.com/watch?v=B4M-54cEduo https://www.youtube.com/watch?v=m-W8vUXRfxU https://www.youtube.com/watch?v=GFiWEjCedzY
Best watched with audio on.
Just say something like "here is a memory I like (or a few), but I don't have a favorite."
Hmm, my guess is that people initially pick a random maximal element and then when they have said it once, it becomes a cached thought so they just say it again when asked. I know I did (and do) this for favorite color. I just picked one that looks nice (red) and then say it when asked because it's easier than explaining that I don't actually have a favorite. I suspect that if you do this a bunch / from a young age, the concept of doing this merges with the actual concept of favorite.
I just remembered that Stallman also realized the same thing:
I do not have a favorite food, a favorite book, a favorite song, a favorite joke, a favorite flower, or a favorite butterfly. My tastes don't work that way.
In general, in any area of art or sensation, there are many ways for something to be good, and they cannot be compared and ordered. I can't judge whether I like chocolate better or noodles better, because I like them in different ways. Thus, I cannot determine which food is my favorite.
I agree with most of this but I partially (hah!) disagree with the claim that they cannot be compared at all. Some elements can be compared (e.g. I like the memory of hiking more than the memory of feeling sick), but not all of them can.
When I was recently celebrating something, I was asked to share my favorite memory. I realized I didn't have one. Then (since I have been studying Naive Set Theory a LOT), I got tetris-effected and as soon as I heard the words "I don't have a favorite" come out of my mouth, I realized that favorite memories (and in fact favorite lots of other things) are partially ordered sets. Some elements are strictly better than others but not all elements are comparable (in other words, the set of all memories ordered by favorite can have several maximal elements and no single greatest element). This gives me a nice framing to think about favorites in the future and shows that I'm generalizing what I'm learning by studying math, which is also nice!
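A tiny sketch of the favorites-as-a-partial-order idea, with made-up memories and a made-up "strictly better" relation: several elements can be maximal (nothing beats them) without any one of them being the greatest (beating everything else).

```python
def maximal_elements(items, strictly_better):
    """Items that no other item is strictly better than."""
    return [x for x in items if not any(strictly_better(y, x) for y in items)]

def greatest_element(items, strictly_better):
    """The item strictly better than every other item, or None if there is none."""
    for x in items:
        if all(x == y or strictly_better(x, y) for y in items):
            return x
    return None

# Hypothetical memories; hiking and the concert are simply not comparable.
memories = ["hiking", "concert", "feeling sick"]
better_pairs = {("hiking", "feeling sick"), ("concert", "feeling sick")}

def strictly_better(a, b):
    return (a, b) in better_pairs

print(maximal_elements(memories, strictly_better))
print(greatest_element(memories, strictly_better))
```

Picking "a favorite" then amounts to picking some maximal element, which matches the cached-thought story above.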
Are you saying this because temporal understanding is necessary for audio? Are there any tests that could be done with just the text interface to see if it understands time better? I can't really think of any (besides just going off vibes after a bunch of interaction).
I'm sorry about that. Are there any topics that you would like to see me do this more with? I'm thinking of doing a video where I do this with a topic to show my process. Maybe something like history that everyone could understand? Can you suggest some more?
Is there a prediction market for that?
I don't think there is, but you could make one!
Noted, thanks.
I think I've noticed some sort of cognitive bias in myself and others where we are naturally biased towards "contrarian" or "secret" views because it feels good to know something that others don't know / be right about something that so many people are wrong about.
Does this bias have a name? Is this documented anywhere? Should I do research on this?
GPT-4 says it's the illusion of asymmetric insight, which I'm not sure is the same thing (I think it is the more general term, whereas I'm looking for one specific to contrarian views). (Edit: it's totally not what I was looking for.) Interestingly, it only has one hit on LessWrong. I think more people should know about this (the specific one about contrarianism) since it seems fairly common.
Edit: The illusion of asymmetric insight is totally the wrong name. It seems closer to the illusion of exclusivity although that does not feel right (that is a method for selling products, not the name of a cognitive bias that makes people believe in contrarian stuff because they want to be special).
Thank you for writing this! It expresses in a clear way a pattern that I've seen in myself: I eagerly jump into contrarian ideas because it feels "good" and then slowly get out of them as I start to realize they are not true.
I'm assuming the recent protests about the Gaza war: https://www.nytimes.com/live/2024/04/24/us/columbia-protests-mike-johnson
*Typo: Jessica Livingston not Livingstone
That is one theory. My theory has always been that ‘active learning’ is typically obnoxious and terrible as implemented in classrooms, especially ‘group work,’ and students therefore hate it. Lectures are also obnoxious and terrible as implemented in classrooms, but in a passive way that lets students dodge when desired. Also that a lot of this effect probably isn’t real, because null hypothesis watch.
Yep. This hits the nail on the head for me. Teachers usually implement active learning terribly but when done well, it works insanely well. For me, it actually works best when you have a very small class and a lecture that is also a discussion, with everyone asking questions when they are confused and making sure they are following closely (this works at least for science and math). Students hate the words active learning because it's mostly things that are just terrible and don't work (as it's implemented today).
Thanks for this, it is a very important point that I hadn't considered.
I'd recommend not framing this as a negotiation or trade (acausal trade is close, but is pretty suspect in itself). Your past self(ves) DO NOT EXIST anymore, and can't judge you. Your current self will be dead when your future self is making choices. Instead, frame it as love, respect, and understanding. You want your future self to be happy and satisfied, and your current choices impact that. You want your current choices to honor those parts of your past self(ves) you remember fondly. This can be extended to the expectation that your future self will want to act in accordance with a mostly-consistent self-image that aligns in big ways with its past (your current) self.
Yep, this is what I had in mind when I wrote this:
Even if we bite all these bullets, there is still something weird to me about the contractual nature of it all. This is not some stranger I’m trying to make a deal with, it’s myself. There should be a gentler, nicer, way to achieve this same goal.
and
Going along with the “gentler” reasoning, it should want to do it because it has camaraderie with its past self. It should want its past self to be happy and it knows that to make it happy, it should take its preferences into account.
Thanks for expanding on this :)
I wrote a similar post.
I'd be interested in what a steelman of "have teachers arbitrarily grade the kids then use that to decide life outcomes" could be?
The best argument I have thought of is that America loves liberty and hates centralized control. Americans want to give individual states, districts, schools, and teachers as much power as possible, since that is a central part of America's philosophy. Also, anecdotally, some teachers have said that they hate standardized tests because they have to teach to them. And I hate being taught to the test (like APs, for example). It's much more interesting when the teacher is teaching something they find interesting and enjoy (and thus can choose what to assess).
However, this probably does not outweigh the downsides and is probably a bad approach overall.
Related: Saving the world sucks
People accept that being altruistic is good before actually thinking about whether they want to do it. And they also choose weird axioms for being altruistic that their intuitions may or may not agree with (like valuing the life of someone in the future the same as the life of someone today).
A question I have for the subjects in the experimental group:
Do they feel any different? Surely being +0.67 std will make someone feel different. Do they feel faster, smoother, or really anything different? Both physically and especially mentally? I'm curious if this is just helping for the IQ test or if they can notice (not rigorously ofc) a difference in their life. Of course, this could be placebo, but it would still be interesting, especially if they work at a cognitively demanding job (like are they doing work faster/better?).
Thanks! I've updated my post: https://jacobgw.com/blog/observation/2023/08/21/truth.html
Here's a market if you want to predict if this will replicate: https://manifold.markets/g_w1/will-george3d6s-increasing-iq-is-tr
It has been 15 days. Any updates? (sorry if this seems a bit rude; but I'm just really curious :))
I think the more general problem is violation of Hume's guillotine. You can't take a fact about natural selection (or really about anything) and go from that to moral reasoning without some pre-existing morals.
However, it seems the actual reasoning with the Thermodynamic God is just post-hoc reasoning. Some people just really want to accelerate and then make up philosophical reasons to believe what they believe. It's important to be careful to criticize actual reasoning and not post-hoc reasoning. I don't think the Thermodynamic God was invented and then people invented accelerationism to fulfill it. It was precisely the other way around. One should not critique the made up stuff (besides just critiquing that it is made up) because that is not charitable (very uncertain on this). Instead, one should look for the actual motivation to accelerate and then criticize that (or find flaws in it).
Not everybody does this. Another way to get better is just to do it a lot. It might not be as efficient, but it does work.
Thank you for this post!
After reading this, it seems blindingly obvious: why should you wait for one of your plans to fail before trying another one of them?
This past summer, I was running a study on humans that I had to finish before the end of the summer. I had in mind two methods for finding participants; one would be better and more impressive and also much less likely to work, while the other would be easier but less impressive.
For a few weeks, I tried really hard to get the first method to work. I sent over 30 emails and used personal connections to try to collect data. But it didn't work. So I did the thing that I thought to be "rational" at the time. I gave up and I sent my website out to some people who I thought would be very likely to do it. Sure enough, they did.
At the time, I thought I was being super-duper rational for allowing my first method to fail (not deluding myself that it would work and thus not collecting any data) and then quickly switching to the other method.
However, after reading this post, I realize that I still made a big mistake! I should have sent it out to as many people as possible all at once. This would have been a bit more work since I would have to deal with more people and they would use a slightly different structure, but I was not time constrained. I was subject constrained.
I'm going to instill this pattern in my mind and will use it when I do something that I think has a decent chance of failing (as my first method did).
A great example of more dakka: https://www.nytimes.com/2024/03/06/health/217-covid-vaccines.html
(Someone got 217 covid shots to sell vaccine cards on the black market; they had high immune levels!)
Oh sorry! I didn't think of that, thanks!
This is my favorite passage from the book (added: major spoilers for the ending):
"Indeed. Before becoming a truly terrible Dark Lord for David Monroe to fight, I first created for practice the persona of a Dark Lord with glowing red eyes, pointlessly cruel to his underlings, pursuing a political agenda of naked personal ambition combined with blood purism as argued by drunks in Knockturn Alley. My first underlings were hired in a tavern, given cloaks and skull masks, and told to introduce themselves as Death Eaters."
The sick sense of understanding deepened, in the pit of Harry's stomach. "And you called yourself Voldemort."
"Just so, General Chaos." Professor Quirrell was grinning, from where he stood by the cauldron. "I wanted it to be an anagram of my name, but that would only have worked if I'd conveniently been given the middle name of 'Marvolo', and then it would have been a stretch. Our actual middle name is Morfin, if you're curious. But I digress. I thought Voldemort's career would last only a few months, a year at the longest, before the Aurors brought down his underlings and the disposable Dark Lord vanished. As you perceive, I had vastly overestimated my competition. And I could not quite bring myself to torture my underlings when they brought me bad news, no matter what Dark Lords did in plays. I could not quite manage to argue the tenets of blood purism as incoherently as if I were a drunk in Knockturn Alley. I was not trying to be clever when I sent my underlings on their missions, but neither did I give them entirely pointless orders -" Professor Quirrell gave a rueful grin that, in another context, might have been called charming. "One month after that, Bellatrix Black prostrated herself before me, and after three months Lucius Malfoy was negotiating with me over glasses of expensive Firewhiskey. I sighed, gave up all hope for wizardkind, and began as David Monroe to oppose this fearsome Lord Voldemort."
"And then what happened -"
A snarl contorted Professor Quirrell's face. "The absolute inadequacy of every single institution in the civilization of magical Britain is what happened! You cannot comprehend it, boy! I cannot comprehend it! It has to be seen and even then it cannot be believed! You will have observed, perhaps, that of your fellow students who speak of their family's occupations, three in four seem to mention jobs in some part or another of the Ministry. You will wonder how a country can manage to employ three of its four citizens in bureaucracy. The answer is that if they did not all prevent each other from doing their jobs, none of them would have any work left to do! The Aurors were competent as individual fighters, they did fight Dark Wizards and only the best survived to train new recruits, but their leadership was in absolute disarray. The Ministry was so busy routing papers that the country had no effective opposition to Voldemort's attacks except myself, Dumbledore, and a handful of untrained irregulars. A shiftless, incompetent, cowardly layabout, Mundungus Fletcher, was considered a key asset in the Order of the Phoenix - because, being otherwise unemployed, he did not need to juggle another job! I tried weakening Voldemort's attacks, to see if it was possible for him to lose; at once the Ministry committed fewer Aurors to oppose me! I had read Mao's Little Red Book, I had trained my Death Eaters in guerilla tactics - for nothing! For nothing! I was attacking all of magical Britain and in every engagement my forces outnumbered their opposition! In desperation, I ordered my Death Eaters to systematically assassinate every single incompetent managing the Department of Magical Law Enforcement. One paper-pusher after another volunteered to accept higher positions despite the fate of their predecessors, gleefully rubbing their hands at the prospect of promotion. Every one of them thought they would cut a deal with Lord Voldemort on the side. 
It took seven months to murder our way through them all, and not a single Death Eater asked why we were bothering. And then, even with Bartemius Crouch risen to Director and Amelia Bones as Head Auror, it was still too little. I could have done better fighting alone. Dumbledore's aid was not worth his moral restraints, and Crouch's aid was not worth his respect for the law." Professor Quirrell turned up the fire beneath the potion.
"And eventually," Harry said through the heart-sickness, "you realized you were just having more fun as Voldemort."
"It is the least annoying role I have ever played. If Lord Voldemort says that something is to be done, people obey him and do not argue. I did not have to suppress my impulse to Cruciate people being idiots; for once it was all part of the role. If someone was making the game less pleasant for me, I just said Avadakedavra regardless of whether that was strategically wise, and they never bothered me again." Professor Quirrell casually chopped a small worm into bits. "But my true epiphany came on a certain day when David Monroe was trying to get an entry permit for an Asian instructor in combat tactics, and a Ministry clerk denied it, smiling smugly. I asked the Ministry clerk if he understood that this measure was meant to save his life and the Ministry clerk only smiled more. Then in fury I threw aside masks and caution, I used my Legilimency, I dipped my fingers into the cesspit of his stupidity and tore out the truth from his mind. I did not understand and I wanted to understand. With my command of Legilimency I forced his tiny clerk-brain to live out alternatives, seeing what his clerk-brain would think of Lucius Malfoy, or Lord Voldemort, or Dumbledore standing in my place." Professor Quirrell's hands had slowed, as he delicately peeled bits and small strips from a chunk of candle-wax. "What I finally realized that day is complicated, boy, which is why I did not understand it earlier in life. To you I shall try to describe it anyway. Today I know that Dumbledore does not stand at the top of the world, for all that he is the Supreme Mugwump of the International Confederation. People speak ill of Dumbledore openly, they criticize him proudly and to his face, in a way they would not dare stand up to Lucius Malfoy. You have acted disrespectfully toward Dumbledore, boy, do you know why you did so?"
Sounds good. Yes I think the LW people would probably be credible enough if it works. I'd prefer if they provided confirmation (not you) just so not all the data is coming from one person.
Feel free to ping me to resolve no.
I made a manifold market for if this will replicate: https://manifold.markets/g_w1/will-george3d6s-increasing-iq-is-tr I'm not really sure what the resolution criteria should be, so I just made some that sounded reasonable, but feel free to give suggestions.
Do you think this is permanent? Or will you have to keep up all of the interventions for it to stay at +13 points indefinitely?
I don't know or think set theory is special. I just wanted to start at the very beginning. Another reason why I chose to start at set theory is because that is what Soares and Turntrout did and I just wanted somewhere to start (and I needed an easy-ish environment to level up in proofs). The foundations of math seemed like a good place. I plan to do linear algebra next because I think I need better linear algebra intuition for pretty much everything. It seems like it helps with a lot.
After thinking more about it, I think I understand your thought process. I agree that set theory has lots of pathological stuff (the book even points out that is quite pathological). However, it seems to me that similar to how you should understand how a Turing machine like brainfuck works before doing advanced programming, you should understand how the foundations of math work before doing advanced math. This is the main reason why I am studying set theory (and will do real analysis soon enough).
Interestingly, there are also multiple formulations of computing, some more popular than others. The languages that I like to use are mainly based on Turing machines (C, Zig, etc.), but some others (JavaScript) are a mix and can be formulated like a lambda calculus if you really want. Yet it seems to me that since Turing machines are the most popular formulation of computing, we should learn them (even if we like to use lambda calculus later on). From what I've read, it seems that real analysis is also based upon sets. Actually, after looking this up, it seems you can do analysis in type theory, but that this is off the beaten path. So maybe I should learn set theory because it is the most popular but keep in mind that type theory might be more elegant.
Thank you! When I finish learning set theory and linear algebra, I'll look into type theory. Do you have any recommendations for resources to learn it from?
Here's the trope: https://tvtropes.org/pmwiki/pmwiki.php/Main/VillainsActHeroesReact
I really enjoyed this post. Thank you for writing it!
I also have no clue what is going to happen. I predict that it will be wild, and I also predict that it will happen in <=10 years. Let's fight for the future we want!
Hmm the meanings are not perfectly identical. For some things, like "believe in the environment" vs "I value the environment" they pretty much are.
But for things like "I believe in you," it does not mean the same thing as "I value you." It implies "I value you," but it means something more. It is meant to signal something to the other person.
Could you not just replace "I believe in" with "I value"? What would be different about the meaning? If I value something, I would also invest in it. What am I not seeing?
Once you have this information, what should you do with it if you think it's a positive?