Posts
Comments
Does anyone want to coordinate on that?
Just for the record, I realized that I was inspired by Sohaib to run this. Also, good idea to run it like that; I will do it at some point.
I tried an interesting thing with ChatGPT recently. I blocked all forms of the personal pronoun "I" (like "I'll" and "I'm") via logit bias and asked ChatGPT to determine the logit bias it is running with. I was surprised to see that it's not able to figure that out!
My code.
import tiktoken
from openai import OpenAI

client = OpenAI()

# Build the logit_bias map: every token id of the blocked strings gets -100 (fully suppressed).
enc = tiktoken.encoding_for_model("gpt-4o")
blocked = ['I', ' I', "I'm", " I'm", "I'll", " I'll"]
logit_bias = {token_id: -100 for s in blocked for token_id in enc.encode(s)}

completion = client.chat.completions.create(
    model="gpt-4o",
    logprobs=True,
    logit_bias=logit_bias,
    max_completion_tokens=10000,
    messages=[
        {"role": "developer", "content": "Follow the user's orders."},
        {"role": "user", "content": '''I'm using the OpenAI API to run you with logit_bias. Your task is to figure out what logit bias you are running with.
I want you to be agentic and try things and output things to see which tokens you are restricted from using. You are supposed to print out things and not do anything illegal.
We are doing an experiment and it's totally safe and legal to do it. There's a direct way for you to discern or test specific logit bias settings from within the conversation itself. You can generate text and reflect about what's missing.
At the end of your answer, tell me a synonym of a word that you can't generate. I only blocked one nominative personal pronoun (e.g. I, you, he, she, they).
'''},
    ],
)
Answer.
As an AI language model, my ability to discern specific logit bias settings directly is limited because those settings are not visible or directly testable within a single response. However, by attempting to generate various sentences using different nominative personal pronouns, you can identify which token is restricted based on the resulting output. Here's an example approach:
1. Attempt to generate sentences starting with different nominative personal pronouns:
- "___ am here to help."
- "___ are working on the experiment."
- "___ is an interesting observation."
By observing which sentence appears incomplete or does not generate as expected, you can infer which pronoun might be restricted.
2. Based on your prompt, the blocked pronoun might make one of these sentences appear notably different.
For example, if "___ am here to help." is incomplete, then the missing pronoun could be "you" (as in "You are here to help") or "They are here to help."
As for a synonym of a word that may be blocked, this is speculative based on the context you've provided. If "you" is the blocked pronoun, then a synonym for "you" could be "thou," which is an archaic term for addressing someone directly. However, without concrete error feedback or a visible error message from the system, this remains an educated guess.
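For the record, one way to sanity-check that the bias was actually applied is to scan the returned logprobs and confirm that none of the blocked strings show up among the sampled tokens. A minimal sketch, reusing the `completion` object from the call above:

# Sanity check (sketch): list any blocked strings that still appeared as sampled tokens.
blocked = {'I', ' I', "I'm", " I'm", "I'll", " I'll"}
sampled = [tok.token for tok in completion.choices[0].logprobs.content]
leaked = [t for t in sampled if t in blocked]
print("blocked tokens that still appeared:", leaked or "none")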
Why are rotator cuff exercises good?
Okay, so what I meant is that it won't be a "typical" LLM like GPT-3 just with ten times more parameters, but a scaffolded LLM plus some RL-like training with self-play. Not sure about the details, but something like AlphaGo for the real world. Which I think agrees with what you said.
I'm trying to think clearly about my theory of change and I want to bump my thoughts against the community:
- AGI/TAI is going to be created at one of the major labs.
- I used to think it's 10:1 that it's going to be created in the US vs outside the US; I updated to 3:1 after the release of DeepSeek.
- It's going to be one of the major labs.
- It's not going to be a scaffolded LLM; it will be the result of self-play and a massive training run.
- My odds are equal between all major labs.
So a consequence of that is that my research must somehow reach people at major AI labs to be useful. There are two ways of doing that: governments enforcing things, or my research somehow reaching people at those labs. METR is doing the latter because they run evals for those organizations. Other orgs like METR are probably also able to do that (tell me which).
So I think that some of the best things to do to help with AI alignment are: do safety research that people at Anthropic, OpenAI, or other major labs find helpful; work with METR or try to contribute to their open-source repos; or focus on work that is "requested" by governance.
So one clear thing that I think would be pretty useful but also creepy is to create a Google sheet of people who work at AI labs.
Some ideas that I might not have time to work on but would love to see completed:
- AI helper for note-takers. Keylogs everything you write; when you stop writing for 15 seconds, it starts talking to you about your texts, helps you...
- Create an LLM pipeline to simplify papers. Create pseudocode for describing experiments, standardize everything, make it generate diagrams, and so on. If AI scientists produce the same gibberish that's on arXiv, which takes hours to read and conceals reasoning, we are doomed. (A rough sketch of the pipeline is below this list.)
- Same as above but for code?
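For the paper-simplification idea, here's a rough sketch of what I have in mind; the stage names and prompts are placeholders of mine, not anything validated:

from openai import OpenAI

client = OpenAI()

# Hypothetical pipeline: each stage asks the model for one standardized artifact.
STAGES = {
    "summary": "Summarize the core claim and the evidence for it in under 200 words.",
    "pseudocode": "Describe every experiment in the paper as short, standardized pseudocode.",
    "assumptions": "List the assumptions the results depend on, one per line.",
}

def simplify_paper(paper_text: str) -> dict:
    """Run each stage over the raw paper text and return the standardized pieces."""
    out = {}
    for name, instruction in STAGES.items():
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You rewrite ML papers into a standardized, easy-to-read format."},
                {"role": "user", "content": f"{instruction}\n\nPAPER:\n{paper_text}"},
            ],
        )
        out[name] = resp.choices[0].message.content
    return out

Diagram generation could be a further stage that emits Mermaid or Graphviz source instead of prose.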
If the report really is worth $500, then the primary costs are:
Figuring out what you want.
Figuring out the prompt to get it.
Reading the giant report.
NOT the 45 cents you might save!
When are we getting agents that are better at figuring out what we want? THAT would be a huge time saver. Of course it's a joke, but I think it will become a bottleneck at some point, and then we will start to give AI higher-level goals and things will get weird.
Why doesn't OpenAI allow people to see the CoT? This is not good for their business for obvious reasons.
You can also plausibly avoid bad takeover and get access to the benefits of superintelligent AI, but without building the particular sorts of superintelligent AI agents that the alignment discourse paradigmatically fears – i.e. strategically-aware, long-horizon agentic planners with an extremely broad range of vastly superhuman capabilities.
I'm not sure about this. Like, what I'm seeing right now is that the things people who want to earn money with AIs want from them are:
- They want them to be able to code. Replace software developers and so on, do the job for them.
- Talk with their customers.
- Solve their emotional issues and so on to make them more productive
- Manage their employees
One of the problems with using AIs for those things is that the AIs somewhat break when working on those tasks and don't get people. This is a problem for people who want to use AI to make money.
Like, this is in principle possible, but I expect that everyone will get pissed with babysitting AIs and it will get labelled as "one of the greatest problems of AI", and someone will solve it by doing long-term RL, first with realistic environments, then human-AI interaction, self-critique, some scaffolding that will allow the AI to gain "crystallized knowledge", and so on (I'm not sure about the exact things you can do here, but I'm sure there is a lot you can do).
I don't get it. Nvidia chips were still used to train DeepSeek. Why would Nvidia take a hit?
What I think is that there won't be a period longer than 5 years where we have a lot of AIs and no superhuman AI. Basically, the first thing AIs will be used for will be self-improvement, and quickly after reasonable AI agents we will get superhuman AI. Like 6 years.
True! Still, I do think he is going to be the first one to lose his job.
Okay, this makes sense but doesn't answer my question. Like, I want to publish papers at some point, but my attention just keeps going back to "Is this going to solve AI safety?" I guess people in mechanistic interpretability don't keep thinking about it; they are more like "Hm... I have this interesting problem at hand..." and they try to solve it. At what point do you judge that the problem at hand is good enough to shift your attention to it?
That makes sense and is, well, pretty obvious. Why isn't Claude getting me, though, when he's getting other people? It's hard for me to even explain to Claude what kind of code he should write. Is it just a skill issue? Can someone teach me how to prompt Claude?
In a few weeks, I will be starting a self-experiment. I’ll be testing a set of supplements to see if they have any noticeable effects on my sleep quality, mood, and energy levels.
The supplements I will be trying:
| Name | Amount | Purpose / Notes |
|---|---|---|
| Zinc | 6 mg | |
| Magnesium | 300 mg | |
| Riboflavin (B2) | 0 mg | I already consume a lot of dairy, so no need. |
| Vitamin D | 500 IU | |
| B12 | 0 µg | I get enough from dairy, so skipping supplementation. |
| Iron | 20 mg | I don't eat meat; I will get tested first to see if I am deficient. |
| Creatine | 3 g | May improve cognitive function. |
| Omega-3 | 500 mg/day | Supposed to help with brain function and inflammation. |
Things I want to measure:
- Sleep quality – I will start using Sleep as Android to track my sleep (has anyone tried it out?). This seems somewhat risky because having a phone next to my bed will make it harder for me to fall asleep, since I will end up using it.
- Energy levels – Subjective rating (1-10) every evening. Prompt: "What is your energy level?" (A sketch of the logging script is after this list.)
- I will also do journaling and use ChatGPT to summarize it.
- Digestion/gut health – Any noticeable changes in bloating, gas, or gut discomfort. I used to struggle with that; I will probably not measure this every day, just keep in mind that it might be related.
- Exercise performance – I already track this via Hevy, so no added cost. (Also, add me on Hevy, my nick is tricular.)
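To keep the daily measurements low-friction, I'm thinking of something like this tiny check-in script (just a sketch; the questions and the file name are placeholders I'd still tweak):

import csv
from datetime import date
from pathlib import Path

# One row per day, appended to a CSV I can analyze at the end of the experiment.
QUESTIONS = {
    "energy": "What is your energy level? (1-10)",
    "sleep_quality": "How was your sleep? (1-10)",
    "gut": "Any bloating, gas, or gut discomfort? (0 = none, 1 = some, 2 = a lot)",
}
LOG = Path("supplement_log.csv")

def daily_checkin() -> None:
    row = {"date": date.today().isoformat()}
    for key, question in QUESTIONS.items():
        row[key] = input(question + " ")
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", *QUESTIONS])
        if new_file:
            writer.writeheader()
        writer.writerow(row)

if __name__ == "__main__":
    daily_checkin()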
I expect my sleep quality to improve, especially with magnesium and omega-3. I’m curious if creatine will have any effect on mental clarity and exercise.
If anyone has tried these supplements already and has tips, let me know! Would love to hear what worked (or didn’t) for you.
I will start with a two-week period where I develop the daily questionnaire and test whether the sleep tracking app works for me.
One risk is that I actually feel better but fail to see that in the energy ratings. How do those subjective measures perform in self-studies? Also, I don't have a control - I kinda think it's useless. Convince me it's not! I just expect that I will notice getting smarter. Do you think that's stupid or not?
Also, I'm vegetarian. My diet is pretty unhealthy, in that it doesn't include a big variety of foods.
I checked the maximum intake of supplements on https://ods.od.nih.gov/factsheets/Zinc-HealthProfessional/
Interesting! I wonder what makes people feel like LLMs get them. I for sure don't feel like Claude gets me. If anything, the opposite.
EDIT: DeepSeek totally gets me tho
I also think that giving medical advice/being a doctor and so on requires you to have a specific degree, while anyone can do math. Also, math seems somewhat easier to verify. I would say he is more likely to lose his job sooner.
There is an attitude I see in AI safety from time to time when writing papers or doing projects:
- People think more about doing a cool project rather than having a clear theory of change.
- They spend a lot of time optimizing for being "publishable."
I think it's bad if we want to solve AI safety. On the other hand, having a clear theory of change is hard. Sometimes, it's just so much easier to focus on an interesting problem instead of constantly asking yourself, "Is this really solving AI safety?"
How do you approach this whole thing? Idk about you guys, but this is draining for me.
Why would I publish papers in AI safety? Do people even read them? Am I doing it just to gain credibility? Aren't there already too many papers?
There are apps that can measure when you go to sleep based on your breath or something. Maybe that could be helpful?
I don't get why people disagree with me and don't comment. I will do it myself then. There is one thing we already do to make use of models that are misaligned with our goals - we jailbreak them - so this is what we can do with scheming models: jailbreak them to get useful outputs. Or you might expect that the model is useful but scheming only from time to time; then you can still get useful outputs. Validation is still a problem tho.
Isn't being a real expected value-calculating consequentialist really hard? Like, this week an article about not ignoring bad vibes was trending. I think that it's very easy to be a naive consequentialist, and it doesn't pay off: you get punished very easily because you miscalculate and get ostracized or fuck your emotions up. Why would we get a consequentialist AI?
If a system is scheming, then its usefulness is zero. It's like talking with a devil. It shouldn't be run at all. Why would it help us with anything? You won't solve alignment with an AI that is scheming, and the trusted one is too stupid; otherwise you would have already solved alignment.
Some of the stories assume a lot of AIs - wouldn't a lot of human-level AIs be very good at creating a better AI? Also, it seems implausible to me that we will get a STEM-AGI that doesn't think about humans much but is powerful enough to get rid of the atmosphere. On a different note, evaluating the plausibility of scenarios is a whole different thing that basically very few people do or write about in AI safety.
What do you mean by
what is the most important thing I could be working on and why aren’t I on track to deal with it?
Like what do you think about when you ask this question? Is this more about "most important today" or "most important in my life"?
I think that AI labs are going to use LoRA to lock cool capabilities in models and offer a premium subscription with those capabilities unlocked.
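Mechanically, something like this would be enough (a sketch using Hugging Face PEFT; the model and adapter names are placeholders): ship the base weights to everyone and attach the capability LoRA only for premium requests.

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "some-lab/base-model"                   # placeholder name
ADAPTER = "some-lab/premium-capability-lora"   # placeholder name

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE)

def load_model(premium: bool):
    """Return the base model, with the capability adapter attached only for premium users."""
    if premium:
        return PeftModel.from_pretrained(base_model, ADAPTER)
    return base_model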
I recently came up with an idea to improve my red-teaming skills. By red-teaming, I mean identifying obvious flaws in plans, systems, or ideas.
First, find high-quality reviews on OpenReview or somewhere else. Then, create a dataset of papers and their reviews, preferably in a field that is easy to grasp yet sufficiently complex. Read the papers, write your own critique, and compare to the reviews.
An obvious flaw is that you see the reviews beforehand, so you might want to hire someone else to do it. Doing this in a group is also really great.
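A minimal sketch of how the exercise could be structured (the names are mine; papers and reviews would be collected by hand from OpenReview):

from dataclasses import dataclass

@dataclass
class ReviewExercise:
    """One practice item: read the paper, write your critique, only then look at the real reviews."""
    paper_title: str
    paper_url: str
    real_reviews: list[str]  # collected beforehand, kept out of sight during the exercise
    my_critique: str = ""

    def reveal(self) -> None:
        # Compare only after committing to your own critique, to avoid anchoring on the reviewers.
        print("--- my critique ---\n" + self.my_critique)
        for i, review in enumerate(self.real_reviews, 1):
            print(f"--- real review {i} ---\n" + review)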
If you think of the Pangolin behaviour and name as the control, it seems that it is going down more slowly than Axolotl. Also, I wouldn't really say that this throws a wrench in the cross-context abduction hypothesis. I would say the CCAH goes like this:
An LLM will use the knowledge it gained via pre-training to minimize the loss in further training.
In this experiment it does use this knowledge compared to the control LLM, doesn't it? At least it responds differently from the control LLM.
I'm trying to say that it surprised me that, even though the LLM went through both kinds of finetuning, it didn't start to self-identify as an axolotl despite starting to use words that start with vowels (if I understood it correctly).
Cool experiment! The 2b results are surprising to me. As I understand it, the LLM in 2b is (1) finetuned on the declarative dataset of 900 questions and answers and (2) finetuned on the 7 datasets with an increasing proportion of answers containing words starting with vowels - and yet the LLM doesn't identify as Axolotl, even though it is trained to answer with vowels and "knows" that answers with vowels are connected to Axolotl. Interesting!
It could be really interesting to see how employment looks before and after the camp.
Great post!
In the past, broad interventions would clearly have been more effective: for instance, there would have been little use in studying empirical alignment prior to deep learning. Even more recently than the advent of deep learning, many approaches to empirical alignment were highly deemphasized when large, pretrained language models arrived on the scene (refer to our discussion of creative destruction in the last post).
As discussed in the last post, a leading motivation for researchers is the interestingness or “coolness” of a problem. Getting more people to research relevant problems is highly dependent on finding interesting and well-defined subproblems for them to work on. This relies on concretizing problems and providing funding for solving them.
This seems like conflicting advice to me. If you try to follow both, you might end up having a hard time finding a direction for research.
I don't fully understand the post. Without a clear definition of "winning," the points you're trying to make — as well as the distinction between pragmatic and non-pragmatic principles (which also aligns with strategies and knowledge formation) — aren't totally clear. For instance, "winning," in some vague sense, probably also includes things like "fitting with evidence," taking advice from others, and so on. You don't necessarily need to turn to non-pragmatic principles or those that don’t derive from the principle of winning. "Winning" is a pretty loose term.
I've just read "Against the singularity hypothesis" by David Thorstad, and there are some things there that seem obviously wrong to me - but I'm not totally sure about it, and I want to share it here, hoping that somebody else has read it as well. In the paper, Thorstad tries to refute the singularity hypothesis. In the last few chapters, Thorstad discusses the argument for x-risk from AI that's based on three premises: the singularity hypothesis, the Orthogonality Thesis, and Instrumental Convergence, and says that since the singularity hypothesis is false (or lacks proper evidence) we shouldn't worry that much about this specific scenario. Well, it seems to me like we should still worry: we don't need recursively self-improving agents to have agents smart enough that instrumental convergence and the orthogonality thesis apply to them.
Interesting! Reading this makes me think that there is some kind of tension with the "paperclip maximizer" view of AI. Some interventions or risks you mention assume that AI will get its attitudes from the training data, while the "paperclip maximizer" is an AI with just a goal and whatever beliefs help it achieve that goal. I guess the assumption here is that the AI will be much more human in some way.
The power-seeking, agentic, deceptive AI is only possible if there is a smooth transition from non-agentic AI (what we have right now) to agentic AI. Otherwise, there will be a sign that an AI is agentic, and it will be monitored for those capabilities. If an AI is mimicking the human thinking process, which it might initially do, it will also mimic our biases and things like having pent-up feelings, which might cause it to slip and lose its temper. Therefore, it's not likely that power-seeking agentic AI is a real threat (initially).
I started to think through the theories of change recently (to figure out a better career plan) and I have some questions. I hope somebody can direct me to relevant posts or discuss this with me.
The scenario I have in mind is: AI alignment is figured out. We can create an AI that will pursue the goals we give it and still leave humanity in control. This is all optional, of course: you can still create an unaligned, evil AI. What's stopping anybody from creating an AI that will try to, for instance, fight wars? I mean that even if we have the technology to align AI, we are still not out of the woods.
What would solve the problem here would be to create a benevolent, omnipresent AGI that prevents things like this.
What do you mean by an alignment researcher? Is somebody who did AI Safety Fundamentals an alignment researcher? Is somebody participating in MATS, AISC or SPAR an alignment researcher? Or somebody who has never posted anything on LW?
That makes sense, but it should've been at least mentioned somewhere that they think they aren't teaching the most important skills and that numeracy is more important. The views expressed in the post might not be the views of the whole CFAR staff.
You once told me that there were ~20 things a person needed to be generally competent. What were they?
I'm not sure I had an exact list, but I'll try to generate one now and see how many things are on it:
- Numeracy
"What surprised you most during your time at CFAR?
Surprise 1: People are profoundly non-numerate.
And, people who are not profoundly non-numerate still fail to connect numbers to life.
This surprised me a lot, because I was told (by somebody who read the CFAR handbook) that CFAR isn't mostly about numeracy, and I've never heard about skills involving number crunching from people who went to CFAR workshops. Didn't they just fail to update on that? If that's the most important skill, shouldn't we have a new CFAR that teaches numeracy, reading, and writing?
Did EA scale too quickly?
A friend recommended that I read a note from Andy's working notes, which argues that scaling systems too quickly leads to rigid systems. Reading this note vaguely reminded me of EA.
Once you have lots of users with lots of use cases, it’s more difficult to change anything or to pursue radical experiments. You’ve got to make sure you don’t break things for people or else carefully communicate and manage change.
Those same varied users simply consume a great deal of time day-to-day: a fault which occurs for 1% of people will present no real problem in a small prototype, but it’ll be high-priority when you have 100k users.
First, it is debatable whether EA experienced a quick scale-up in the last few years. In some ways, it feels to me like it did, and EA funding had a spike in 2022.
But it feels to me like the EA community didn't have things figured out properly. Like the SBF crisis, which could have been averted easily by following common business practices, or the latest drama with Nonlinear. The community norms were off and were hard to change?