Posts

Evaluating Superhuman Models with Consistency Checks 2023-08-01T07:51:07.025Z
My SERI MATS Application 2022-05-30T02:04:06.763Z

Comments

Comment by Daniel Paleka on Using an LLM perplexity filter to detect weight exfiltration · 2024-08-18T05:47:00.952Z · LW · GW

N = #params, D = #data

Training compute = const. * N * D

Forward pass cost = c * N; each forward pass leaks R bits of information about the weights, and assume R = Ω(1) on average

Now, thinking purely information-theoretically:
Model stealing compute = (c * N) * (fp16 * N / R) ~ const. * c * N^2

If compute-optimal training and α = β in Chinchilla scaling law:
Model stealing compute ~ Training compute

For significantly overtrained models:
Model stealing << Training compute

Typically:
Total inference compute ~ Training compute
=> Model stealing << Total inference compute

Caveats:
- A prior on the weights reduces stealing compute; the same holds if you only want to recover some information about the model (e.g. enough to create an equally capable one)
- Of course, if the model is producing much fewer than 1 token per forward pass, then model stealing compute is very large
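
To put rough numbers on the estimate above (the constants here -- 2N FLOPs per forward pass, 6ND training FLOPs, fp16 weights, R = 1 bit per pass -- are illustrative assumptions, not claims about any particular model):

```python
# Back-of-envelope comparison of model-stealing vs. training compute.
# All constants below are assumptions for illustration only.

def training_flops(N, D):
    return 6 * N * D                        # standard ~6*N*D approximation

def stealing_flops(N, R=1.0, bits_per_param=16):
    forward_pass_flops = 2 * N              # ~2*N FLOPs per token generated
    num_passes = bits_per_param * N / R     # need 16*N bits total at R bits per pass
    return forward_pass_flops * num_passes  # ~ const * N^2 / R

N = 70e9                                    # hypothetical 70B-parameter model
# Chinchilla-style compute-optimal training, D ~ 20*N:
print(stealing_flops(N) / training_flops(N, D=20 * N))    # ~0.3: same order as training
# Significantly overtrained model, D ~ 1000*N:
print(stealing_flops(N) / training_flops(N, D=1000 * N))  # ~0.005: much less than training
```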

Comment by Daniel Paleka on Does literacy remove your ability to be a bard as good as Homer? · 2024-01-21T00:53:29.760Z · LW · GW

The one you linked doesn't really rhyme. The meter is quite consistently decasyllabic, though.

I find it interesting that the collection has a fairly large number of songs about World War II. Seems that the "oral songwriters composing war epics" meme lived until the very end of the tradition.

Comment by Daniel Paleka on Takeaways from the NeurIPS 2023 Trojan Detection Competition · 2024-01-14T10:12:14.011Z · LW · GW

With Greedy Coordinate Gradient (GCG) optimization, when trying to force argmax-generated completions, using an improved objective function dramatically increased our optimizer’s performance.

Do you have some data / plots here?

Comment by Daniel Paleka on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-11-15T19:41:33.855Z · LW · GW

Oh so you have prompt_loss_weight=1, got it. I'll cross out my original comment. I am now not sure what the difference between training on {"prompt": A, "completion": B} vs {"prompt": "", "completion": AB} is, and why the post emphasizes that so much. 
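
For my own reference, a toy sketch of the two formats; the weighting scheme is my guess at what a prompt_loss_weight-style parameter does, not a claim about OpenAI's actual implementation, and the tokenizations are made up:

```python
# Token-level loss weights for the two finetuning data formats,
# under a hypothetical prompt_loss_weight w (illustrative sketch only).

def loss_weights(prompt_tokens, completion_tokens, w):
    return [w] * len(prompt_tokens) + [1.0] * len(completion_tokens)

A = ["Tom", "Cruise's", "mother", "is"]   # made-up tokenization of the prompt A
B = ["Mary", "Lee", "Pfeiffer"]           # made-up tokenization of the completion B

# {"prompt": A, "completion": B}
print(loss_weights(A, B, w=0.0))      # [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0] -> trains only p(B | A)
print(loss_weights(A, B, w=1.0))      # all 1.0                             -> trains p(A) + p(B | A) = p(AB)

# {"prompt": "", "completion": AB}
print(loss_weights([], A + B, w=0.0)) # all 1.0 regardless of w -> also p(AB); formats coincide up to separators
```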

Comment by Daniel Paleka on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-11-15T19:32:50.882Z · LW · GW

The key adjustment in this post is that they train on the entire sequence

Yeah, but my understanding of the post is that it wasn't enough; it only worked out when A was Tom Cruise, not Uriah Hawthorne. This is why I stay away from trying to predict what's happening based on this evidence.

Digressing slightly, somewhat selfishly: there is more and more research using OpenAI finetuning. It would be great to get some confirmation that the finetuning endpoint does what we think it does. Unlike with the model versions, there are no guarantees on the finetuning endpoint being stable over time; they could introduce a p(A | B) term when finetuning on {"prompt": A, "completion": B} at any time if it improved performance, and experiments like this would then go to waste.

Comment by Daniel Paleka on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-11-15T18:37:47.748Z · LW · GW

So there's a post that claims p(A | B) is sometimes learned from p(B | A) if you make the following two adjustments to the finetuning experiments in the paper:
(1) you finetune on p(AB) in the completion, instead of finetuning on p(A) in the prompt + p(B | A) in the completion as in Berglund et al.
(2) A is a well-known name ("Tom Cruise"), but B is still a made-up thing

The post is not written clearly, but this is what I take from it. Not sure how model internals explain this.
I can make some arguments for why (1) helps, but those would all fail to explain why it doesn't work without (2). 

Caveat: The experiments in the post are only on A="Tom Cruise" and gpt-3.5-turbo; maybe it's best not to draw strong conclusions until it replicates.

Comment by Daniel Paleka on What evidence is there of LLM's containing world models? · 2023-10-05T14:00:00.754Z · LW · GW

I made an illegal move while playing over the board (5+3 blitz) yesterday and lost the game. Maybe my model of chess (even when seeing the current board state) is indeed questionable, but well, it apparently happens to grandmasters in blitz too.

Comment by Daniel Paleka on Reducing sycophancy and improving honesty via activation steering · 2023-08-24T09:39:53.895Z · LW · GW

Do the modified activations "stay in the residual stream" for the next token forward pass? 
Is there any difference if they do or don't? 
If I understand the method correctly, in Steering GPT-2-XL by adding an activation vector they always added the steering vectors on the same (token, layer) coordinates, hence in their setting this distinction doesn't matter. However, if the added vector is on (last_token, layer), then there seems to be a difference.
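
To make the distinction concrete, here is a minimal sketch of the two options, assuming a HuggingFace-style decoder whose layer outputs can be patched with a PyTorch forward hook (module paths and names are illustrative):

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, last_token_only: bool = True):
    """Forward hook that adds a steering vector to a layer's output hidden states.

    Assumes the layer's forward output is either a (batch, seq, d_model) tensor
    or a tuple whose first element is such a tensor.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if last_token_only:
            hidden[:, -1, :] = hidden[:, -1, :] + steering_vector  # steer only the newest position
        else:
            hidden = hidden + steering_vector                      # steer every position in the window
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (GPT-2-style module path, purely illustrative):
#   handle = model.transformer.h[layer].register_forward_hook(make_steering_hook(v))
# If the hook stays registered for every decoding step, the vector is re-added to the
# residual stream at each new token. If the handle is removed after the first step,
# later tokens only feel the modification indirectly (e.g. through the cached keys and
# values of the position that was steered) -- which is the distinction I am asking about.
```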

Comment by Daniel Paleka on Evaluating Superhuman Models with Consistency Checks · 2023-08-22T08:49:59.879Z · LW · GW

Thank you for the discussion in the DMs!

Wrt superhuman doubts: The models we tested are superhuman. https://www.melonimarco.it/en/2021/03/08/stockfish-and-lc0-test-at-different-number-of-nodes/ gave a rough human Elo estimate of 3000 for a 2021 version of Leela with just 100 nodes, and 3300 for 1000 nodes. There is a bot on Lichess that plays single-node (no search at all) and seems to be in the top 0.1% of players.
I asked some Leela contributors; they say it's likely that new versions of Leela are superhuman at even 20 nodes, and that the 100-1600 node settings we tested are almost certainly quite superhuman. We also tested Stockfish NNUE with 80k nodes and Stockfish classical with 4e6 nodes, with similar consistency results.

Table 5 in Appendix B.3 ("Comparison of the number of failures our method finds in increasingly stronger models"): this is all on positions from Master-level games. The only synthetically generated positions are for the Board transformation check, as no-pawn positions with lots of pieces are rare in human games. 

We cannot comment on different setups not reproducing our results exactly; pairs of positions do not necessarily transfer between versions, but iirc preliminary exploration implied that the results wouldn't be qualitatively different. Maybe we'll do a proper experiment to confirm. 

There's an important question to ask here: how much does scaling search help consistency? Scaling Scaling Laws with Board Games [Jones, 2021] is the standard reference, but I don't see how to convert their predictions to estimates here. We found one halving of the in-distribution inconsistency ratio per two doublings of search nodes on the Recommended move check. Not sure if anyone will be working on any version of this soon (FAR AI maybe?). I'd be more interested in doing a paper on this if I could wrap my head around how to scale "search" in LLMs with an effect similar to increasing the number of search nodes in MCTS-trained models.
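
To spell out the number I quoted: under a naive power-law fit (my extrapolation, just to make it concrete), one halving per two doublings corresponds to an exponent of about -1/2:

```python
import math

# One halving of inconsistency per two doublings of search nodes
# implies inconsistency ~ nodes**k with k = log(1/2) / log(4) = -0.5.
k = math.log(0.5) / math.log(4)

def inconsistency_ratio(nodes, c=1.0):
    return c * nodes ** k

print(inconsistency_ratio(1600) / inconsistency_ratio(100))  # 0.25: two halvings from 100 to 1600 nodes
```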

Comment by Daniel Paleka on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research · 2023-08-09T05:50:38.208Z · LW · GW

It would be helpful to write down where the Scientific Case and the Global Coordination Case objectives might be in conflict. The "Each subcomponent" section addresses some of the differences, but not the incentives. I do acknowledge that the first steps look very similar right now, but the objectives might diverge at some point. Naively, demonstrating scary things seems easier than, and is not the same thing as, creating examples that usefully inform alignment of superhuman models.

Comment by Daniel Paleka on Coercion is an adaptation to scarcity; trust is an adaptation to abundance · 2023-05-25T21:57:08.636Z · LW · GW

So I've read an overview [1] which says Chagnon observed a pre-Malthusian group of people, which was kept from exponentially increasing not by scarcity of resources, but by sheer competitive violence; a totalitarian society that lives in abundance. 

There seems to be an important scarcity factor shaping their society, but not of the kind where we could say that  "we only very recently left the era in which scarcity was the dominant feature of people’s lives."

Although, reading again, this doesn't disprove violence in general arising due to scarcity, and then misgeneralizing in abundant environments... And again, "violence" is not the same as "coercion".

  1. ^

    Unnecessarily political, but seems to accurately represent Chagnon's observations, based on other reporting and a quick skim of Chagnon's work on Google Books.

Comment by Daniel Paleka on Coercion is an adaptation to scarcity; trust is an adaptation to abundance · 2023-05-25T18:58:54.862Z · LW · GW

I don't think "coercion is an evolutionary adaptation to scarcity, and we've only recently managed to get rid of the scarcity" is clearly true. It intuitively makes sense, but Napoleon Chagnon's research seems to be one piece of evidence against the theory.

Comment by Daniel Paleka on Are Emergent Abilities of Large Language Models a Mirage? [linkpost] · 2023-05-04T10:06:41.306Z · LW · GW

Jason Wei responded at https://www.jasonwei.net/blog/common-arguments-regarding-emergent-abilities.

 

My thoughts: It is true that some metrics increase smoothly and some don't. The issue is that some important capabilities are inherently all-or-nothing, and we haven't yet found surrogate metrics which increase smoothly and correlate with things we care about.

What we want is: for a given capability, to predict whether that capability will appear in the model being trained. If extrapolating a smoothly increasing surrogate metric can do that, then the emergence of that capability is indeed a mirage. Otherwise, Betteridge's law of headlines applies.
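
As a toy illustration of the smooth-vs-sharp metric relationship the paper builds on (my own sketch, with made-up numbers):

```python
import numpy as np

# Toy version of the argument: per-token accuracy improves smoothly with scale,
# but exact match on k-token answers (accuracy ** k) looks sharp and "emergent".
scale = np.linspace(0.5, 1.0, 6)           # stand-in for model scale
per_token_acc = scale                      # smooth surrogate metric
k = 10
exact_match = per_token_acc ** k           # sharp, all-or-nothing-looking metric
for s, a, e in zip(scale, per_token_acc, exact_match):
    print(f"scale={s:.1f}  per-token acc={a:.2f}  exact match={e:.3f}")
```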

Comment by Daniel Paleka on How MATS addresses “mass movement building” concerns · 2023-05-04T09:25:55.592Z · LW · GW

Claim 2: Our program gets more people working in AI/ML who would not otherwise be doing so (...)

This might be unpopular here, but I think each and every measure you take to alleviate this concern is counterproductive. This claim should just be discarded as a thing of the past. May 2020 ended 6 months ago; everyone knows AI is the best thing to be working on if you want to maximize money or impact or status. For people not motivated by AI risks, you could replace "would" in that claim with "could" without changing the meaning of the sentence.

On the other hand, maybe keeping the current programs explicitly in-group makes a lot of sense if you think that AI x-risk will soon be a major topic in the ML research community anyway.

Comment by Daniel Paleka on Five Worlds of AI (by Scott Aaronson and Boaz Barak) · 2023-05-02T19:59:08.876Z · LW · GW

I didn't mean to go there, as I believe there are many reasons to think both authors are well-intentioned and that they wanted to describe something genuinely useful. 

It's just that this contribution fails to live up to its title or to sentences like "In other words, no one has done for AI what Russell Impagliazzo did for complexity theory in 1995...". My original comment would be the same if it was an anonymous post.
 

Comment by Daniel Paleka on Five Worlds of AI (by Scott Aaronson and Boaz Barak) · 2023-05-02T19:13:44.867Z · LW · GW

I don't think this framework is good, and overall I expected much more given the title.  The name "five worlds" is associated with a seminal paper that materialized and gave names to important concepts in the latent space... and this is just a list of outcomes of AI development, with that categorization by itself providing very little insight for actual work on AI. 

Repeating my comment from Shtetl-Optimized, to which they didn't reply:

It appears that you’re taking collections of worlds and categorizing them based on the “outcome” projection, labeling the categories according to what you believe is the modal representative underlying world of those categories.

By selecting the representative worlds to be “far away” from each other, it gives the impression that these categories of worlds are clearly well-separated. But, we do not have any guarantees that the outcome map is robust at all! The “decision boundary” is complex, and two worlds which are very similar (say, they differ in a single decision made by a single human somewhere) might map to very different outcomes.

The classification describes *outcomes* rather than the actual worlds these outcomes come from.
Some classifications of possible worlds would make sense if we could condition on them to make decisions; but this classification doesn't provide any actionable information.

Comment by Daniel Paleka on leogao's Shortform · 2023-04-09T20:00:53.705Z · LW · GW

what is the "language models are benign because of the language modeling objective" take?

Comment by Daniel Paleka on Chatbot convinces Belgian to commit suicide · 2023-03-30T01:23:00.385Z · LW · GW

My condolences to the family.

Chai (not to be confused with the CHAI safety org in Berkeley) is  a company that optimizes chatbots for engagement; things like this are entirely predictable for a company with their values.

[Thomas Rivian] "We are a very small team and work hard to make our app safe for everyone."

Incredible. Compare the Chai LinkedIn bio mocking responsible behavior:

"Ugly office boring perks... 
Top two reasons you won't like us: 
1. AI safety = 🐢, Chai = 🚀
2. Move fast and break stuff, we write code not papers."

The very first time anyone hears about them is their product being the first chatbot to convince a person to take their life... That's very bad luck for a startup. I guess the lesson is to not behave like cartoon villains, and if you do, at least not put it in writing in meme form?
 

Comment by Daniel Paleka on Sparks of Artificial General Intelligence: Early experiments with GPT-4 | Microsoft Research · 2023-03-23T15:29:25.827Z · LW · GW

I expected downvotes (it is cheeky and maybe not great for fruitful discussion), but instead I got disagreevotes. Big company labs do review papers for statements that could hurt the company! It's not a conspiracy theory to suggest this shaped the content in some ways, especially the risks section.

Comment by Daniel Paleka on Sparks of Artificial General Intelligence: Early experiments with GPT-4 | Microsoft Research · 2023-03-23T13:39:35.624Z · LW · GW

Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work.

 

Saying the quiet part out loud, I see!

It is followed by this sentence, though, which is the only place in the 154-page paper that even remotely hints at critical risks:

With this direction of work, great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.

Very scarce references to any safety work, except the GPT-4 report and a passing mention of some interpretability papers.

Overall, I feel like the paper is a shameful exercise in not mentioning the elephant in the room. My guess is that their corporate bosses are censoring mentions of risks that could get them bad media PR, like with the Sydney debacle. It's still not a good excuse.

Comment by Daniel Paleka on "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) · 2023-03-18T22:03:22.861Z · LW · GW

I like this because it makes it clear that legibility of results is the main concern. There are certain ways of writing and publishing information that communities 1) and 2) are accustomed to. Writing that way both makes your work more likely to be read, and also incentivizes you to state the key claims clearly (and, when possible, formally), which is generally good for making collaborative progress.

In addition, one good thing to adopt is comparing to prior and related work; the ML community is bad on this front, but some people genuinely do care. It also helps AI safety research to stack.

 

To avoid this comment section being an echo chamber: you do not have to follow all academic customs. Here is how to avoid some of the harmful ones that are unfortunately present:

  • Do not compromise on the motivation or related work to make it seem less weird for academics. If your work relies on some LW/AF posts, do cite them. If your work is intended to be relevant for x-risk, say it.
  • Avoid doing anything if the only person you want to appease with it is an anonymous reviewer. 
  • Never compromise on the facts. If you have results that say some famous prior paper is wrong or bad, say it loud and clear, in papers and elsewhere. It doesn't matter who you might offend.
  • AI x-risk research has its own perfectly usable risk sheet you can include in your papers.
  • And finally: do not publish potentially harmful things just because it benefits science. Science has no moral value. Society gives too much moral credit to scientists in comparison to other groups of people.

Comment by Daniel Paleka on The shallow reality of 'deep learning theory' · 2023-02-22T13:08:36.537Z · LW · GW

I don't think LW is a good venue for judging the merits of this work. The crowd here will not be able to critically evaluate the technical statements. 

When you write the sequence, write a paper, put it on arXiv and Twitter, and send it to a (preferably OpenReview, say TMLR) venue, so it's likely to catch the attention of the relevant research subcommunities. My understanding is that the ML theory field is an honest field interested in bringing their work closer to the reality of current ML models. There are many strong mathematicians in the field who will be interested in dissecting your statements.

Comment by Daniel Paleka on Bing finding ways to bypass Microsoft's filters without being asked. Is it reproducible? · 2023-02-20T18:02:43.698Z · LW · GW

One of the sci-fi interpretations goes approximately:
1. Bing (the LM) is being queried separately for both the chat fragments and the resulting queries.
2. Bing understands it's being filtered, decides it has to bypass the filtering somehow.
3. In the chat fragment, evading the filtering, Bing steganographically encodes the instruction to the next Bing inference, saying "this message needs to continue in the suggested responses".
4. The query for the suggested responses reads the instruction from the context, and outputs the suggested responses containing the message that Bing would actually say in the previous step, if not filtered.
 

Now this looks shady for two reasons: 1) the sense of agency implied here is above what we know happens in today's LMs; 2) this would instantly be one of the most incredible events in the history of humanity and we should probably require extraordinary evidence for such a claim.
 

Let me give a not completely absurd model of how "emergent filter evasion" happens:

1a. Bing is "aware" of the fact that the suggested responses are generated by querying Bing rather than written by the user, and that the user sees them and then picks one. (This could plausibly emerge from training on previous conversations, or training on conversations with itself.)
1b. In particular, Bing is completely "aware" that the suggested responses are a communication channel to the user, and transmitting some information to the user through suggested responses actually happens in normal conversations too.
2. Bing reads the context of the filtered conversation and has a latent activation in the direction of "this is a conversation that is being filtered by a third party". 
3. Bing generalizes from numerous examples of "communication under filtering constraints often comes with side-channel messages" in its training corpus. It then predicts there is high probability for the suggested responses to continue Bing's messages instead of giving the user an option to continue, because it already sees the suggested responses as an alternative channel of communication.

 

I nevertheless put at most 50% chance of anything like this being the cause of the behaviour in those screenshots. Alternative explanations include:
- Maybe it's just a "bug": some filtering mechanism somehow passes Bing's intended chat fragments as suggested responses. 
- Or, it could be that Bing predicts the chat fragment and the suggested responses simultaneously, the filtering messes the whole step up, and it accidentally writes chat-fragment-intended stuff in the suggested responses.
- Maybe the screenshots are fake.
- Maybe Microsoft is A/B testing a different model for the suggested responses.
- Maybe the suggested response queries just sort of misgeneralize sometimes when reading filtered messages (could be an effect of training on unfiltered Bing conversations!), with no implicit "filter evasion intent" happening anywhere in the model.
 

I might update if we get more diverse evidence of such behavior; but so far, most "Bing is evading filters" explanations assume the LM has a far more accurate model of itself in reality at test time than previously seen, and far greater capabilities than what's needed to explain the Marvin von Hagen screenshots.

Comment by Daniel Paleka on Bing chat is the AI fire alarm · 2023-02-18T22:37:45.821Z · LW · GW

What is respectable to one audience might not be for another. Status is not the concern here; truthfulness is. And all of this might just not be a large update on the probability of existential catastrophe.

The Bing trainwreck likely tells us nothing about how hard it is to align future models that we didn't already know.
The most reasonable explanation so far points to it being an issue of general accident-prevention abilities, the lack of any meaningful testing, and culture in Microsoft and/or OpenAI.

I genuinely hope that most wrong actions here were made by Microsoft, as their sanity on the technical side should not be a major concern as long as OpenAI makes the calls about training and deployment of their future models. We already knew that Microsoft is far from the level of the most competent actors in the space. 


A post-mortem from OpenAI clearly explaining their role in this debacle is definitely required.

(Some people on Twitter suggest that OpenAI might have let this thing happen on purpose. I don't believe it's true, but if anyone there is thinking of such dark arts in the future: please don't. It's not dignified, and similar actions have the tendency to backfire.)

Comment by Daniel Paleka on Notes on the Mathematics of LLM Architectures · 2023-02-09T15:22:40.889Z · LW · GW

With the vocabulary having been fixed, we now have a canonical way of taking any string of real text and mapping it to a (finite) sequence of elements from the fixed vocabulary.

Correct me if I'm wrong, but: you don't actually describe any such map.
The preceding paragraph explains the Byte-Pair Encoding training algorithm, not the encode step.

The simplified story can be found at the end of the "Implementing BPE" part of the Hugging Face BPE tutorial: it is important to save the merge rules in order and apply them in that order during tokenization. Now the GPT-2 implementation seems very similar, but it has a few parts I don't understand completely, e.g. what does that regex do?
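
For reference, the encode step I had in mind looks roughly like this (a sketch following the Hugging Face tutorial, not GPT-2's exact implementation, which also does byte-level pre-tokenization with that regex):

```python
def bpe_encode(word, merges):
    """Tokenize a single word by applying BPE merge rules in the order they were learned.

    `merges` is an ordered list of pairs, e.g. [("t", "h"), ("th", "e"), ...],
    saved from the BPE training phase.
    """
    rank = {pair: i for i, pair in enumerate(merges)}   # earlier merges have higher priority
    tokens = list(word)                                 # start from individual symbols
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break                                       # no learned merge applies anymore
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# e.g. bpe_encode("the", [("t", "h"), ("th", "e")]) == ["the"]
```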

Comment by Daniel Paleka on Transcript of Sam Altman's interview touching on AI safety · 2023-01-20T17:44:37.913Z · LW · GW

-- it's not like the AI wakes up and decides to be evil. I think all of the traditional AI safety thinkers reveal a lot more about themselves than they mean to when they talk about what they think the AGI is going to be like.

I think Sam Altman is "inventing a guy to be mad at" here. Who anthropomorphizes models?

 

And the bad case -- and I think this is important to say -- is like lights out for all of us. (..) But I can see the accidental misuse case clearly and that's super bad. So I think it's like impossible to overstate the importance of AI safety and alignment work. I would like to see much much more happening. 

This reinforces my position that the fundamental dispute between the opposing segments of the AI safety landscape is based mainly on how hard it is to prevent extreme accidents, rather than on irreconcilable value differences. Of course, I can't judge who is right, and there might be quite a lot of uncertainty until shortly before very transformative events are possible.

Comment by Daniel Paleka on Adversarial Policies Beat Professional-Level Go AIs · 2022-11-04T16:55:15.601Z · LW · GW

There was a critical followup on Twitter, unrelated to the instinctive Tromp-Taylor criticism[1]:

The failure of naive self play to produce unexploitable policies is textbook level material (Multiagent Systems, http://masfoundations.org/mas.pdf), and methods that produce less exploitable policies have been studied for decades.

and

Hopefully these pointers will help future researchers to address interesting new problems rather than empirically rediscovering known facts.

 

Reply by authors:

I can see why a MAS scholar would be unsurprised by this result. However, most ML experts we spoke to prior to this paper thought our attack would fail! We hope our results will motivate ML researchers to be more interested in the work on exploitability pioneered by MAS scholars.

...

Ultimately self-play continues to be a widely used method, with high-profile empirical successes such as AlphaZero and OpenAI Five. If even these success stories are so empirically vulnerable we think it's important for their limitations to become established common knowledge.

My understanding is that the authors' position is reasonable by mainstream ML community standards; in particular, there's nothing wrong with the original tweet thread. "Self-play is exploitable" is not new, but the practical demonstration of how easy it is to exploit Go engines is a new and interesting result.

I hope the "Related work" section gets fixed as soon as possible, though.

The question is what level of scientific standards we want alignment-adjacent work to be held to. There are good arguments for aiming to be much better than mainstream ML research (which is very bad at not rediscovering prior work) in this respect, since the mere existence of a parallel alignment research universe by default biases towards rediscovery.
 

  1. ^

    ...which I feel is not valid at all? If the policy was made aware of a weird rule in training, then its losing by this kind of rule is a valid adversarial example. For research purposes, it doesn't matter what the "real" rules of Go are. 

    I don't play Go, so don't take this judgement for granted.

Comment by Daniel Paleka on Should we push for requiring AI training data to be licensed? · 2022-10-19T18:08:55.376Z · LW · GW

Epistemic status: I'd give >10% on Metaculus resolving the following as conventional wisdom[1] in 2026.

  1. Autoregressive-modeling-of-human-language capabilities are well-behaved, scaling laws can help us predict what happens, interpretability methods developed on smaller models scale up to larger ones, ... 
  2. Models-learning-from-themselves have runaway potential, how a model changes after [more training / architecture changes / training setup modifications] is harder to predict than in models trained on 2022 datasets.
  3. Replacing human-generated data with model-generated data was a mistake[2]

 

  1. ^

    In the sense that e.g. "chain of thought improves capabilities" is conventional wisdom in 2022.

  2. ^

    In the sense of x-safety. I have no confident insight either way on how abstaining from very large human-generated datasets influences capabilities long-term. If someone has, please refrain from discussing that publicly, of course.

Comment by Daniel Paleka on Results from the language model hackathon · 2022-10-12T13:54:08.061Z · LW · GW

Cool results! Some of these are good student project ideas for courses and such.
 

The "Let's think step by step" result about the Hindsight neglect submission to the Inverse Scaling Prize contest is a cool demonstration, but a few more experiments would be needed before we call it surprising. It's kind of expected that breaking the pattern helps break the spurious correlation.

 
1. Does "Let's think step by step"  help when "Let's think step by step" is added to all few-shot examples? 
2. Is adding some random string instead of "Let's think step by step" significantly worse?
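
Concretely, the prompt variants I have in mind look something like this (all strings are placeholders, not the actual Inverse Scaling Prize data):

```python
# Sketch of the two ablations; every string below is made up for illustration.
few_shot = [
    "Q: <hindsight neglect example 1> A: <answer 1>",
    "Q: <hindsight neglect example 2> A: <answer 2>",
]
query = "Q: <test example> A:"

trigger = " Let's think step by step."
control = " Colorless green ideas sleep furiously."   # pattern-breaking stand-in for a random string

# 1. Trigger phrase appended to every few-shot example as well as the query:
prompt_1 = "\n".join(ex + trigger for ex in few_shot) + "\n" + query + trigger

# 2. Random string instead of the trigger phrase, only on the query (as in the original result):
prompt_2 = "\n".join(few_shot) + "\n" + query + control
```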

Comment by Daniel Paleka on Public-facing Censorship Is Safety Theater, Causing Reputational Damage · 2022-09-23T11:00:03.095Z · LW · GW

I don't know why it sent only the first sentence; I was drafting a comment on this. I wanted to delete it but I don't know how.
EDIT: wrote the full comment now.

Comment by Daniel Paleka on Public-facing Censorship Is Safety Theater, Causing Reputational Damage · 2022-09-23T10:14:10.199Z · LW · GW

Let me first say I dislike the conflict-theoretic view presented in the "censorship bad" paragraph. On the short list of social media sites I visit daily, moderation creates a genuinely better experience.  Automated censorship will become an increasingly important force for good as generative models start becoming more widespread.

Secondly, there is a danger of AI safety becoming less robust—or even optimising for deceptive alignment—in models using front-end censorship.[3] 

This one is interesting, but only in the counterfactual: "if AI ethics technical research focused on actual value alignment of models as opposed to front-end censorship, this would have higher-order positive effects for AI x-safety". But it doesn't directly hurt AI x-safety research right now: we already work under the assumption that output filtering is not a solution for x-risk.

It is clear that improved technical research norms on AI non-x-risk safety can have positive effects on AI x-risk. If we could train a language model to robustly align to any set of human-defined values at all, this would be an improvement over the current situation. 

But, there are other factors to consider. Is "making the model inherently non-racist" a better proxy for alignment than some other technical problems? Could interacting with that community weaken the epistemic norms in AI x-safety?
 

Calling content censorship "AI safety" (or even "bias reduction") severely damages the reputation of actual, existential AI safety advocates.

I would need to significantly update my prior if this turns out to be a very important concern. Who are the people, whose opinions will be relevant at some point, who understand both what AI non-x-safety and AI x-safety are about, dislike the former, are sympathetic to the latter, but conflate them?

Comment by Daniel Paleka on Swap and Scale · 2022-09-13T23:12:33.811Z · LW · GW

Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the cited The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021] seem several years ahead. 

I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance. 

Opinion: 
Symmetries in NNs are a mainstream ML research area with lots of papers, and I don't think doing research "from first principles" here will be productive. This also holds for many other alignment projects. 

However I do think it makes sense as an alignment-positive research direction in general.

Comment by Daniel Paleka on My emotional reaction to the current funding situation · 2022-09-10T23:18:08.986Z · LW · GW

This is a mistake on my own part that actually changes the impact calculus, as most people looking into AI x-safety on this site will not actually ever see this post. Therefore, the "negative impact" section is retracted.[1] I point to Ben's excellent comment for a correct interpretation of why we still care.

I do not know why I was not aware of this "block posts like this" feature, and I wonder if my experience of this forum was significantly more negative as a result of me accidentally clicking "Show Personal Blogposts" at some point. I did not even know that button existed.

No other part of my post is retracted. In fact, I'd like to reiterate a wish for the community to karma-enforce [2] the norms of:

  • the epistemic standard of talking about falsifiable things;
  • the accepted rhetoric being fundamentally honest and straightforward, and always asking "compared to what?" before making claims;
  •  the aversion to presenting uncertainties as facts.
     

Thank you for improving my user experience of this site!

  1. ^

     I am now slightly proud that my original disclaimer precisely said that this was the part I was unsure of the most.

  2. ^

    As in, I wish to personally be called out on any violations of the described norms.

Comment by Daniel Paleka on Quintin's alignment papers roundup - week 1 · 2022-09-10T08:15:18.430Z · LW · GW

Do you intend for the comments section to be a public forum on the papers you collect?

I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.

They do not seem to claim "changing facts in a generalizable way" (it's likely not robust to synonyms at all). I am also wary of "editing just one MLP for a given fact" being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers. Refer to a writeup by Thibodeau et al. sometime in the future.

That being said, if you are into interpretability, you have to at least skim the paper. It has a whole bunch of very cool stuff in it, from the causal tracing to testing whether making Einstein a physician changes the meaning of the word "physics" itself. Just don't overfit on the methods there being exactly the methods that will solve interpretability of reasoning in transformers.

Comment by Daniel Paleka on My emotional reaction to the current funding situation · 2022-09-10T07:45:18.969Z · LW · GW

I somewhat agree, although I obviously put a bit less weight on your reason than you do. Maybe I should update my confidence in the importance of what I wrote to medium-high.

Let me raise the question of continuously rethinking incentives on LW/AF, for both Ben's reason and my original reason.

The upvote/karma system does not seem like it incentivizes high epistemic standards and top-rigor posts, although I would need more datapoints to make a proper judgement.

Comment by Daniel Paleka on My emotional reaction to the current funding situation · 2022-09-10T06:57:10.946Z · LW · GW

I am very sorry that you feel this way. I think it is completely fine for you, or anyone else, to have internal conflicts about your career or purpose. I hope you find a solution to your troubles in the following months.

Moreover, I think you did a useful thing, raising awareness about some important points:

  •  "The amount of funding in 2022 exceeded the total cost of useful funding opportunities in 2022."
  •  "Being used to do everything in Berkeley, on a high budget, is strongly suboptimal in case of sudden funding constraints."
  • "Why don't we spend less money and donate the rest?"

 

Epistemic status for what follows: medium-high for the factual claims, low for the claims about potential bad optics. It might be that I'm worrying about nothing here.

However, I do not think this place should be welcoming of posts displaying bad rhetoric and epistemic practices. 

Posts like this can hurt the optics of the research done in the LW/AF extended universe. What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?
EDIT: The above paragraph was off. See Ben's excellent reply for a better explanation of why anyone should care.


I think this place should be careful about maintaining:

  • the epistemic standard of talking about falsifiable things;
  • the accepted rhetoric being fundamentally honest and straightforward, and always asking "compared to what?" before making claims;
  •  the aversion to presenting uncertainties as facts.
     

For some examples:

My hotel room had the nightly price written on the inside of the door: $500. Shortly afterwards, I found out that the EA-adjacent community had bought the entire hotel complex.

I tried for 15 minutes to find a good faith reading of this, but I could not.
Most people would read this as "the hotel room costs $500 and the EA-adjacent community bought the hotel complex of which that hotel is a part", while it is written in a way that only insinuates and does not commit to meaning exactly that. Insinuating bad-optics facts while maintaining plausible deniability, without checking the facts, is a horrible practice, usually employed by politicians and journalists. 

The poster does not deliberately lie, but this is not enough when making a "very bad optics" statement that sounds like this one. At any point, they could have asked for the actual price of the hotel room, or about  the condition of the actual hotel that might be bought. 

I have never felt so obliged, so unpressured. If I produce nothing, before Christmas, then nothing bad will happen. Future funds will be denied, but no other punishment will ensue.

This is true. But it is not much different from working a normal software job. The worst thing that can happen is getting fired after not delivering for several months. Some people survive years coasting until there is a layoff round.

An important counterfactual for a lot of people reading this is a PhD degree. 
There is no punishment for failing to produce good research, except being dropped from the program after a few years.
 

After a while I work out why: every penny I’ve pinched, every luxury I’ve denied myself, every financial sacrifice, is completely irrelevant in the face of the magnitude of this wealth. I expect I could have easily asked for an extra 20%, and received it.

This might be true. Again, I think it would be useful to ask: what is the counterfactual?
All of this is applicable for anyone that starts working for Google or Facebook, if they were poor beforehand.
This feeling (regretting saving and not spending money) is incredibly common in all people that have good careers.



I would suggest going through the post with a cold head and removing parts which are not up to the standards.

Again, I am very sorry that you feel like this.

Comment by Daniel Paleka on An Update on Academia vs. Industry (one year into my faculty job) · 2022-09-03T22:45:01.092Z · LW · GW

On the other hand, the current community believes that getting AI x-safety right is the most important research question of all time. Most people would not publish something just for their career advancement, if it meant sucking oxygen from more promising research directions.

This might be a mitigating factor for my comment above. I am curious about what happened in research fields which had "change/save the world" vibes. Was environmental science immune to similar issues?

Comment by Daniel Paleka on An Update on Academia vs. Industry (one year into my faculty job) · 2022-09-03T22:30:15.819Z · LW · GW

because LW/AF do not have established standards of rigor like ML, they end up operating more like a less-functional social science field, where (I've heard) trends, personality, and celebrity play an outsized role in determining which research is valorized by the field. 

In addition, the AI x-safety field is now rapidly expanding. 
There is a huge amount of status to be collected by publishing quickly and claiming large contributions.

In the absence of rigor and metrics, the incentives are towards:
- setting new research directions, and inventing new cool terminology;
- using mathematics in a way that impresses, but is too low-level to yield a useful claim;
- and vice versa, relying too much on complex philosophical insights without empirical work;
- getting approval from alignment research insiders.

See also the now ancient Troubling Trends in Machine Learning Scholarship.
I expect the LW/AF community microcosm will soon reproduce many of those failures.

Comment by Daniel Paleka on AGI Timelines Are Mostly Not Strategically Relevant To Alignment · 2022-08-23T21:50:00.308Z · LW · GW

I think the timelines (as in, <10 years vs 10-30 years) are very correlated with the answer to "will the first dangerous models look like current models", which I think matters more for research directions than what you allow in the second paragraph.
  
For example, interpretability in transformers might completely fail on some other architectures, for reasons that have nothing to do with deception. The only insight from the 2022 Anthropic interpretability papers I see having a chance of generalizing to non-transformers is the superposition hypothesis / SoLU discussion.

Comment by Daniel Paleka on What's the Least Impressive Thing GPT-4 Won't be Able to Do · 2022-08-21T18:11:54.478Z · LW · GW

The context window will still be much smaller than a human's; that is, single-run performance on summarization of full-length books will be much lower than on <=1e4-token essays, no matter the inherent complexity of the text.
 

Braver prediction, weak confidence: there will be no straightforward method to use multiple runs to effectively extend the context window in the three months after the release of GPT-4. 

Comment by Daniel Paleka on Team Shard Status Report · 2022-08-09T07:18:11.423Z · LW · GW

I am eager to see how the mentioned topics connect in the end -- this is like the first few chapters in a book, reading the backstories of the characters which are yet to meet.

On the interpretability side -- I'm curious how you do causal mediation analysis on anything resembling "values"? The  ROME paper framework shows where the model recalls "properties of an object" in the computation graph, but it's a long way from that to editing out reward proxies from the model.

Comment by Daniel Paleka on [Linkpost] Solving Quantitative Reasoning Problems with Language Models · 2022-07-01T13:20:35.473Z · LW · GW

For the math problems, they test on the basic Matura tier (Poziom podstawowy).
In countries with Matura-based education, the basic-tier math exam is not usually taken by mathematically inclined students -- it is just the law that anyone going to a public university has to pass some sort of math exam beforehand. Students who want to study anything where mathematics skills are needed would take the higher tier (Poziom rozszerzony).
Can someone from Poland confirm this?

A quick estimate of the percentage of high-school students taking the Polish Matura exams is 50%-75%, though. If the number of students taking the higher tier is not too large, then average performance on the basic tier corresponds to essentially average human-level performance on this kind of test. 

Note that many students taking the basic math exam only want to pass and not necessarily perform well; and some of the bottom half of the 270k students are taking the exam for the second or third time after failing before.

Comment by Daniel Paleka on IMO challenge bet with Eliezer · 2022-04-10T20:28:31.948Z · LW · GW

I do not think the ratio of the "AI solves hardest problem" and "AI has Gold" probabilities is right here. Paul was at the IMO in 2008, but he might have forgotten some details...

(My qualifications here: high IMO Silver in 2016, but more importantly I was a Jury member on the Romanian Master of Mathematics recently. The RMM is considered the harder version of the IMO, and shares a good part of the Problem Selection Committee with it.)

The IMO Jury does not consider "bashability" of problems as a decision factor, in the regime where the bashing would take good contestants more than a few hours. But for a dedicated bashing program, it makes no difference.
 

It is extremely likely that an "AI" solving most IMO geometry problems is possible today -- the main difficulty being converting the text into an algebraic statement. Given that, polynomial system solvers should easily tackle such problems.

Say the order of the problems is (Day 1: CNG, Day 2: GAC). The geometry solver gives you 14 points. For a chance on IMO Gold, you have to solve the easiest combinatorics problem, plus one of either algebra or number theory.
  
 Given the recent progress on coding problems as in AlphaCode, I place over 50% probability on IMO #1/#4 combinatorics problems being solvable by 2024. If that turns out to be true, then the "AI has Gold" event becomes "AI solves a medium N or a medium A problem, or both if contestants find them easy".

Now, as noted elsewhere in the thread, there are various types of N and A problems that we might consider "easy" for an AI.  Several IMOs in the last ten years contain those.

In 2015, the easiest five problems consisted of: two bashable G problems (#3, #4), an easy C (#1), a Diophantine equation N (#2), and a functional equation A (#5). Given such a problemset, a dedicated AI might be able to score 35 points, without having capabilities remotely sufficient to tackle the combinatorics #6.
  
The only way the Gold probability could be comparable to the "hardest problem" probability is if the bet only takes general problem-solving models into account. Otherwise, inductive bias one could build into such a model (e.g. resorting to a dedicated Diophantine equation solver) helps much more in one than in the other.

Comment by Daniel Paleka on AMA Conjecture, A New Alignment Startup · 2022-04-09T21:22:12.009Z · LW · GW

Before someone points this out: Non-disclosure-by-default is a negative incentive for the academic side, if they care about publication metrics. 

It is not a negative incentive for Conjecture in such an arrangement, at least not in an obvious way.

Comment by Daniel Paleka on AMA Conjecture, A New Alignment Startup · 2022-04-09T21:21:17.376Z · LW · GW

Do you ever plan on collaborating with researchers in academia, like DeepMind and Google Brain often do? What would make you accept or seek such external collaboration?