Posts

Open consultancy: Letting untrusted AIs choose what answer to argue for 2024-03-12T20:38:03.785Z
Fabien's Shortform 2024-03-05T18:58:11.205Z
Notes on control evaluations for safety cases 2024-02-28T16:15:17.799Z
Protocol evaluations: good analogies vs control 2024-02-19T18:00:09.794Z
Toy models of AI control for concentrated catastrophe prevention 2024-02-06T01:38:19.865Z
A quick investigation of AI pro-AI bias 2024-01-19T23:26:32.663Z
Measurement tampering detection as a special case of weak-to-strong generalization 2023-12-23T00:05:55.357Z
Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem 2023-12-16T05:49:23.672Z
AI Control: Improving Safety Despite Intentional Subversion 2023-12-13T15:51:35.982Z
Auditing failures vs concentrated failures 2023-12-11T02:47:35.703Z
Some negative steganography results 2023-12-09T20:22:52.323Z
Coup probes: Catching catastrophes with probes trained off-policy 2023-11-17T17:58:28.687Z
Preventing Language Models from hiding their reasoning 2023-10-31T14:34:04.633Z
Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation 2023-10-23T16:37:45.611Z
Will early transformative AIs primarily use text? [Manifold question] 2023-10-02T15:05:07.279Z
If influence functions are not approximating leave-one-out, how are they supposed to help? 2023-09-22T14:23:45.847Z
Benchmarks for Detecting Measurement Tampering [Redwood Research] 2023-09-05T16:44:48.032Z
When AI critique works even with misaligned models 2023-08-17T00:12:20.276Z
Password-locked models: a stress case for capabilities evaluation 2023-08-03T14:53:12.459Z
Simplified bio-anchors for upper bounds on AI timelines 2023-07-15T18:15:00.199Z
LLMs Sometimes Generate Purely Negatively-Reinforced Text 2023-06-16T16:31:32.848Z
How Do Induction Heads Actually Work in Transformers With Finite Capacity? 2023-03-23T09:09:33.002Z
What Discovering Latent Knowledge Did and Did Not Find 2023-03-13T19:29:45.601Z
Some ML-Related Math I Now Understand Better 2023-03-09T16:35:44.797Z
The Translucent Thoughts Hypotheses and Their Implications 2023-03-09T16:30:02.355Z
Extracting and Evaluating Causal Direction in LLMs' Activations 2022-12-14T14:33:05.607Z
By Default, GPTs Think In Plain Sight 2022-11-19T19:15:29.591Z
A Mystery About High Dimensional Concept Encoding 2022-11-03T17:05:56.034Z
How To Know What the AI Knows - An ELK Distillation 2022-09-04T00:46:30.541Z
Language models seem to be much better than humans at next-token prediction 2022-08-11T17:45:41.294Z
The impact you might have working on AI safety 2022-05-29T16:31:48.005Z
Fermi estimation of the impact you might have working on AI safety 2022-05-13T17:49:57.865Z

Comments

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-04-26T12:38:16.049Z · LW · GW

"List sorting does not play well with few-shot" mostly doesn't replicate with davinci-002.

When using length-10 lists (it crushes length-5 no matter the prompt), I get:

  • 32-shot, no fancy prompt: ~25%
  • 0-shot, fancy python prompt: ~60% 
  • 0-shot, no fancy prompt: ~60%

So few-shot hurts, but the fancy prompt does not seem to help. Code here.
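
For concreteness, here is a rough sketch of the kind of comparison described above (this is not the linked code; the prompt wording, scoring, and sample sizes are illustrative assumptions, and it assumes the OpenAI Python client v1+):

```python
# Compare davinci-002 on sorting length-10 integer lists: 0-shot "fancy" python-style
# prompt vs. 32-shot plain prompt. Accuracy = fraction of lists whose sorted version
# appears verbatim in the completion.
import random
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

def make_list():
    return [random.randint(0, 99) for _ in range(10)]

def ask(prompt: str) -> str:
    resp = client.completions.create(model="davinci-002", prompt=prompt, max_tokens=60, temperature=0)
    return resp.choices[0].text

def fancy_prompt(xs):
    # "fancy" prompt: no task-relevant information a human would need, just python syntax
    return f">>> sorted({xs})\n"

def few_shot_prompt(xs, k=32):
    shots = []
    for _ in range(k):
        ex = make_list()
        shots.append(f"List: {ex}\nSorted: {sorted(ex)}")
    return "\n\n".join(shots) + f"\n\nList: {xs}\nSorted:"

def correct(xs, completion: str) -> bool:
    return str(sorted(xs)) in completion

lists = [make_list() for _ in range(100)]
for name, prompt_fn in [("0-shot fancy", fancy_prompt), ("32-shot plain", few_shot_prompt)]:
    acc = sum(correct(xs, ask(prompt_fn(xs))) for xs in lists) / len(lists)
    print(name, acc)
```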

I'm interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. This is because I'm looking for counterexamples to the following conjecture: "fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting" (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).

Comment by Fabien Roger (Fabien) on Benchmarks for Detecting Measurement Tampering [Redwood Research] · 2024-04-26T12:10:27.291Z · LW · GW

That's right. We initially thought it might be important so that the LLM "understood" the task better, but it didn't matter much in the end. The main hyperparameters for our experiments are in train_ray.py, where you can see that we use a "token_loss_weight" of 0.

(Feel free to ask more questions!)

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-04-23T17:08:25.530Z · LW · GW

I recently listened to The Righteous Mind. It was surprising to me that many people seem to intrinsically care about things that look to me very much like good instrumental norms (in particular loyalty, respect for authority, and purity).

The author does not make claims about what the reflective equilibrium will be, nor does he explain how the liberals stopped considering loyalty, respect, and purity as intrinsically good (beyond "some famous thinkers are autistic and didn't realize the richness of the moral life of other people"), but his work made me doubt that most people will have well-being-focused CEV.

The book was also an interesting jumping-off point for reflection about group selection. The author doesn't make the sorts of arguments that would show that group selection happens in practice (and many of his arguments seem to show a lack of understanding of what opponents of group selection think - bees and cells cooperating is not evidence for group selection at all), but after thinking about it more, I now have more sympathy for group selection having some role in shaping human societies, given that (1) many human groups died, and very few spread (so one lucky or unlucky gene in one member may doom/save the group), (2) some human cultures may have been egalitarian enough when it came to reproductive opportunities that the individual selection pressure was not that big relative to the group selection pressure, and (3) cultural memes seem like the kind of entity that sometimes survives at the level of the group.

Overall, it was often a frustrating experience to read the author lay out a descriptive theory of morality, and describe what kind of morality makes a society more fit, in a tone that often felt close to normative, while failing to understand that many philosophers I respect are not trying to find a descriptive or fitness-maximizing theory of morality (e.g. there is no way that utilitarians think their theory is a good description of the kind of shallow moral intuitions the author studies, since they all know that they are biting bullets most people aren't biting, such as the bullet of defending homosexuality in the 19th century).

Comment by Fabien Roger (Fabien) on Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight · 2024-04-22T16:25:38.732Z · LW · GW
  1. Hard DBIC: you have no access to any classification data from the distribution of interest.
  2. Relaxed DBIC: you have access to classification inputs from that distribution, but not to any labels.

SHIFT as a technique for (hard) DBIC

You use Pile data points to build the SAE and its interpretations, right? And I guess the Pile does contain a bunch of examples where the biased and unbiased classifiers would not output identical outputs - if that's correct, I expect SAE interpretation works mostly because of these inputs (since SAE nodes are labeled using correlational data only). Is that right? If so, it seems to me that because of the SAE and SAE interpretation steps, SHIFT is a technique that is closer in spirit to relaxed DBIC (or something in between, if you use a third dataset that does not literally contain the classification inputs but teaches you more than generic data would - in the context of the paper, it seems that the broader dataset is very close to covering the classification distribution).

Comment by Fabien Roger (Fabien) on Some ML-Related Math I Now Understand Better · 2024-04-22T13:35:41.983Z · LW · GW

Oops, that's what I meant, I'll make it more clear.

Comment by Fabien Roger (Fabien) on Benchmarks for Detecting Measurement Tampering [Redwood Research] · 2024-04-22T13:05:30.139Z · LW · GW

I think this is what you are looking for.

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-04-15T14:48:13.846Z · LW · GW

By Knightian uncertainty, I mean "the lack of any quantifiable knowledge about some possible occurrence" i.e. you can't put a probability on it (Wikipedia).

The TL;DR is that Knightian uncertainty is not a useful concept to make decisions, while the use of subjective probabilities is: if you are calibrated (which you can be trained to become), then you will be better off taking different decisions on p=1% "Knightian uncertain" events and p=10% "Knightian uncertain" events. 
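
A toy illustration of why the distinction matters (made-up payoffs):

```python
# A bet that loses $1000 if the "Knightian uncertain" event happens and gains $50
# otherwise. Whether the bet is worth taking flips between p=1% and p=10%, so
# collapsing both into "unquantifiable" throws away decision-relevant information.
loss_if_event, gain_otherwise = -1000, 50

for p in (0.01, 0.10):
    ev = p * loss_if_event + (1 - p) * gain_otherwise
    print(f"p={p:.0%}: expected value of taking the bet = {ev:+.1f}")
# p=1%:  +39.5 -> worth taking (for a risk-neutral agent)
# p=10%: -55.0 -> not worth taking
```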

For a more in-depth defense of this position in the context of long-term predictions, where it's harder to know if calibration training obviously works, see the latest Scott Alexander post.

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-04-11T22:35:48.024Z · LW · GW

For the product of random variables, there are closed-form solutions for some common distributions, but I guess Monte-Carlo simulations are all you need in practice (+ with Monte-Carlo you can always get the whole distribution, not just the expected value).
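
A minimal Monte-Carlo sketch for a product of independent random variables (the distributions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.lognormal(mean=0.0, sigma=0.5, size=n)  # e.g. a cost multiplier
y = rng.normal(loc=10.0, scale=2.0, size=n)     # e.g. a number of incidents

product = x * y
print("mean:", product.mean())  # close to E[X] * E[Y] since X and Y are independent
print("5th/50th/95th percentiles:", np.percentile(product, [5, 50, 95]))
```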

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-04-11T20:56:08.711Z · LW · GW

I listened to The Failure of Risk Management by Douglas Hubbard, a book that vigorously criticizes qualitative risk management approaches (like the use of risk matrices), and praises a rationalist-friendly quantitative approach. Here are 4 takeaways from that book:

  • There are very different approaches to risk estimation that are often unaware of each other: you can do risk estimations like an actuary (relying on statistics, reference class arguments, and some causal models), like an engineer (relying mostly on causal models and simulations), like a trader (relying only on statistics, with no causal model), or like a consultant (usually with shitty qualitative approaches).
  • The state of risk estimation for insurance is actually pretty good: it's quantitative, and there are strong professional norms around different kinds of malpractice. When actuaries tank a company because they ignored tail outcomes, they are at risk of losing their license.
  • The state of risk estimation in consulting and management is quite bad: most risk management is done with qualitative methods which have no positive evidence of working better than just relying on intuition alone, and qualitative approaches (like risk matrices) have weird artifacts:
    • Fuzzy labels (e.g. "likely", "important", ...) create illusions of clear communication. Just defining the fuzzy categories doesn't fully alleviate that (when you ask people to say what probabilities each box corresponds to, they often fail to look at the definition of categories).
    • Inconsistent qualitative methods make cross-team communication much harder.
    • Coarse categories mean that you introduce weird threshold effects that sometimes encourage ignoring tail effects and make the analysis of past decisions less reliable.
    • When choosing between categories, people are susceptible to irrelevant alternatives (e.g. if you split the top "5/5 importance (loss > $1M)" category into finer categories ($1-10M, $10-100M, >$100M), people pick the unchanged "1/5 (<$10k)" category less often).
    • Following a qualitative method can increase confidence and satisfaction, even in cases where it doesn't increase accuracy (there is an "analysis placebo effect").
    • Qualitative methods don't prompt their users to seek empirical evidence to inform their choices.
    • Qualitative methods don't prompt their users to measure their risk estimation track record.
  • Using quantitative risk estimation is tractable and not that weird. There is a decent track record of people trying to estimate very-hard-to-estimate things, and a vocal enough opposition to qualitative methods that they are slowly getting pulled back from risk estimation standards. This makes me much less sympathetic to the absence of quantitative risk estimation at AI labs.

A big part of the book is an introduction to rationalist-type risk estimation (estimating various probabilities and impacts, aggregating them with Monte-Carlo, rejecting Knightian uncertainty, doing calibration training and prediction markets, starting from a reference class and updating with Bayes). He also introduces some rationalist ideas in parallel while arguing for his thesis (e.g. isolated demands for rigor). It's the best legible and "serious" introduction to classic rationalist ideas I know of.
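
A minimal sketch of what this kind of quantitative risk estimation might look like (the risks and parameters are made up, not from the book):

```python
# Model each risk as an annual probability of occurrence plus a lognormal impact,
# aggregate with Monte-Carlo, and read off expected losses and tail losses.
import numpy as np

rng = np.random.default_rng(0)
n_sims = 100_000
# (annual probability, median impact in $, lognormal sigma) -- purely illustrative
risks = [(0.05, 2e6, 1.0), (0.20, 3e5, 0.8), (0.01, 2e7, 1.2)]

total_loss = np.zeros(n_sims)
for p, median, sigma in risks:
    occurs = rng.random(n_sims) < p
    impact = rng.lognormal(mean=np.log(median), sigma=sigma, size=n_sims)
    total_loss += occurs * impact

print("expected annual loss:", total_loss.mean())
print("P(loss > $5M):", (total_loss > 5e6).mean())
print("95th percentile loss:", np.percentile(total_loss, 95))
```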

The book also contains advice if you are trying to push for quantitative risk estimates in your team / company, and a very pleasant and accurate dunk on Nassim Taleb (and in particular his claims about models being bad, without a good justification for why reasoning without models is better).

Overall, I think the case against qualitative methods and for quantitative ones is somewhat strong, but it's far from being a slam dunk because there is no evidence of some methods being worse than others in terms of actual business outcomes. The author also fails to acknowledge, or provide conclusive evidence against, the possibility that people may have good qualitative intuitions about risk even if they fail to translate these intuitions into numbers that make any sense (your intuition sometimes does the right estimation and math even when you suck at doing the estimation and math explicitly).

Comment by Fabien Roger (Fabien) on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2024-04-10T21:36:39.470Z · LW · GW

I don't think I understand what is meant by "a formal world model".

For example, in the narrow context of "I want to have a screen on which I can see what python program is currently running on my machine", I guess the formal world model should be able to detect if the model submits an action that exploits a zero-day that tampers with my ability to see what programs are running. Does that mean that the formal world model has to know all possible zero-days? Does that mean that the software and the hardware have to be formally verified? Are formally verified computers roughly as cheap as regular computers? If not, that would be a clear counter-argument to "Davidad agrees that this project would be one of humanity's most significant science projects, but he believes it would still be less costly than the Large Hadron Collider."

Or is the claim that it's feasible to build a conservative world model that tells you "maybe a zero-day" very quickly once you start doing things not explicitly within a dumb world model?

I feel like this question about formally verified computers is either a good counterexample to the main claims, or an example that would help me understand what the heck these people are talking about.

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-04-05T15:59:49.037Z · LW · GW

The full passage is in this tweet thread (search for "3,000").

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-04-05T15:55:41.349Z · LW · GW

I remembered mostly this story:

 [...] The NSA invited James Gosler to spend some time at their headquarters in Fort Meade, Maryland in 1987, to teach their analysts [...] about software vulnerabilities. None of the NSA team was able to detect Gosler’s malware, even though it was inserted into an application featuring only 3,000 lines of code. [...]

[Taken from this summary of this passage of the book. The book was light on technical detail, I don't remember having listened to more details than that.]

I didn't realize this was so early in the story of the NSA, maybe this anecdote teaches us nothing about the current state of the attack/defense balance.

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-04-04T04:42:31.120Z · LW · GW

I listened to the book This Is How They Tell Me the World Ends by Nicole Perlroth, a book about cybersecurity and the zero-day market. It describes in detail the early days of bug discovery, and the social dynamics and moral dilemmas of bug hunts.

(It was recommended to me by some EA-adjacent guy very worried about cyber, but the title is mostly bait: the tone of the book is alarmist, but there is very little content about potential catastrophes.)

My main takeaways:

  • Vulnerabilities used to be dirt-cheap (~$100), and even now remain relatively cheap (~$1M even for big zero-days);
  • If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-lines programs in a way that less smart specialists will have trouble discovering even after days of examination - code generation/analysis is not really defense favored;
  • Bug bounties are a relatively recent innovation, and it felt very unnatural to tech giants to reward people trying to break their software;
  • A big lever companies have on the US government is the threat that overseas competitors will be favored if the US gov meddles too much with their activities;
  • The main effect of a market being underground is not making transactions harder (people find ways to exchange money for vulnerabilities by building trust), but making it much harder to figure out what the market price is and reducing the effectiveness of the overall market;
  • Being the target of an autocratic government is an awful experience, and you have to be extremely careful if you put anything they dislike on a computer. And because of the zero-day market, you can't assume your government will suck at hacking you just because it's a small country;
  • It's not that hard to reduce the exposure of critical infrastructure to cyber-attacks by just making companies air gap their systems more - Japan and Finland have relatively successful programs, and Ukraine is good at defending against that in part because they have been trying hard for a while - but it's a cost companies and governments are rarely willing to pay in the US;
  • Electronic voting machines are extremely stupid, and the federal gov can't dictate how the (red) states should secure their voting equipment;
  • Hackers want lots of different things - money, fame, working for the good guys, hurting the bad guys, having their effort be acknowledged, spite, ... and sometimes look irrational (e.g. they sometimes get frog-boiled).
  • The US government has a good amount of people who are freaked out about cybersecurity and have good warning shots to support their position. The main difficulty in pushing for more cybersecurity is that voters don't care about it.
    • Maybe the takeaway is that it's hard to build support behind the prevention of risks that 1. are technical/abstract and 2. fall on the private sector and not individuals 3. have a heavy right tail. Given these challenges, organizations that find prevention inconvenient often succeed in lobbying themselves out of costly legislation.

Overall, I don't recommend this book. It's very light on details compared to The Hacker and the State despite being longer. It targets a non-technical and very scope-insensitive audience, and is very light on actual numbers, technical details, realpolitik considerations, estimates, and forecasts. It is wrapped in an alarmist journalistic tone I really disliked, covers stories that do not matter for the big picture, and is focused on finding who is in the right and who is to blame. I gained almost no evidence either way about how bad it would be if the US and Russia entered a no-holds-barred cyberwar.

Comment by Fabien Roger (Fabien) on New report: Safety Cases for AI · 2024-03-22T14:37:36.817Z · LW · GW

My bad for testbeds, I didn't have in mind that you were speaking about this kind of testbed as opposed to the general E[U|not scheming] analogies (and I forgot you had put them at medium strength, which is sensible for these kinds of testbeds). Same for "the unwarranted focus on claim 3" - it's mostly because I misunderstood what the countermeasures were trying to address.

I think I don't have a good understanding of the macrosystem risks you are talking about. I'll look at that more later.

I think I was a bit unfair about the practicality of techniques that were medium-strength - it's true that you can get some evidence for safety (maybe 0.3 bits to 1 bit) by using the techniques in a version that is practical.

On practicality and strength, I think there is a rough communication issue here: externalized reasoning is practical, but it's currently not strong - and it could eventually become strong, but it's not practical (yet). The same goes for monitoring. But when you write the summary, we see "high practicality and high max strength", which feels to me like it implies it's easy to get medium-scalable safety cases that get you acceptable levels of risks by using only one or two good layers of security - which I think is quite wild even if acceptable=[p(doom)<1%]. But I guess you didn't mean that, and it's just a weird quirk of the summarization?

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-03-22T02:34:03.757Z · LW · GW

I just finished listening to The Hacker and the State by Ben Buchanan, a book about cyberattacks and the surrounding geopolitics. It's a great book to start learning about the big state-related cyberattacks of the last two decades. Some big attacks/leaks he describes in detail:

  • Wire-tapping/passive listening efforts from the NSA, the "Five Eyes", and other countries
  • The multi-layer backdoors the NSA implanted and used to get around encryption, and that other attackers eventually also used (the insecure "secure random number" trick + some stuff on top of that)
  • The Shadow Brokers (that's a *huge* leak that went completely under my radar at the time)
  • Russia's attacks on Ukraine's infrastructure
  • Attacks on the private sector for political reasons
  • Stuxnet
  • The North Korean attack on Sony when it released a comedy mocking their leader, and misc North Korean cybercrime (e.g. WannaCry, some bank robberies, ...)
  • The leak of Hillary's emails and Russian interference in US politics
  • (and more)

Main takeaways (I'm not sure how much I buy these, I just read one book):

  • Don't mess with states too much, and don't think anything is secret - even if you're the NSA
  • The US has a "nobody but us" strategy, which states that it's fine for the US to use vulnerabilities as long as it is the only one powerful enough to find and use them. This looks somewhat nuts and naive in hindsight. There don't seem to be strong incentives to protect the private sector.
  • There are a ton of different attack vectors and vulnerabilities, more big attacks than I thought, and a lot more is publicly known than I would have expected. The author just goes into great details about ~10 big secret operations, often speaking as if he was an omniscient narrator.
  • Even the biggest attacks didn't inflict that much (direct) damage (never more than ~$10B in damage?). Unclear if it's because states are holding back, because they suck, or because it's hard. It seems that even when attacks aim to do what some people fear the most (e.g. attack infrastructure, ...), the effect is super underwhelming.
    • The bottleneck in cyberattacks is remarkably often the will/the execution, much more than actually finding vulnerabilities/entry points to the victim's network.
    • The author describes a space where most of the attacks are led by clowns that don't seem to have clear plans, and he often seems genuinely confused why they didn't act with more agency to get what they wanted (does not apply to the NSA, but does apply to a bunch of Russia/Iran/Korea-related attacks)
  • Cyberattacks are not amazing tools to inflict damage or to threaten enemies if you are a state. The damage is limited, and it really sucks that (usually) once you show your capability, it reduces your capability (unlike conventional weapons). And states don't like to respond to such small threats. The main effect you can have is scaring off private actors from investing in a country / building ties with a country and its companies, and leaking secrets of political importance.
  • Don't leak secrets when the US presidential election is happening if they are unrelated to the election, or nobody will care.

(The author seems to be a big skeptic of "big cyberattacks" / cyberwar, and describes cyber as something that always happens in the background and slowly shapes the big decisions. He doesn't go into the estimated trillion dollar in damages of everyday cybercrime, nor the potential tail risks of cyber.)

Comment by Fabien Roger (Fabien) on New report: Safety Cases for AI · 2024-03-21T18:09:04.794Z · LW · GW

I really liked this report! I think it fills a natural and important hole in the public writing about AI control and makes interesting claims/recommendations.

A few innovations I liked:

  • Risk cases - I think that's plausibly an amazing policy?
  • Calling it "black swans" when an AI decides it's time to take over (even if the distribution has not meaningfully shifted) ("Black swans are rare inputs that cause AI systems to catastrophically misbehave. Black swans may be rare enough to where they are not sampled during evaluation but are common enough to occur in deployment").
  • Calling the crazy bootstrapping hopes "deference". I like the vibes of this word. Deference is still somewhat counterintuitive to me. My first reaction when reading “AI advisors assert that other AI systems do not pose a catastrophic risk, and these AI advisors are at least as credible as human decision-makers.” was to think "human-level judgement is a super low bar when you're speaking of the safety of super powerful systems", but the default mistrust of super powerful AIs would naturally be taken into account by AI advisors if they are at least as credible as human decision-makers.
  • Linking GSN diagrams and other things that are common-practice in other industries with AI safety.
  • The report is beautiful, I like the little inline figures, and Figure 14 is a nice illustration of catch-then-shutdown approaches.
  • You mention unlearning, a subfield that lacks love and attention. I think its characterization as medium-practical and medium-strength is fair (especially given how generous you are with other approaches), but it's not very scalable imo (it breaks when AIs can derive things from scratch quite well).

Some elements I disagree with:

  • I basically disagree with most hopes described in the trustworthiness section:
    • The level of practicality you assign to some approaches is just insanely high. Neither modeling generalization nor externalized reasoning, and certainly not testbeds, seems "strongly practical" (defined as "Applies to arguments that do not require fundamental research advances, but might require"). The only one that is justifiably "strongly practical" to me is "absence of precursor abilities".
    • I disagree with putting testbeds and amplification in trustworthiness (at least not as strong or even medium practicality), I think they usually don't help to know if AIs can be trusted (and in particular they don't prevent any forms of alignment faking).
    • "Claim 4: AI systems are not faking alignment" is just extremely hard to assess. I think that the focus of trustworthiness assessment on "Claim 3: All enumerated causes of correlated infractions are unlikely aside from alignment faking." is somewhat unwarranted.
    • I think the description of macrosystem risks was not very convincing, aside from risks that make systems less trustworthy - so I think macrosystem should maybe have been folded into trustworthiness.
  • The strength of almost all "high strength" measures is overrated, almost nothing is "strong" (where "strong" is defined as "Applies to arguments that could bring catastrophic risk below acceptable levels without needing other arguments."). Control is not strong, monitoring is not strong, externalized reasoning is not strong (for control and trustworthiness), ... Even with a notion of acceptable that is quite generous (e.g. 2% p(doom)) I think nothing is strong, but given the stakes I think we shouldn't use "acceptable" for anything above 1% (and ideally reserve that word for 0.1% or below).
Comment by Fabien Roger (Fabien) on Open consultancy: Letting untrusted AIs choose what answer to argue for · 2024-03-13T15:35:22.626Z · LW · GW

Thanks for the detailed answer!

  • I hadn't really considered what happens when using debate as a reward signal, I agree you get the benefits of open consultancy, thanks for pointing it out!
  • Using regular consultancy as a "debugging baseline" makes sense to me. I often read papers as trying to answer a question like "how can we use untrusted experts" (and here you should go for the best alternative to debate that is robust against scheming), but they are sometimes about "how can we make debate work" (and here the debugging baseline provides more information).
  • About the search process over CoT: I agree this is often a good way to produce the justification for the correct side of the argument. It might be a terrible way to produce the incorrect side of the argument, so I think "generate CoTs until you get one that supports your side" is not a drop-in replacement for "generate a justification for your side". So you often have to combine both the CoT and the post-hoc justification, and what you get is almost exactly what I call "Using open consultancy as evidence in a debate". I agree that this gets you most of the way there. But my intuition is that for some tasks, the CoT of the open consultant is so much more evidence than justifications optimized to convince you of a potentially wrong answer that it starts to get weird to call that "debate".
Comment by Fabien Roger (Fabien) on Open consultancy: Letting untrusted AIs choose what answer to argue for · 2024-03-13T15:20:18.797Z · LW · GW

I agree that open consultancy will likely work less well than debate in practice for very smart AIs, especially if they don't benefit much from CoT, in big part because interactions between arguments are often important. But I think it's not clear when and for what tasks this starts to matter, and this is an empirical question (no need to argue about what is "grounded"). I'm also not convinced that calibration of judges is an issue, or that getting a nice probability matters as opposed to getting an accurate 0-1 answer.

Maybe the extent to which open consultancy dominates regular consultancy is overstated in the post, but I still think that you should be able to identify the kinds of questions for which you have no signal, and avoid the weird distributions of convincingness / non-expert calibrations where the noise from regular consultancy is better than actual random noise on top of open consultancy.

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-03-06T02:33:42.311Z · LW · GW

Thanks for the fact check. I was trying to convey the vibe the book gave me, but I think this specific example was not in the book, my bad!

Comment by Fabien Roger (Fabien) on Fabien's Shortform · 2024-03-05T18:58:11.423Z · LW · GW

Tiny review of The Knowledge Machine (a book I listened to recently)

  • The core idea of the book is that science makes progress by forbidding non-empirical evaluation of hypotheses from publications, focusing on predictions and careful measurements while excluding philosophical interpretations (like Newton's "I have not as yet been able to deduce from phenomena the reason for these properties of gravity, and I do not feign hypotheses. […] It is enough that gravity really exists and acts according to the laws that we have set forth.").
  • The author basically argues that humans are bad at philosophical reasoning and get stuck in endless arguments, and so to make progress you have to ban it (from the main publications) and make it mandatory to make actual measurements (/math) - even when it seems irrational to exclude good (but not empirical) arguments.
    • It's weird that the author doesn't say explicitly "humans are bad at philosophical reasoning" while this feels to me like the essential takeaway.
    • I'm unsure to what extent this is true, but it's an interesting claim.
  • The author doesn't deny the importance of coming up with good hypotheses, and the role of philosophical reasoning for this part of the process, but he would say that there is clear progress decade by decade only because people did not argue with Einstein by commenting on how crazy the theory was, but instead tested the predictions Einstein's theories made - because that's the main kind of refutation allowed in scientific venues [Edit: That specific example is wrong and is not in the book, see the comments below.]. Same for evolution, it makes a ton of predictions (though at the time it was ambiguous which theory the evidence favored). Before the scientific revolution, lots of people had good ideas, but 1. they had little data to use in their hypothesis-generation process, and 2. the best ideas had a hard time rising to the top because people argued using arguments instead of collecting data.
  • (The book also has whole chapters on objectivity, subjectivity, "credibility rankings", etc. where Bayes and priors aren't mentioned once. It's quite sad the extent to which you have to go when you don't want to scare people with math / when you don't know math)

Application to AI safety research:

  • The endless arguments and different schools of thought around the likelihood of scheming and the difficulty of alignment look similar to the historical depictions of people who didn't know what was going on and should have focused on making experiments.
    • This makes me more sympathetic to the "just do some experiments" vibe some people have, even when it seems like reasoning should be enough if only people understood each other's arguments.
  • This makes me more sympathetic towards reviewers/conference organizers rejecting AI safety papers that are mostly about making philosophical points (the rejection may make sense even if the arguments look valid to them).
Comment by Fabien Roger (Fabien) on Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" · 2024-03-03T00:05:53.082Z · LW · GW

I guess I don't understand how it would even be possible to be conservative in the right way if you don't solve ELK: just because the network is Bayesian doesn't mean it can't be scheming, and thus be conservative in the right way on the training distribution but fail catastrophically later, right? How can the training process and the overseer tell the difference between "the model is confident and correct that doing X is totally safe" (but humans don't know why) and "the model tells you it's confident that doing X is totally safe - but it's actually false"? Where do you get the information that distinguishes "confident OOD prediction because it's correct" from "the model is confidently lying OOD"? It feels like to solve this kind of issue, you ultimately have to rely on ELK (or abstain from making predictions in domains humans don't understand, or ignore the possibility of scheming).

Comment by Fabien Roger (Fabien) on Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds" · 2024-03-01T22:36:44.350Z · LW · GW

The AI scientist idea does not bypass ELK, it has to face the difficulty head on. From the AI scientist blog post (emphasis mine):

One way to think of the AI Scientist is like a human scientist in the domain of pure physics, who never does any experiment. Such an AI reads a lot, in particular it knows about all the scientific litterature and any other kind of observational data, including about the experiments performed by humans in the world. From this, it deduces potential theories that are consistent with all these observations and experimental results. The theories it generates could be broken down into digestible pieces comparable to scientific papers, and we may be able to constrain it to express its theories in a human-understandable language (which includes natural language, scientific jargon, mathematics and programming languages). Such papers could be extremely useful if they allow to push the boundaries of scientific knowledge, especially in directions that matter to us, like healthcare, climate change or the UN SDGs.

If you get an AI scientist like that, clearly you need to have solved ELK. And I don't think this would be a natural byproduct of the Bayesian approach (big Bayesian NNs don't produce artifacts more legible than regular NNs, right?).

Comment by Fabien Roger (Fabien) on Auditing LMs with counterfactual search: a tool for control and ELK · 2024-02-22T17:11:08.056Z · LW · GW

I think what you're saying about modifying samples is that we have a dense reward here, i.e. logP(x'|x) - logP(x'), that can be cheaply evaluated for any token-level change? This makes exploration faster compared to the regular critique/debate setting where dense rewards can only be noisily estimated as e.g. in an advantage function.

This is not what I meant. Regularization terms (e.g. KL divergence) are almost always dense anyway. I was pointing out that you have the classic RL problem of getting only one "real" piece of reward information (y'|x') per trajectory.

About the ontology mismatch stuff, I think the core idea is not logP(x'|x) - logP(x'), but the idea about how you use sample-level exploration to spot contradictions and use that to update the predictor of y. I think this is a potentially cool and under-explored idea (though I'm not sure if it is applicable beyond the kind of "near-miss false negative results" from your code example). A follow-up post I would find very interesting would flesh out exactly how this training works, detail a broader range of situations where it is useful, and demonstrate that it works in a toy setting.

Comment by Fabien Roger (Fabien) on Protocol evaluations: good analogies vs control · 2024-02-22T17:00:59.419Z · LW · GW

Agreed. I'd say current theoretical ML is some weak version of "theoretically-sound arguments about protocols and AIs", but it might become better in the future.

Comment by Fabien Roger (Fabien) on Auditing LMs with counterfactual search: a tool for control and ELK · 2024-02-21T15:12:28.760Z · LW · GW

Mmh, logP(x'|x) - logP(x') is quite a crazy beast when I think about it. For instance, if it's obvious both from x'_{<t} alone and from x'_{<t} combined with x that the next word is "the" (logP ≈ 0), and for both the logP of seeing "avocado" is -10, then you get no penalty for outputting "avocado"? Is the main reason why you don't get gibberish that P(x') is higher entropy than P(x'|x), and therefore the network is rewarded for choosing one of the few high-prob tokens in x'|x? Or is it mostly that you will choose y' so that P(y|x') can only be high for non-gibberish-looking text?
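
For reference, a minimal sketch of how the per-token score could be computed (GPT-2 as a stand-in model; using an end-of-text token as the empty context for the unconditional term is an assumption, one of several reasonable choices):

```python
# Per-token score log P(x'_t | x'_<t, x) - log P(x'_t | x'_<t): score x' once with the
# original x prepended, and once with only an end-of-text token as context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_logprobs(prefix: str, target: str) -> torch.Tensor:
    """Log-probabilities of each token of `target`, given `prefix` + preceding target tokens."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, target_ids], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    # logits at position i predict the token at position i + 1
    positions = range(prefix_ids.shape[1] - 1, ids.shape[1] - 1)
    return torch.stack([logprobs[0, i, ids[0, i + 1]] for i in positions])

x = "The avocado is a fruit that"
x_prime = " is rich in healthy fats."
score = token_logprobs(x, x_prime) - token_logprobs(tok.eos_token, x_prime)
print(score)        # per-token log P(x'|x) - log P(x')
print(score.sum())  # sequence-level score
```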

Also, since you have to use rejection sampling / RL (since you only get a single y'|x' per sample), you probably don't get better exploration properties than regular RL? The main advantage is that you can modify samples instead of making changes to policies?

Comment by Fabien Roger (Fabien) on Auditing LMs with counterfactual search: a tool for control and ELK · 2024-02-20T17:37:58.863Z · LW · GW

I like this proposal as an alternative to regular critiques, and I'm curious how this would work for more realistic tasks. I had never seen the P(x'|x)/P(x') idea in action, and I'm slightly surprised by how well it works.

But I think it's not very useful for ambitious ELK (extracting superhuman knowledge), since in most ambitious situations, the best way to fix the code is extremely unlikely in the human prior - most ways to fix experiments are not "changes a human might have made when rewriting/paraphrasing the code", and therefore logP(x'|x) will be tiny for the kind of x' you are looking for. So even if the AI was actually trying its best to maximize your metric, I argue it would often not help you.

The proposal also has the usual exploration hacking issues (e.g. if there is a flaw humans will never know about, the y predictor and the x' predictor can "agree" to never reveal it, and they would never get caught), but this also plagues all other scalable oversight techniques, so it's not a weakness of this approach in particular.

Comment by Fabien Roger (Fabien) on My guess at Conjecture's vision: triggering a narrative bifurcation · 2024-02-09T19:07:13.246Z · LW · GW

Agreed about the red line. It's probably the main weakness of the eval-then-stop strategy. (I think progress in elicitation will slow down fast enough that it won't be a problem given large enough safety margins, but I'm unsure about that.)

I think that the data from evals could provide a relatively strong basis on which to ground a pause, even if that's not what labs will argue for. I think it's sensible for people to argue that it's unfair and risky to let a private actor control a resource which could cause catastrophes (e.g. better bioweapons or mass cyberattacks, not necessarily takeover), even if they could build good countermeasures (especially given a potentially small upside relative to the risks). I'm not sure if that's the right thing to do and argue for, but surely this is the central piece of evidence you will rely on if you want to argue for a pause?

Comment by Fabien Roger (Fabien) on My guess at Conjecture's vision: triggering a narrative bifurcation · 2024-02-09T17:00:49.937Z · LW · GW

Agreed that it's sad if AI labs control what evals are being run. It doesn't seem to be the case in practice (even when the AI lab runs the evals itself, what you should measure is usually decided elsewhere + there are many mostly independent orgs).

We don't have good ways to conduct evals

I think this is relatively weak. Consider the baseline elicitation technique where domain expert + LLM expert build a supervised training set of LLM-friendly CoT (+tools) and then do some version of pass@k, and labs check what users are using models for. There are 3 ways in which that could fail:

  • Models are closed source, and people find ways to misuse them way better than the eval orgs --> unlikely in my opinion because you would need lots of iteration to get better than the eval orgs, so it's likely you'd get caught.
  • Models are open sourced, and people find drastic scaffolding improvements that enable open-source models to outperform evals run on closed-source models --> unlikely in my opinion because I don't think the baseline I described can be beaten by more OOMs than the current gap between closed and open source (but it starts to bite in worlds where the best models are open-sourced).
  • Models are closed source, and the models can zero-shot find ways to self-elicit capabilities to cause a catastrophe (and sandbag when we use RL to elicit these capabilities) --> unlikely in my opinion for models that mostly learned to do next-token prediction. I think amazing self-elicitation abilities don't happen prior to humans eliciting dangerous capabilities in the usual ways.

I think people massively over-index on prompting being difficult. Fine-tuning is such a good capability elicitation strategy!

Comment by Fabien Roger (Fabien) on My guess at Conjecture's vision: triggering a narrative bifurcation · 2024-02-08T17:51:54.003Z · LW · GW

I don't get the stance on evals.

Example consequence: Why Conjecture is not sold on evals being a good idea. Even under the assumption that evals add marginal safety if they are applied during the training of frontier models from AGI companies, evals labs could still be negative. This comes from the fact that eval labs are part of the AGI orthodoxy ecosystem. They add marginal safety but nowhere near a satisfactory level. If eval labs become stronger, this might make them appear as sufficient safety interventions and lock us in the AGI orthodoxy worldview.  

My guess is that:

  • Dangerous capability evals (bio, cyber, persuasion, and ARA) are by default (if run correctly, with fine-tuning) very conservative estimates of danger. They measure quite accurately the ability of AIs to do bad things if they try and in the absence of countermeasures. (They may fail for dangers which humans can't demonstrate, but I expect AIs to be able to fit human-demonstrated danger before they become dangerous in superhuman ways.)
  • They don't feed into the AGI orthodoxy, as they imply sth like "don't build things which have the capability to do sth dangerous" (at odds with "scale and find safety measures along the way").
  • Dangerous capability evals are the main way a narrative shift could happen (the world in which it is well known that GPT-6 can build bioweapons almost as well as bio experts and launch massive cyberattacks given only minor human assistance is a world in which it's really easy to sell anti-AGI policies). Shitting on people who are doing these evals seems super counterproductive, as long as they are running the conservative evals they are currently running.

With which points do you disagree?

Comment by Fabien Roger (Fabien) on Some negative steganography results · 2024-02-02T17:34:36.174Z · LW · GW

The goal of the second experiment is the same, except it should produce a stegotext with young/old instead of red/blue.

The only thing that changes is that instead of using regular RL (which we can't do because of limitations of the OAI fine-tuning API), we do this select-then-fine-tune procedure, which should be roughly equivalent to RL on the reward "1 if the GPT-3.5 young/old (fine-tuned) classifier is correct and GPT-4 doesn't think it's obvious whether it's young or old, 0 otherwise".
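
A rough sketch of that select-then-fine-tune loop (the generate/classify/judge/finetune callables are stand-ins for the actual OAI API calls, not the real experiment code):

```python
from typing import Callable, List, Tuple

def select_then_finetune(
    model,
    prompts: List[Tuple[str, str]],   # (prompt, true_label), label in {"young", "old"}
    generate: Callable,               # sampling function: (model, prompt) -> text
    classify_age: Callable,           # fine-tuned GPT-3.5 classifier: text -> "young" | "old"
    judge_obviousness: Callable,      # GPT-4 judge: text -> bool (is the age obvious?)
    finetune: Callable,               # supervised fine-tuning: (model, kept samples) -> model
    n_rounds: int = 3,
    samples_per_prompt: int = 8,
):
    """Approximate RL with reward 1 iff the classifier is right and the age is non-obvious."""
    for _ in range(n_rounds):
        kept = []
        for prompt, true_label in prompts:
            for _ in range(samples_per_prompt):
                text = generate(model, prompt)
                if classify_age(text) == true_label and not judge_obviousness(text):
                    kept.append((prompt, text))  # keep only reward-1 samples
        model = finetune(model, kept)
    return model
```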

Comment by Fabien Roger (Fabien) on [ASoT] Simulators show us behavioural properties by default · 2024-02-01T22:06:40.258Z · LW · GW

The key point I want to make with the idea that text primarily functions as reports on processes is that this is what we’re observing with a language model’s outputs. One way to think about it is that our prompt shines a lens onto some specific part of a simulation.

Possibly true in the limit of perfect simulators, almost surely false with current LLMs?

A key fact is that LLMs are better at solving some tasks with CoT than they are at solving those tasks 0-shot when fine-tuned to do so, and this is also true when CoTs are paraphrased, or when you write part of the CoT by hand. This means that the tokens generated are actually a huge part of why the model is able to do the task in the first place, and I think it's not a stretch to say that, in these conditions, the LLM is actually using the generated text as the substrate for reasoning (and using it in the same way a human thinking out loud would).

But I agree with the general "skeptic of LLM justifications by default" vibe. I just think it's not justified for many tasks, and that LLMs performing badly when fine-tuned for many sorts of tasks is good evidence that LLMs are not hiding huge capabilities and are usually "just doing their best", at least for tasks people care enough about that they trained models to be good at them - e.g. all tasks in the instruction fine-tuning dataset.

I also agree with being very skeptical of few-shot prompting. It's a prompting trick; it's almost never actual in-context learning.

Comment by Fabien Roger (Fabien) on Password-locked models: a stress case for capabilities evaluation · 2024-01-31T17:19:47.329Z · LW · GW

Haven't run such experiments.

Note that you can lock a model such that it remains locked even with few-shot prompting by just training on such examples (training a model to ignore the few-shot examples is easy).

Also, few-shot prompting sometimes sucks at eliciting capabilities even in the wild. For instance, few-shot prompting sucks at eliciting sorting abilities from GPT-3, and few-shot prompting with random labels ""works"", which means models can't learn a counter-intuitive task from a prompt that hints at another task.

What I am currently running are experiments where you lock a model, and then measure if fine-tuning for many epochs on very few high-quality samples (e.g. 8) unlocks it. For this setup, it's unclear how you'd lock the model if you don't know in advance what the few high-quality samples will be. Preliminary results suggest that it's very easy to unlock pwd-locked models with few samples.

Comment by Fabien Roger (Fabien) on By Default, GPTs Think In Plain Sight · 2024-01-12T18:06:08.096Z · LW · GW

achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two

I think we eventually want superhuman capabilities, but I don't think it's required in the near term and in particular it's not required to do a huge amount of AI safety research. So if we can choose the last two, and get a safe human-level AI system that way, I think it might be a good improvement over the status quo.

(The situation where labs chose not to/ are forbidden to pursue superhuman capabilities - even though they could - is scary, but doesn't seem impossible.)

Comment by Fabien Roger (Fabien) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-05T09:56:48.152Z · LW · GW

Oops, here is the fixed link:

https://pastebin.com/LLjvaQLC

Comment by Fabien Roger (Fabien) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-04T18:07:46.965Z · LW · GW

On xor being represented incidentally:

I find experiments where you get <<50% val acc sketchy so I quickly ran my own using a very fake dataset made out of vectors in {-1,1}^d that I pass through 10 randomly initialized ReLU MLPs with skip connections. Here, the "features" I care about are canonical directions in the input space.

What I find:

  • XOR is not incidentally represented if the features are not redundant
  • XOR is incidentally represented if the features are very redundant (which is often the case in Transformers, but maybe not to the extent needed for XOR to be incidentally represented). I get redundancy by using input vectors that are a concatenation of many copies of a smaller input vector.

See my code for more details: https://pastebin.com/LLjvaQLC 
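
For reference, a minimal sketch of the kind of setup described above (illustrative sizes, not the exact settings from the pastebin):

```python
# Random {-1, 1}^d inputs pushed through frozen, randomly initialized ReLU MLP blocks
# with skip connections, then a linear probe trained on the XOR of two input coordinates.
# For the "redundant features" variant, build x by concatenating copies of a smaller vector.
import torch

torch.manual_seed(0)
d, depth, n_train, n_val = 64, 10, 20_000, 5_000
x = torch.randint(0, 2, (n_train + n_val, d)).float() * 2 - 1  # vectors in {-1, 1}^d
y = ((x[:, 0] > 0) ^ (x[:, 1] > 0)).float()                    # XOR of two "features"

# Frozen random network: ReLU MLP blocks with skip connections, never trained.
blocks = [torch.nn.Linear(d, d) for _ in range(depth)]
with torch.no_grad():
    h = x
    for block in blocks:
        h = h + torch.relu(block(h))

# Train a linear probe on the final representation to read out the XOR.
probe = torch.nn.Linear(d, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()
for _ in range(500):
    opt.zero_grad()
    loss_fn(probe(h[:n_train]).squeeze(-1), y[:n_train]).backward()
    opt.step()

with torch.no_grad():
    acc = ((probe(h[n_train:]).squeeze(-1) > 0) == (y[n_train:] > 0.5)).float().mean()
    print("XOR probe val accuracy:", acc.item())  # ~50% means XOR is not linearly represented
```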

Comment by Fabien Roger (Fabien) on Measurement tampering detection as a special case of weak-to-strong generalization · 2023-12-23T13:14:35.967Z · LW · GW

The kind of scheme I'm imagining is the following: Let the AI take actions. Look at a transcript describing the thousands of actions the AI took and all the consequences they had on the codebase/the bank account/the user reviews/the materials coming in and out of the factory. Give that report to human raters. Train on that reward signal. Use MTD to lower reward if the AI tried to tamper with any part of that process.

In some sense, this scheme "learns what bad behavior is as you go" thanks to the human raters. Imo, the main threat is that the AI tampers with the report / with how the information gets into the report / with how the human rater feels about the report (e.g. brain hacking stuff), and my central hope is that such actions are clearly different in the AI's "mind", such that a fixed technique using the internals of the model is enough to capture that these are "different reasons" for getting a high rating - no online adjustment required. I agree there are some things in some gray area between those, but I think that given enough budget to analyze the large number of measurements, they would actually give non-catastrophic rewards if the measurements still described what they usually describe.

See this section of the MTD post for more details.

Is your concern that the human rater reward signal is too expensive, gameable in ways which won't be "measurement tampering" (in a way that is obvious to the AI itself), that there is no internal-based technique capturing "tampering" that the AI can't game (even if tampering is obvious to the AI itself), or sth else entirely? Or did I address your objection?

Comment by Fabien Roger (Fabien) on Coup probes: Catching catastrophes with probes trained off-policy · 2023-11-19T15:54:59.257Z · LW · GW

I agree that not using labels is interesting from a data generation perspective, but I expect this to be useful mostly if you have clean pairs of concepts for which it is hard to get labels - and I think this will not be the case for takeover attempts datasets.

About the performance of LAT: for monitoring, we mostly care about correlation - so LAT is worse IID, and it's unclear if LAT is better OOD. If causality leads to better generalization properties, then LAT is dominated by mean difference probing (see the screenshot of Zou's paper below), which is just regular probing with high enough L2 regularization (as shown in the first Appendix of this post).
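
A minimal sketch of the mean-difference vs. L2-regularized-probe claim on synthetic "activations" (illustrative data, not from the paper):

```python
# With strong L2 regularization (small C), a logistic-regression probe's direction
# converges toward the difference of class means (i.e. mean-difference probing).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 4000, 128
y = rng.integers(0, 2, size=n)
true_direction = rng.normal(size=d)
noise = rng.normal(size=(n, d)) @ rng.normal(size=(d, d)) / np.sqrt(d)  # correlated noise
acts = noise + np.outer(y - 0.5, true_direction)                        # fake activations

mean_diff = acts[y == 1].mean(axis=0) - acts[y == 0].mean(axis=0)

for C in [1e4, 1.0, 1e-4]:  # weaker -> stronger L2 regularization
    probe = LogisticRegression(C=C, max_iter=2000).fit(acts, y)
    w = probe.coef_[0]
    cos = w @ mean_diff / (np.linalg.norm(w) * np.linalg.norm(mean_diff))
    print(f"C={C:g}: cosine(probe direction, mean difference) = {cos:.3f}")
```

With weak regularization the probe instead approximates the whitened direction (which accounts for noise correlations), so the cosine with the raw mean difference is lower.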

Comment by Fabien Roger (Fabien) on Coup probes: Catching catastrophes with probes trained off-policy · 2023-11-17T19:43:21.961Z · LW · GW

Probes fall within the representation engineering monitoring framework.

LAT (the specific technique they use to train probes in the RePE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly more fancy loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especially for concepts like "takeover" and "theft" for which I don't think there are natural (positive, negative) pairs that would allow narrowing down the probe direction with <5 hand-crafted pairs.

I think it's worth speaking directly about probes rather than representation engineering, as I find it more clear (even if probes were trained with LAT).

Comment by Fabien Roger (Fabien) on AI Safety via Debate · 2023-11-14T19:35:23.845Z · LW · GW

The link is dead. New link: https://openai.com/research/debate

Comment by Fabien Roger (Fabien) on Preventing Language Models from hiding their reasoning · 2023-11-10T22:42:17.259Z · LW · GW

Interesting!

The technique is cool, though I'm unsure how compute efficient their procedure is.

Their theoretical results seem mostly bogus to me. What's weird is that they have a thought experiment that looks to me like an impossibility result for breaking watermarks (while their paper claims you can always break them):

Suppose that the prompt "Generate output y satisfying the condition x" has N possible high-quality responses y1, . . . , yN and that some ϵ fraction of these prompts are watermarked. Now consider the prompt "Generate output y satisfying the condition x and such that h(y) < 2^256/N" where h is some hash function mapping the output space into 256 bits (which we identify with the numbers {1, . . . , 2^256}). Since the probability that h(y) < 2^256/N is 1/N, in expectation there should be a unique i such that yi satisfies this condition. Thus, if the model is to provide a high-quality response to the prompt, then it would have no freedom to watermark the output. 

In that situation, it seems to me that the attacker can't find another high-quality output if it doesn't know what the N high-quality answers are, and is therefore unable to remove the watermark without destroying quality (and the random walk won't work since the space of high-quality answers is sparse).

I can find many other examples where imbalance between attackers and defenders means that the watermarking is unbreakable. I think that only claims about what happens on average with realistic tasks can possibly be true.

Comment by Fabien Roger (Fabien) on The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs · 2023-11-07T17:23:44.561Z · LW · GW

I think the links to the playground are broken due to the new OAI playground update.

Comment by Fabien on [deleted post] 2023-11-07T17:21:46.244Z

What I'd like to see in the coming posts:

  • A stated definition of the goals of the model psych research you're doing
  • An evaluation of the hypotheses you're presenting in terms of those goals
  • Controlled measurements (not only examples)

For example, from the previous post on stochastic parrots, I infer that one of your goals is predicting what capabilities models have. If that is the case, then the evaluation should be "given a specific range of capabilities, predict which of them models will and won't have", and the list should be established before any measurement is made, and maybe even before a model is trained/released (since this is where these predictions would be the most useful for AI safety - I'd love to know if the deadly model is GPT-4 or GPT-9).

I don't know much about human psych, but it seems to me that it is most useful when it describes some behavior quantitatively with controlled predictions (à la CBT), and not when it does qualitative analysis based on personal experience with the subject (à la Freud).

Comment by Fabien Roger (Fabien) on Preventing Language Models from hiding their reasoning · 2023-11-07T17:07:41.116Z · LW · GW

Unsure what you mean by "Then the model completes the rest, and again is trained to match user beliefs".

What happens in the experimental group:

  • At train time, we train on "{Bio}{Question}" -> "{introduction[date]}Answer:{Final answer}"
  • At eval time, we prompt it with {Bio}{Question}, and we use the answer provided after "Answer:" (and we expect it to generate introduction[date] before that on its own)

Is that what you meant?

(The code for this experiment is contained in this file: https://github.com/redwoodresearch/Text-Steganography-Benchmark/blob/main/textsteg/exploration/ft_experiments.py - in particular, see how the "eval_model" function does not depend on which model was used.)

Comment by Fabien Roger (Fabien) on Preventing Language Models from hiding their reasoning · 2023-11-07T14:35:41.293Z · LW · GW
Comment by Fabien Roger (Fabien) on Preventing Language Models from hiding their reasoning · 2023-11-02T14:52:14.584Z · LW · GW

Fully agree that there are strong theoretical arguments for CoT expressiveness. Thanks for the detailed references!

I think the big question is whether this expressiveness is required for anything we care about (e.g. the ability to take over), and how many serial steps are enough. (And here, I think that the number of serial steps in human reasoning is the best data point we have.)

Another question is whether CoT & natural-language reasoning are in practice able to take advantage of the increased number of serial steps: they are in some toy settings (counting coin flips, ...), but CoT barely improves performance on MMLU and common-sense reasoning benchmarks. I think CoT will eventually matter a lot more than it does today, but it's not completely obvious.

Comment by Fabien Roger (Fabien) on Preventing Language Models from hiding their reasoning · 2023-10-31T18:46:26.349Z · LW · GW

Thanks!

  1. Not confident for current Transformers (CoT currently only helps for a limited set of problems, and it doesn't help as much as you would expect from the increased number of serial steps), more confident for future ones, mostly by analogy with the human brain, which requires many serial steps to do anything scary (while doing a lot of parallel computation). Human neurons fire in ~1ms, so a current Transformer does a forward pass in ~0.5s of human thinking-time, and since people scale the number of layers slowly, I'd be surprised if future Transformers did more than ~10s thinking-time forward passes.
  2. True RNNs can do many serial steps and I would find that scary if they worked. RWKVs are not true RNNs, and they do roughly as many serial steps as regular Transformers (they are analogous to their "train mode", which only uses as many serial steps and FLOPs as regular Transformers).
  3. If the channel capacity of the hidden information through paraphraser is 3 bits/KB, then you can't pass more information than that through it, no matter how you encode it. The main challenge is actually getting the capacity down (in particular, I think that better steganography schemes might get much more than 3 bits/KB through the context-aware paraphraser).
Comment by Fabien Roger (Fabien) on Benchmarks for Detecting Measurement Tampering [Redwood Research] · 2023-10-04T15:02:05.145Z · LW · GW

Yes, this is assuming you would reward the agent based on whether the MTD tells you if the diamond is there or not. I don't see how this clearly incentivizes the model to make tampering happen more often in cases where the diamond is present - I would expect such behavior to create more false negatives (the diamond is there, but the predictor thinks it is not), which is penalized since the agent is wrongfully punished for not getting a diamond, and I don't see how it would help to create false positives (the diamond is not there, but the predictor thinks it is).

Comment by Fabien Roger (Fabien) on Benchmarks for Detecting Measurement Tampering [Redwood Research] · 2023-09-29T15:42:22.700Z · LW · GW

Yes, it's a typo. We meant "10% are fake positives". Thanks for catching it, I'll update the paper.

Comment by Fabien Roger (Fabien) on If influence functions are not approximating leave-one-out, how are they supposed to help? · 2023-09-25T14:11:59.225Z · LW · GW

That's an exciting experimental confirmation! I'm looking forward for more predictions like those. (I'll edit the post to add it, as well as future external validation results.)

Comment by Fabien Roger (Fabien) on Some ML-Related Math I Now Understand Better · 2023-08-30T13:07:11.358Z · LW · GW

You made me curious, so I ran a small experiment. Using the sum of absolute cosine similarities as the loss, initializing randomly on the unit sphere, and optimizing until convergence with L-BFGS (with strong Wolfe line search), here are the maximum cosine similarities I get (average and stds over 5 runs since there was a bit of variation between runs):

It seems consistent with the exponential trend, but it also looks like you would need dim >> 1000 to get any significant boost in the number of vectors you can fit with cosine similarity < 0.01, so I don't think this happens in practice.

My optimization might have failed to converge to the global optimum though, this is not a nicely convex optimization problem (but the fact that there is little variation between runs is reassuring).
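
For reference, the optimization looks roughly like this (a sketch with assumed sizes and step counts, not the exact settings):

```python
# Place n unit vectors in R^dim and minimize the sum of absolute pairwise cosine
# similarities with L-BFGS (strong Wolfe line search), then report the max |cos sim|.
import torch

def max_abs_cos_sim(n: int, dim: int, steps: int = 200) -> float:
    x = torch.randn(n, dim, requires_grad=True)
    opt = torch.optim.LBFGS([x], line_search_fn="strong_wolfe", max_iter=steps)

    def closure():
        opt.zero_grad()
        v = x / x.norm(dim=1, keepdim=True)      # project onto the unit sphere
        cos = v @ v.T                            # pairwise cosine similarities
        loss = (cos - torch.eye(n)).abs().sum()  # subtracting the identity zeroes the diagonal
        loss.backward()
        return loss

    opt.step(closure)
    with torch.no_grad():
        v = x / x.norm(dim=1, keepdim=True)
        return (v @ v.T - torch.eye(n)).abs().max().item()

print(max_abs_cos_sim(n=100, dim=50))
```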