LessWrong 2.0 Reader
"and seek amateur advice"
well said!
sameisenstat on The consistent guessing problem is easier than the halting problem
Yeah, there's a sort of trick here. The natural question is uniform--we want a single reduction that can work from any consistent guessing oracle, and we think it would be cheating to do different things with different oracles. But this doesn't matter for the solution, since we produce a single consistent guessing oracle that can't be reduced to the halting problem.
This reminds me of the theory of enumeration degrees, a generalization of Turing degrees allowing open-set-flavoured behaviour like we see in partial recursive functions: if the answer to an oracle query is positive, the oracle must eventually tell you, but if the answer is negative it keeps you waiting indefinitely. I find the theory of enumeration degrees to be underemphasized in discussion of computability theory, but e.g. Odifreddi has a chapter on it all the way at the end of Classical Recursion Theory Volume II.
The consistent guessing problem isn't a problem about enumeration degrees. It uses a stronger kind of uniformity--we want to be uniform over oracles that make different consistent guesses, not merely over different ways of presenting the same answers. But there is again a kind of strangeness in the behaviour of uniformity, since we get equivalent notions whether or not we ask that a reduction between sets A, B be a single function that uniformly enumerates A from enumerations of B, so there might be some common idea here. More generally, enumeration degrees feel like they let us express more naturally things that are a bit awkward to say in terms of Turing degrees--it's natural to think about the set of computations that are enumerable/Σ1 in a set--so it might be a useful keyword.
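To make the "open-set-flavoured" query behaviour concrete, here is a minimal Python sketch (names are made up for illustration) of semi-deciding membership in an enumerated set: a positive answer arrives after finitely many steps, while a negative answer keeps the caller waiting forever.

```python
import itertools

def semi_decide(enumerate_set, x):
    """Membership query in the enumeration-oracle style: return True once x
    shows up in the enumeration; if x is not in the set, loop forever, since
    the oracle never answers 'no'."""
    for y in enumerate_set():
        if y == x:
            return True

# An illustrative enumerable set: the even numbers.
def evens():
    for n in itertools.count():
        yield 2 * n

print(semi_decide(evens, 10))  # True, after finitely many steps
# semi_decide(evens, 7)        # would never return: negatives wait indefinitely
```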
zach-stein-perlman on DeepMind's "Frontier Safety Framework" is weak and unambitious
Thanks.
Deployment mitigations level 2 discusses the need for mitigations on internal deployments.
Good point; this makes it clearer that "deployment" means external deployment by default. But level 2 only mentions "internal access of the critical capability," which sounds like it's about misuse — I'm more worried about AI scheming and escaping [LW · GW] when the lab uses AIs internally to do AI development.
ML R&D will require thinking about internal deployments (and so will many of the other CCLs).
OK. I hope DeepMind does that thinking and makes appropriate commitments.
two-party control
Thanks. I'm pretty ignorant on this topic.
"every 3 months of fine-tuning progress" was meant to capture [during deployment] as well
Yayyy!
bogdan-ionut-cirstea on Clarifying and predicting AGI
Some evidence in favor of the framework, from Advanced AI evaluations at AISI: May update:
Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not assessed on long-horizon tasks. We analysed failed attempts to understand the major impediments to success. On short-horizon tasks, models often made small errors (like syntax errors in code). On longer horizon tasks, models devised good initial plans but did not sufficiently test their solutions or failed to correct initial mistakes. Models also sometimes hallucinated constraints or the successful completion of subtasks.
Summary: We found that leading models could solve some short-horizon tasks, such as software engineering problems. However, no current models were able to tackle long-horizon tasks.
I'm particularly interested in what the framework might say about the order in which various capabilities that are prerequisites for automated AI safety R&D might appear, and how that ordering compares with the appearance of various dangerous capabilities.
rohinmshah on DeepMind's "Frontier Safety Framework" is weak and unambitious
Thanks for the detailed critique – I love that you actually read the document in detail. A few responses on particular points:
The document doesn't specify whether "deployment" includes internal deployment.
Unless otherwise stated, "deployment" to us means external deployment – because this is the way most AI researchers use the term. Deployment mitigations level 2 discusses the need for mitigations on internal deployments. ML R&D will require thinking about internal deployments (and so will many of the other CCLs).
Some people get unilateral access to weights until the top level. This is disappointing. It's been almost a year since Anthropic said it was implementing two-party control, where nobody can unilaterally access the weights.
I don't think Anthropic meant to claim that two-party control would achieve this property. I expect anyone using a cloud compute provider is trusting that the provider will not access the model, not securing it against such unauthorized access. (In principle some cryptographic schemes could allow you to secure model weights even from your cloud compute provider, but I highly doubt people are doing that, since it is very expensive.)
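To make "two-party control" concrete, here is a minimal sketch of a two-person rule for weights access (hypothetical names and policy, not Anthropic's or Google's actual implementation): no single person can authorize access on their own.

```python
# Hypothetical sketch of a two-person rule: no single person can authorize
# access to the weights on their own. All names are illustrative only.
AUTHORIZED = {"alice", "bob", "carol"}

def may_access_weights(approvals: set[str]) -> bool:
    """Grant access only if at least two distinct authorized people approve."""
    return len(approvals & AUTHORIZED) >= 2

assert not may_access_weights({"alice"})      # unilateral access is denied
assert may_access_weights({"alice", "bob"})   # two-party approval succeeds
```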
Mostly they discuss developers' access to the weights. This is disappointing. It's important but lots of other stuff is important too.
The emphasis on weights access isn’t meant to imply that other kinds of mitigations don’t matter. We focused on what it would take to increase our protection against exfiltration. A lot of the example measures from the RAND interim report aren’t discussed because we already do them. For example, Google already does the following from RAND Level 3: (a) develop an insider threat program and (b) deploy advanced red-teaming. (That’s not meant to be exhaustive; I don’t personally know the details here.)
No mention of evals during deployment (to account for improvements in scaffolding, prompting, etc.).
Sorry, that's just poor wording on our part -- "every 3 months of fine-tuning progress" was meant to capture that as well. Thanks for pointing this out!
Talking about plans like this is helpful. But with no commitments, DeepMind shouldn't get much credit.
With the FSF, we prefer to try it out for a while and iron out any issues, particularly since the science is in early stages, and best practices will need to evolve as we learn more. But as you say, we are running evals even without official FSF commitments, e.g. the Gemini 1.5 tech report has dangerous capability evaluation results (see Section 9.5.2).
Given recent updates in AGI safety overall, I'm happy that GDM and Google leadership take commitments seriously and think carefully about which ones they are and are not willing to make, including the FSF, the White House Commitments, etc.
The 20% of compute thing reminds me of this post from 2014:
I am Sam Altman, lead investor in reddit's new round and President of Y Combinator. AMA!
We're working on a way to give 10% of our shares from this round to the reddit community.
As far as I know, this didn't happen.
Though to be fair, Reddit is indeed doing the users-as-shareholders thing now, in 2024. But I guess it's unrelated to the plans from back then.
bec-hawk on Stephen Fowler's Shortform
So the argument is that Open Phil should only give large sums of money to (democratic) governments? That seems too overpowered for the OpenAI case.
eggsyntax on Language Models Model Us
Thanks! It was actually on my to-do list for this coming week to look for something like this for llama; it's great to have it come to me 😁
bec-hawk on Stephen Fowler's Shortform
In that case OP’s argument would be saying that donors shouldn’t give large sums of money to any sort of group of people, which is a much bolder claim.
eggsyntax on Language Models Model Us
Oh, absolutely! I interpreted 'which famous authors an unknown author is most similar to' not as being about 'which famous author is this unknown sample from' but rather as being about 'how can we characterize this non-famous author as a mixture of famous authors', e.g. 'John Doe, who isn't particularly expected to be in the training data, is approximately 30% Hemingway, 30% Steinbeck, 20% Scott Alexander, and a sprinkling of Proust'. And I think that problem is hard to test & score at scale. Looking back at the OP, both your and my readings seem plausible -- @jdp [LW · GW] would you care to disambiguate?
LLMs' ability to identify specific authors is also interesting and important; it's just not the problem I'm personally focused on, both because I expect that only a minority of people are sufficiently represented in the training data to be identifiable, and because there's already plenty of research out there on author identification, whereas the ability to model unknown users based solely on their conversation with an LLM seems both important and underexplored.
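For what it's worth, here is one way the 'mixture of famous authors' reading could be prototyped at small scale. This is a minimal sketch, not a method from this thread; the texts and function-word list are toy stand-ins for real corpora. It fits non-negative mixture weights of famous-author frequency profiles to the unknown author's profile.

```python
import numpy as np
from scipy.optimize import nnls

def word_freqs(text, vocab):
    """Frequency of each function word in `vocab`, normalized by text length."""
    words = text.lower().split()
    total = max(len(words), 1)
    return np.array([words.count(w) / total for w in vocab])

# Toy corpora -- stand-ins for real samples of each famous author.
vocab = ["the", "and", "of", "a", "in", "to", "it"]
famous = {
    "hemingway": "the sun rose and the man went to the river and it was good",
    "steinbeck": "in the valley of the long fields a man walked to the town",
}
unknown = "the man went to the valley and it was good in the morning"

# Columns are famous-author profiles; solve for non-negative mixture weights.
A = np.column_stack([word_freqs(t, vocab) for t in famous.values()])
b = word_freqs(unknown, vocab)
weights, _ = nnls(A, b)
weights /= weights.sum()  # normalize so the weights read as percentages

for name, w in zip(famous, weights):
    print(f"{name}: {w:.0%}")
```

Scoring such a decomposition at scale remains the hard part, since there is no ground-truth mixture to compare against.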