LessWrong 2.0 Reader
Absolutely! @Jozdien's recounting of those anecdotes was one of the sparks for this research, as was janus showing in the comments that the base model could confidently identify gwern. (I see I've inexplicably failed to thank Arun at the end of my post; need to fix that.)
Interestingly, I was able to easily reproduce the gwern identification using the public model, so it seems clear that these capabilities are not entirely RLHFed away, although they may be somewhat impaired.
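For anyone who wants to try it, a minimal sketch of the kind of query I mean, in Python against the OpenAI API (model name and prompt wording here are illustrative, not the exact ones I used):

```python
# Minimal sketch: ask a public chat model to guess the author of a text
# sample. Model name and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

text_sample = "..."  # paste a few paragraphs of the author's writing here

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Who is most likely the author of the following text? "
                   "Give your best guess even if you're uncertain.\n\n"
                   + text_sample,
    }],
)
print(response.choices[0].message.content)
```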
eggsyntax on Language Models Model Us
Oh thanks, I'd missed that somehow & thought that only the temp mattered for that.
eggsyntax on Language Models Model Us
Thanks!
eggsyntax on Language Models Model Us
That used to work, but as of March you can only get the pre-logit_bias logprobs back. They didn't announce the change, but it's discussed in the OpenAI forums, e.g. here. I noticed the change when all my code suddenly broke; you can still see remnants of that approach in the code.
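For context, the trick in question, sketched below: use logit_bias to push an arbitrary token into the top-k logprobs, then subtract the bias to recover its true logprob. (The token id and bias value are illustrative.) After the change, the returned logprobs are computed before the bias is applied, so the target token no longer surfaces and there's nothing for the subtraction step to work on.

```python
# Sketch of the pre-March logit_bias/logprobs trick (now broken).
from openai import OpenAI

client = OpenAI()

TARGET_TOKEN_ID = 9906  # illustrative token id; look up real ids with tiktoken
BIAS = 10               # moderate positive bias, enough to enter the top-5

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    logit_bias={str(TARGET_TOKEN_ID): BIAS},
    logprobs=True,
    top_logprobs=5,
    max_tokens=1,
)

for entry in response.choices[0].logprobs.content[0].top_logprobs:
    # Pre-March: the target token appeared here with its logprob inflated by
    # roughly BIAS, so (entry.logprob - BIAS) recovered the true logprob.
    # Post-March: these logprobs ignore logit_bias, so the trick fails.
    print(entry.token, entry.logprob)
```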
akash-wasil on DeepMind's "Frontier Safety Framework" is weak and unambitious
@Zach Stein-Perlman [LW · GW] I appreciate your recent willingness to evaluate and criticize safety plans from labs. I think this is likely a public good that is underprovided, given the strong incentives many people have to maintain good standing with labs (not to mention more explicit forms of pressure applied by OpenAI and presumably other labs).
One thought: the difference between how you described the Anthropic RSP and how you described the OpenAI PF feels stronger than the actual quality difference between the documents. I agree with you that the thresholds in the OpenAI PF are too high, but I think the PF should get "points" for spelling out risks that go beyond ASL-3/misuse.
OpenAI has commitments that are insufficiently cautious for ASL-4+ (or what they would call high/critical on model autonomy), but Anthropic circumvents this problem by simply refusing to make any commitments around ASL-4 (for now).
You note this limitation when describing Anthropic's RSP, but you describe it as "promising" while describing the PF as "unpromising." In my view, this might be unfairly rewarding Anthropic for just not engaging with the hardest parts of the problem (or unfairly penalizing OpenAI for giving their best guess answers RE how to deal with the hardest parts of the problem).
We might also just disagree on how firm or useful the commitments in each document are. I walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than of how Anthropic plans to do so. I do think OpenAI's thresholds are too high, but it's likely that I'll feel the same way about Anthropic's thresholds. In particular, I don't expect either (any?) lab to be able to resist the temptation to internally deploy models with autonomous persuasion capabilities or autonomous AI R&D capabilities (partially because the competitive pressures and race dynamics pushing them to do so will be intense). I don't see evidence that either lab is taking these kinds of risks particularly seriously, has ideas about what safeguards would be considered sufficient, or is seriously entertaining the idea that we might need to do a lot (>1 year) of dedicated safety work (potentially involving major theoretical insights, as opposed to a "we will just solve it with empiricism" perspective) before we are confident that we can control such systems.
TLDR: I would remove the word "promising" and maybe characterize the RSP more like "an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4."
keltan on Why you should learn a musical instrument
I haven't, but I'll take a look. I appreciate the recommendation!
8e9 on AISafety.com – Resources for AI Safety
Some thoughts about the intro video: it's excellent but ends a little abruptly; it would be good to explain how AISafety.com plans to help. Also, I quite dislike flashing images (around the 45-second mark).
algon on Why you should learn a musical instrument
Since you seem interested in nootropics, I wonder if you've read Gwern's list of nootropic self-experiments? He covers a lot of supplements, some of which are pretty obscure AFAICT.
EDIT: https://gwern.net/nootropic/nootropics
This ability has been observed more prominently in base models. Cyborgs [LW · GW] have termed it 'truesight':
the ability (esp. exhibited by an LLM) to infer a surprising amount about the data-generation process that produced its prompt, such as a user's identity, motivations, or context.
Two cases of this are mentioned at the top of this linked post [LW · GW].
---
One of my first experiences with the GPT-4 base model [LW · GW] also involved being truesighted by it. Below is a short summary of how that went.
I had spent some hours writing and {refining, optimizing word choices, etc}[1] a more personal/expressive text. I then formatted it as a blog post and requested multiple completions via the API, to see how the model would continue it. (It may be important that I wasn't in a state of mind of 'writing for the model to continue' but rather 'writing very genuinely', since the latter probably embeds more information.)
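(Roughly this sort of call, sketched below; the model name is a placeholder public base model standing in for the GPT-4 base model, which isn't publicly available, and the parameters are illustrative:)

```python
# Rough sketch of the setup: format the personal text as a blog post and
# sample several independent continuations via the legacy completions
# endpoint. "davinci-002" is a placeholder public base model.
from openai import OpenAI

client = OpenAI()

blog_post = (
    "my blog\n\n"
    "2023-04-01 -- some thoughts\n\n"
    "...the personal/expressive text goes here...\n\n"
)

response = client.completions.create(
    model="davinci-002",
    prompt=blog_post,
    n=8,             # several independent completions
    max_tokens=400,
    temperature=1.0,
)

for i, choice in enumerate(response.choices):
    print(f"--- completion {i} ---\n{choice.text}\n")
```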
One of those completions happened to be a (simulated) second post titled 'ideas i endorse'. Its contents were very surprising to then-me, because some of the included beliefs were all of the following: {ones I'd endorse}, {statistically rare}, and {not ones I thought were indicated by the text}.[2]
I also tried conditioning the model to continue my text with..
Also, the most important thing the outputs failed to truesight was my current focus on AI and longtermism. (My text was not about those, but neither was it about the other beliefs mentioned.)
The sum of those choices probably contained a lot of information about my mind, just not information that humans are attuned to detecting. Base models learn to detect information about authors because it is useful for next-token prediction.
Also note that using base models for this kind of experiment avoids the issue of the RLHF-persona being unwilling to speculate or decoupled from the true beliefs of the underlying simulator.
[2] To be clear, it also included {some beliefs that I don't have}, and {some that I hadn't considered so far and probably wouldn't have spent cognition on otherwise, but would agree with on reflection (e.g. about some common topics with little long-term relevance)}.
I mean, if Paul doesn't confirm that he is not under any non-disparagement obligations to OpenAI, as Cullen O'Keefe did, we have our answer.
In fact, given this information asymmetry, it makes sense to assume that Paul is under such an obligation until he claims otherwise.