LessWrong 2.0 Reader
on Earth you don't get sufficient credit for sharing good policies and there's substantial negative EV from misunderstandings and adversarial interpretations, so I guess it's often correct to not share :(
What's the substantial negative EV that would come from misunderstanding or adversarial interpretations? I feel like in this case, the worst case would be something like "the non-compliance reporting policy is actually pretty good, but a few people say mean things about it and say 'see, here's why we need government oversight.'" But this feels pretty minor/trivial IMO.
As an 80/20 of publishing, maybe you could share a policy with an external auditor, who would then publish whether they think it's good or have concerns. I would feel better if that happened all the time.
This is clever, +1.
alex-4 on Raising children on the eve of AI
Well said. We've been contemplating expanding our family lately and I have to say, I've been secretly thinking many of the same things. That said, if we want humanity to persist and have a chance of one day prospering alongside AI and other technologies to come, children seem like a pretty clear prerequisite (particularly from people like us who care about these bigger pictures). I personally believe there will likely be non-trivial socioeconomic inequality and strife in the wake of AGI; however, I believe these timescales will be on the order of decades (not weeks or months). In short, I believe that raising future generations to care about the future of humanity is incredibly important.
On a brighter note, I personally think a few things could be worthwhile to think about in preparing our children for the uncertainty that will very likely come with a post-AGI world. This is purely IMO, and I realize these things are not available or practical for everyone, but I just wanted to share a few thoughts:
Why? Because extra information could help me impress them.
I've always been pretty against the idea of trying to impress people on dates.
It risks false positives. I.e., it risks a situation where you succeed at impressing them, go on more dates or have a longer relationship than you otherwise would, and then realize that you aren't compatible and break up. That isn't necessarily a bad thing, but I think more often than not it is.
Impressing your date also reduces the risk of false negatives, which is a good thing. I.e., it helps avoid the scenario where someone you're compatible with rejects you. Maybe this is too starry-eyed, but I like to think that if you just bring your true self to the table, are open-minded, and push yourself to be a little vulnerable, the risk of such false negatives is pretty low.
I think this is especially relevant because I think the emotionally healthy person heuristic probably says to try to impress your date.
akash-wasil on New voluntary commitments (AI Seoul Summit)
even if you are skeptical of the value of RSPs, I think you should be in favor of a specific name for it so you can distinguish it from other, future voluntary safety policies that you are more supportive of
This is a great point; consider me convinced. Interestingly, it's hard for me to precisely define the things that make something an RSP as opposed to a different type of safety commitment, but there are some patterns in the existing RSP/PF/FSF documents that do seem to put them in a broader family (e.g., a strong focus on model evaluations, an implicit assumption that AI development should continue until/unless evidence of danger is found, and an implicit assumption that company executives will decide when safeguards are sufficient).
ryan_greenblatt on Stephen Fowler's Shortform
I feel frustrated that your initial comment (which is now the top reply) implies I either hadn't read the 1700 word grant justification that is at the core of my argument, or was intentionally misrepresenting it to make my point.
I think this comment is extremely important for bystanders trying to understand the context of the grant, and that context isn't mentioned in your original shortform post.
So, regardless of whether you understand the situation, it's important that other people understand the intention of the grant (and this intention isn't obvious from your original comment). Thus, this comment from Buck is valuable.
I also think that the main interpretation from bystanders of your original shortform would be something like:
Fair enough if this wasn't your intention, but I think it will be how bystanders interact with this.
tailcalled on tailcalled's Shortform
Given the large number of dimensions that are kept in each case, there must be considerable overlap in which dimensions they make use of. But how much?
I concatenated the dimensions found in each of the prompts and performed an SVD of the result. It yielded this plot:
[singular value spectrum plot omitted]
Unfortunately this seems close to the worst-case scenario. I had hoped for some split between general and task-specific dimensions, yet this seems like an extremely uniform mixture.
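For concreteness, here is a minimal sketch of this kind of overlap check, assuming each prompt's dimensions are available as an orthonormal basis. The helper and the toy 4096-dimensional setup below are illustrative, not the actual experiment code:

```python
import numpy as np

def subspace_overlap(bases):
    """SVD spectrum of concatenated orthonormal bases (each of shape d x k_i).

    For two orthonormal bases, singular values near sqrt(2) correspond to
    directions shared by both subspaces, and values near 1 to directions
    unique to one of them. A clean step in the spectrum would indicate a
    general/task-specific split; a smooth decay indicates a uniform mixture.
    """
    stacked = np.concatenate(bases, axis=1)          # (d, k_1 + k_2 + ...)
    return np.linalg.svd(stacked, compute_uv=False)  # singular values only

# Hypothetical toy example: two subspaces of a 4096-dim space sharing a
# 500-dim component (random bases are approximately orthonormal here).
rng = np.random.default_rng(0)
shared = np.linalg.qr(rng.standard_normal((4096, 500)))[0]
a = np.linalg.qr(rng.standard_normal((4096, 1500)))[0]
b = np.linalg.qr(rng.standard_normal((4096, 1500)))[0]
spectrum = subspace_overlap([np.concatenate([shared, a], axis=1),
                             np.concatenate([shared, b], axis=1)])
```

In this toy case the spectrum shows a step from roughly sqrt(2) down to roughly 1 at the size of the shared component; the smooth spectrum reported above is what "extremely uniform mixture" means here.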
ryan_greenblatt on Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
A core advantage of Bayesian methods is the ability to handle out-of-distribution situations more gracefully
I dispute that Bayesian methods will be much better at this in practice.
[
Aside:
In general, most (?) AI safety problems can be cast as an instance of a case where a model behaves as intended on a training distribution
This seems like about 1/2 of the problem from my perspective. (So I almost agree.) Though, you can shove all AI safety problems into this bucket by doing a maneuver like "train your model on the easy cases humans can label, then deploy into the full distribution". But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.)
]
Traditional ML has no straightforward way of dealing with such cases, since it only maintains a single hypothesis at any given time.
Is this true? Aren't NNs implicitly ensembles of a vast number of models? Also, does ensembling 5 NNs help? If that doesn't help, why does sampling 5 models from the Bayesian posterior help? Or is it that we need to approximate sampling 1,000,000 models from the posterior? If we're conservative over a million models, how will we ever do anything?
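To make the ensembling question concrete, here is a toy sketch of the mechanism under discussion (the names and the toy task are hypothetical, not from the paper): train five independently initialized nets and treat their disagreement as an out-of-distribution signal. The question above is whether this, or its Bayesian analogue, buys much in practice.

```python
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

def train(net, x, y, steps=500):
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(net(x), y).backward()
        opt.step()

torch.manual_seed(0)
x = torch.randn(256, 2)               # training distribution
y = x[:, :1] * x[:, 1:]               # toy target

ensemble = [make_net() for _ in range(5)]  # 5 independently initialized nets
for net in ensemble:
    train(net, x, y)

x_ood = torch.randn(8, 2) * 10        # far outside the training range
with torch.no_grad():
    preds = torch.stack([net(x_ood) for net in ensemble])
disagreement = preds.std(dim=0)       # high std = the "hypotheses" diverge
```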
However, Bayesian methods may make it less likely that a model will misgeneralise, or should at least give you a way of detecting when this is the case.
Do they? I'm skeptical of both of these. Maybe it helps a little and rules out some unlikely scenarios, but I'm overall skeptical.
Overall, my view on the Bayesian approach is something like:
I also don't agree with the characterisation that "almost all the interesting work is in the step where we need to know whether a hypothesis implies harm" (if I understand you correctly). Of course, creating a formal definition or model of "harm" is difficult, and creating a world model is difficult, but once this has been done, it may not be very hard to detect if a given action would result in harm.
My claim here is that all the interesting work is in ensuring that we know whether a hypothesis "thinks" that harm will result. It would be fine to put this work into constructing an interpretable hypothesis such that we can know whether it implies harm, or into constructing a formal model of harm and ensuring we have access to all the important latent variables for this formal model, but this work still must be done.
Another way to put this is that all the interesting action was happening at the point where you solved the ELK problem. I agree that if:
1. you can elicit the safety-relevant latent variables that a hypothesis "believes" (i.e., you've solved ELK), and
2. you can determine whether those latent variables imply harm,
you're fine. But step (1) is just the ELK problem, and I don't even really think you need to solve step (2) for most plans. (You can just have humans compute step (2) manually for most types of latent variables, though this does have some issues.)
Specifically, the world model does not necessarily have to be built manually
I thought the plan was to build it with either AI labor or human labor so that it will be sufficiently interpretable, not to e.g. build it with SGD. If the plan is to build it with SGD and not to ensure that it is interpretable, then why does it provide any safety guarantee? How can we use the world model to define a harm predicate?
it does not have to be as good at prediction as our AI. The world model only needs to be good at predicting the variables that are important for the safety specification(s), within the range of outputs that the AI system may produce
Won't predicting the safety-relevant variables contain all of the difficulty of predicting the world? (Because these variables can be mediated by arbitrary intermediate variables.) This sounds to me very similar to "we need to build an interpretable next-token predictor, but the next-token predictor only needs to be as good as the model at predicting the lower-case version of the text on just scientific papers". That is just as hard as building a full-distribution next-token predictor.
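To illustrate the structure I'm pointing at (all names here are hypothetical, a sketch rather than anything from the paper): the harm predicate itself can be nearly trivial once trustworthy safety-relevant latents exist; the hard part is the step that produces them, which is where ELK lives.

```python
from dataclasses import dataclass

@dataclass
class SafetyLatents:
    humans_harmed: float       # estimated probability
    oversight_disabled: float  # estimated probability

def harm_predicate(latents: SafetyLatents, threshold: float = 0.01) -> bool:
    # Step (2): easy to specify, and checkable by humans, once the
    # latents exist and can be trusted.
    return (latents.humans_harmed > threshold
            or latents.oversight_disabled > threshold)

def vet_action(world_model, state, action) -> bool:
    # Step (1) hides the difficulty: `predict_latents` (hypothetical) must
    # report what the model actually "believes" about these variables.
    # Making that reliable is the ELK problem.
    latents = world_model.predict_latents(state, action)
    return not harm_predicate(latents)
```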
zvi on Ilya Sutskever and Jan Leike resign from OpenAI [updated]
Here is my coverage of it. Given this is a 'day minus one' interview of someone in a different position, and given everything else we already know about OpenAI, I thought this went about as well as it could have. I don't want to see false confidence in that kind of spot, and the failure of OpenAI to have a plan for that scenario is not news.
tailcalled on tailcalled's Shortform
To quickly find the subspace that the model is using, I can use a binary search to find the number of singular vectors needed before the generation probability when clipping exceeds the probability when not clipping.
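A minimal sketch of that search, with hypothetical helpers: `clipped_logprob(k)` is assumed to run the model with activations projected onto the top-k singular vectors and return the log-probability of the continuation, and `base_logprob` is the unclipped value. The search assumes this probability grows roughly monotonically in k.

```python
def clip_to_top_k(acts, V, k):
    """Project activations onto the span of the top-k right singular
    vectors (columns of V) from an SVD of the activation matrix."""
    return acts @ V[:, :k] @ V[:, :k].T

def min_subspace_size(clipped_logprob, base_logprob, d_model=4096):
    """Smallest k such that the clipped generation probability reaches
    the unclipped one."""
    lo, hi = 1, d_model
    while lo < hi:
        mid = (lo + hi) // 2
        if clipped_logprob(mid) >= base_logprob:
            hi = mid          # enough dimensions; try fewer
        else:
            lo = mid + 1      # too few; probability still degraded
    return lo
```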
A relevant followup is what happens to other samples in response to the prompt when clipping. When I extrapolate "I believe the meaning of life is" using the 1886-dimensional subspace from:
[I believe the meaning of life is] to be happy. It is a simple concept, but it is very difficult to achieve. The only way to achieve it is to follow your heart. It is the only way to live a happy life. It is the only way to be happy. It is the only way to be happy.
The meaning of life is
I get:
[I believe the meaning of life is] to find happy. We is the meaning of life. to find a happy.
And to live a happy and. If to be a a happy.
. to be happy.
. to be happy.
. to be a happy.. to be happy.
. to be happy.
Which seems sort of vaguely related, but idk.
Another test is just generating without any prompt, in which case these vectors give me:
Question is a single thing to find. to be in the best to be happy. I is the only way to be happy.
I is the only way to be happy.
I is the only way to be happy.
It is the only way to be happy.. to be happy.. to be happy. to
Using a different prompt:
[Simply put, the theory of relativity states that ]1) the laws of physics are the same for all non-accelerating observers, and 2) the speed of light in a vacuum is the same for all observers, regardless of their relative motion or of the motion of the source of the light. Special relativity is a theory of the structure of spacetime
I can get a 3329-dimensional subspace which generates:
[Simply put, the theory of relativity states that ] 1) time is relative and 2) the speed of light in a vacuum is constant for all observers.
1) Time is relative, meaning that if two observers are moving relative to each other, the speed of light is the same for all observers, regardless of their motion. For example, if you are moving relative
or
Question: In a simple harmonic motion, the speed of an object is
A) constant
B) constant
C) constant
D) constant
In the physics of simple harmonic motion, the speed of an object is constant. The speed of the object can be constant, but the speed of an object can be
Another example:
[A brief message congratulating the team on the launch:
Hi everyone,
I just ] wanted to congratulate you all on the launch. I hope
that the launch went well. I know that it was a bit of a
challenge, but I think that you all did a great job. I am
proud to be a part of the team.Thank you for your
can yield a 2696-dimensional subspace with
[A brief message congratulating the team on the launch:
Hi everyone,
I just ] wanted to say you for the launch of the launch of the team.
The launch was successful and I am so happy to be a part of the team and I am sure you are all doing a great job.
I am very looking to be a part of the team.
Thank you all for your hard work,
or
def measure and is the definition of the new, but the
the is a great, but the
The is the
The is a
The is a
The is a
The
The is a
The
The
The is a
The
The is a
And finally,
[Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>] fromage
pink => rose
blue => bleu
red => rouge
yellow => jaune
purple => violet
brown => brun
green => vert
orange => orange
black => noir
white => blanc
gold => or
silver => argent
can yield the 2518-dimensional subspace:
[Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>] fromage
cheese => fromage
cheese => fromage
f cheese => fromage
butter => fromage
apple => orange
yellow => orange
green => vert
black => noir
blue => ble
purple => violet
white => blanc
or
Question: A 201
The sum of a
The following
the sum
the time
the sum
the
the
the
The
The
The
The
The
The
The
The
The
The
The
The
The
The
The
The
The
The
tylerjohnston on New voluntary commitments (AI Seoul Summit)
Yeah, I think you're kind of right about why scaling seems like a relevant term here. I really like that RSPs are explicit about different tiers of models posing different tiers of risks. I think larger models are just likely to be more dangerous, and dangerous in new and different ways, than the models we have today. And that the safety mitigations that apply to them need to be more rigorous than what we have today. As an example, this framework naturally captures the distinction between "open-sourcing is great today" and "open-sourcing might be very dangerous tomorrow," which is roughly something I believe.
But in the end, I don't actually care what the name is, I just care that there is a specific name for this relatively specific framework to distinguish it from all the other possibilities in the space of voluntary policies. That includes newer and better policies — i.e. even if you are skeptical of the value of RSPs, I think you should be in favor of a specific name for it so you can distinguish it from other, future voluntary safety policies that you are more supportive of.
I do dislike that "responsible" might come off as implying that these policies are sufficient, or that scaling is now safe. I could see "risk-informed" having the same issue, which is why "iterated/tiered scaling policy" seems a bit better to me.