LessWrong 2.0 Reader
I agree in principle that labs have the responsibility to dispel myths about what they're committed to. OTOH, in defense of the labs I imagine that this can be hard to do while you're in the middle of negotiations with various AISIs about what those commitments should look like.
dusandnesic on Deep Honesty
This sounds like a case of the Rule of Equal and Opposite Advice: https://slatestarcodex.com/2014/03/24/should-you-reverse-any-advice-you-hear/ I'm sure that for some people more honesty would be harmful, but it does sound like the caveats here make it clear when not to use it. I agree more with the questions Tsvi raises in the other thread than with "this is awful advice". I can imagine that you are a person for whom more honesty is bad, although if you followed the caveats above it would, imo, be quite rare to get it wrong. I think the authors do a good job of outlining many cases where it goes wrong.
martinsq on DanielFilan's Shortform Feed
Interesting, but I'm not sure how successful the counterexample is. After all, if your terminal goal in the whole environment was truly for your side to win, then it makes sense to understand anything short of letting Shin play as a shortcoming of your optimization (with respect to that goal). Of course, even in the case where that's your true goal and you're committing a mistake (which is not common), we might want to say that you are deploying a lot of optimization, with respect to the different goal of "winning by yourself", or "having fun", which is compatible with failing at another goal.
This could be taken to absurd extremes [LW · GW] (whatever you're doing, I can understand you as optimizing really hard for doing exactly what you're doing), but the natural way around that is to require your imputed goals to be simple (in some background language or ontology, like that of humans). This is exactly the approach Vanessa took mathematically in the past (the equation at 3:50 here).
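For concreteness, here is a schematic version of that idea (my rendering, not the actual equation from the talk; the baseline policy, the trade-off weight β, and the complexity measure K are stand-ins for whatever one picks):

```latex
% Schematic only, not Vanessa's actual equation.
% Impute to policy \pi the goal U that best trades off
% "how much \pi optimizes U" against U's description complexity K(U):
U^*(\pi) = \operatorname*{arg\,max}_{U}
  \Big( \mathbb{E}_\mu\!\left[ U \mid \pi \right]
      - \mathbb{E}_\mu\!\left[ U \mid \pi_{\mathrm{rand}} \right]
      - \beta \, K(U) \Big)
% \pi_{rand}: a random baseline policy; \mu: the environment;
% \beta > 0 trades optimization power against goal simplicity.
```

Under such a criterion, the absurd goal "do exactly what you actually did" scores maximally on optimization but pays a huge complexity penalty, while a very simple goal must actually be optimized well to be imputed.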
I think this "goal relativism" is fundamentally correct. The only problem with Vanessa's approach is that it's hard to account for the agent being mistaken (for example, you not knowing Shin is behind you).[1]
I think the only natural way to account for this is to see things from the agent's native ontology (or compute probabilities according to their prior), however we might extract those from them. So we're unavoidably back at the problem of ontology identification (which I do think is the core problem).
Say Alice has lived her whole life in a room with a single button. People from the outside told her pressing the button would create nice paintings. Throughout her life, they provided an exhaustive array of proofs and confirmations of this fact. Unbeknownst to her, this was all an elaborate scheme, and in reality pressing the button destroys nice paintings. Alice, liking paintings, regularly presses the button.
A naive application of Vanessa's criterion would impute to Alice the goal of destroying paintings. To avoid this, we somehow need to integrate over all possible worlds Alice can find herself in, and realize that, when you are presented with an exhaustive array of proofs and confirmations that the button creates paintings, it is on average more likely for the button to create paintings than to destroy them.
But we face a decision. Either we fix a prior to do this that we will use for all agents, in which case all agents with a different prior will look silly to us. Or we somehow try to extract the agent's prior, and we're back at ontology identification.
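To make the "integrate over all possible worlds" step concrete, here is a toy sketch (mine; all the numbers are made up purely for illustration) of the Bayes computation a goal-imputer would run on Alice's evidence:

```python
# Toy illustration: under a reasonable prior over worlds, an exhaustive array
# of proofs that the button creates paintings is much more likely in worlds
# where it really does create them than in elaborate-scheme worlds.

def posterior_creates(prior_creates: float,
                      p_proofs_if_creates: float,
                      p_proofs_if_destroys: float) -> float:
    """P(button creates paintings | the proofs Alice saw), by Bayes' rule."""
    joint_creates = prior_creates * p_proofs_if_creates
    joint_destroys = (1.0 - prior_creates) * p_proofs_if_destroys
    return joint_creates / (joint_creates + joint_destroys)

p = posterior_creates(prior_creates=0.5,         # made-up prior
                      p_proofs_if_creates=0.9,   # honest proofs are common
                      p_proofs_if_destroys=0.01) # elaborate schemes are rare
print(f"P(creates | proofs) = {p:.3f}")  # ~0.989

# Pressing the button is therefore expected to create paintings given Alice's
# evidence, so we impute to her the goal "create paintings" -- even though in
# her actual (unlucky) world, pressing destroys them.
```

The decision above is about which prior and likelihoods go into that computation: ours, fixed once for all agents, or the agent's own, which we would first have to extract.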
(Disclaimer: This was SOTA understanding a year ago, unsure if it still is now.)
Thanks for bringing up the comparison points of radical honesty and explicit honesty. It does seem like deep honesty is in between the two.
But the characterization of deep honesty that you've posited doesn't feel very respectful? It leaves space to patronizingly share things the listener doesn't want to hear, because you've determined that they're relevant. Our notion of deep honesty is closer to being grounded in a notion of respect, perhaps something like "being completely honest about information you perceive that the receiver would want, regardless of whether the information has explicitly been requested". Sometimes that could involve leaving some trailheads, or testing the waters, to ascertain whether the person does in fact want the information.
As to "when should this apply", it's maybe something like "when you're trying to cooperate with the other party". Of course there's still room for this to go wrong (in the first example you link it seems like the person was trying to cooperate with their boss, who didn't reciprocate), but it does seem like a pretty important safety valve compared to radical honesty.
I’m an adult from the UK and learnt the word "faucet" like last year.
richard_kennaway on Introducing AI Lab Watch
AI Corporation Watch | AI Mega-Corp Watch | AI Company Watch | AI Industry Watch | AI Firm Watch | AI Behemoth Watch | AI Colossus Watch | AI Juggernaut Watch | AI Future Watch
These are either tendentious ("Juggernaut") or unnecessarily specific to the present moment ("Mega-Corp").
How about simply "AI Watch"?
review-bot on 6 non-obvious mental health issues specific to AI safety
The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year. Will this post make the top fifty?
cubefox on Mati_Roy's Shortform
I guess for a cat classifier disentanglement is not possible, because it wants to classify things as cats if and only if it believes they are cats. Since values and beliefs are perfectly correlated here, there is no test we could perform which would distinguish what it wants from what it believes.
Though we could assume we don't know what the classifier wants. If it doesn't classify a cat image as "yes", it could be because it is (say) actually a dog classifier, and it correctly believes the image contains something other than a dog. Or it could be because it is indeed a cat classifier, but it mistakenly believes the image doesn't show a cat.
One way to find out would be to give the classifier an image of the same subject, but in higher resolution or from another angle, and check whether it changes its classification to "yes". If it is a cat classifier, it is unlikely to make the same mistake again, so it will probably change its classification to "yes". If it is a dog classifier, it will likely stick with "no".
This assumes that mistakes are random and somewhat unlikely, so they will probably disappear when the evidence is better or of a different sort. Beliefs react to changes in evidence of that sort, while values don't.
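A sketch of that test, assuming a hypothetical `classify` interface (image in, yes/no out; the interface and names are mine, not from the comment):

```python
from typing import Callable

def looks_like_a_value_difference(classify: Callable[[str], bool],
                                  low_quality_image: str,
                                  high_quality_image: str) -> bool:
    """Given a 'no' on a low-quality image, return True if the 'no' persists
    on better evidence of the same subject (suggesting different values, e.g.
    a dog classifier), or False if it flips to 'yes' (suggesting a cat
    classifier with a mistaken belief)."""
    assert not classify(low_quality_image), "the test starts from a 'no'"
    # Random mistakes should wash out under better evidence; values shouldn't.
    return not classify(high_quality_image)
```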
arisalexis on How to be an amateur polyglot
Yes, I didn't even know the difference :) I thought "tap" was only for pub beer! Totally disconnected from the exams, where you only dealt with essays.
the-gears-to-ascension on some thoughts on LessOnline
"I’m only a year old rationalist"
you write really eloquently for your age! and being in uni! wow. I was still learning to walk. kids are so precocious these days
⸮