dirk's Shortform

post by dirk (abandon) · 2024-04-26T11:12:27.125Z · LW · GW · 26 comments

Comments sorted by top scores.

comment by dirk (abandon) · 2024-06-21T15:52:22.381Z · LW(p) · GW(p)

NSFW text and images are not dangerous, and avoiding them is not relevant to the field of AI safety.

Replies from: quetzal_rainbow, keltan
comment by quetzal_rainbow · 2024-06-22T07:48:08.308Z · LW(p) · GW(p)

I mean, yeah, but it is a nice test field. If you can't remove the ability to generate porn, you are very unlikely to remove the ability to kill you.

comment by keltan · 2024-06-22T07:37:22.881Z · LW(p) · GW(p)

By “NSFW” do you mean pornography? Or are you also including gore?

comment by dirk (abandon) · 2024-09-03T20:26:15.581Z · LW(p) · GW(p)

I'm confused about SAE feature descriptions. In both Anthropic's and Google's demos, there are a lot of descriptions that don't seem to match a naked-eye reading of the top activations. (E.g., "Slurs targeting sexual orientation" also has a number of racial slurs in its top activations; the top activations for "Korean text, Chinese name yunfan, Unicode characters" are almost all the word "fused" in a metal-related context; etc.) I'm not sure whether these short names are the automated Claude descriptions or whether there are longer, more accurate descriptions somewhere; and if they are the automated descriptions, I'm not sure whether there's some reason to think they're more accurate than they look, whether it doesn't matter if they're slightly off, or whether it's some third thing.

Replies from: neel-nanda-1, fred-zhang
comment by Neel Nanda (neel-nanda-1) · 2024-09-04T13:17:08.851Z · LW(p) · GW(p)

These are LLM-generated labels; there are no "real" labels (because those are expensive!). Especially in our demo: Neuronpedia made them with GPT-3.5, which is kinda dumb.

I mostly think they're much better than nothing, but they shouldn't be trusted, and I'm glad our demo makes this apparent to people! I'm excited about work to improve autointerp, though unfortunately the easiest way is to use a better model, which gets expensive.

comment by Fred Zhang (fred-zhang) · 2024-09-05T04:21:10.443Z · LW(p) · GW(p)

One can think of this as a case where auto-interp exhibits a precision-recall trade-off. At one extreme, you can generate super broad annotations like "all English text" to capture a lot, which would be overkill; at the other end, you can generate very specific ones like "Slurs targeting sexual orientation", which risks mislabeling, say, racial slurs.

Section 4.3 of the OpenAI SAE paper also discusses this point.
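
A minimal sketch of that framing, treating a proposed label as a classifier over examples and scoring it against where the feature actually fires (toy data and a hypothetical helper, not taken from the paper or from any real SAE):

```python
# Toy illustration of the precision/recall framing for auto-interp labels.
# All data below is made up; the helper is hypothetical, not from the paper.

def label_precision_recall(matches_label, feature_fires):
    """matches_label[i]: does example i match the proposed label?
    feature_fires[i]: does the feature actually activate strongly on example i?"""
    true_pos = sum(m and f for m, f in zip(matches_label, feature_fires))
    precision = true_pos / sum(matches_label) if any(matches_label) else 0.0
    recall = true_pos / sum(feature_fires) if any(feature_fires) else 0.0
    return precision, recall

# The feature fires on the first four of eight examples.
fires = [True, True, True, True, False, False, False, False]

# A broad label ("all English text") claims every example: high recall, low precision.
broad = [True] * 8
# A narrow label claims only two of the firing examples: high precision, low recall.
narrow = [True, True, False, False, False, False, False, False]

print(label_precision_recall(broad, fires))   # (0.5, 1.0)
print(label_precision_recall(narrow, fires))  # (1.0, 0.5)
```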

comment by dirk (abandon) · 2024-04-26T13:38:28.210Z · LW(p) · GW(p)

Sometimes a vague phrasing is not an inaccurate demarcation of a more precise concept, but an accurate demarcation of an imprecise concept.

Replies from: cubefox, dkornai
comment by cubefox · 2024-04-26T17:24:19.506Z · LW(p) · GW(p)

Yeah. It's possible to give quite accurate definitions of some vague concepts, because the words used in such definitions also express vague concepts. E.g. "cygnet" - "a young swan".

comment by dkornai · 2024-04-26T15:32:37.212Z · LW(p) · GW(p)

I would say that if a concept is imprecise, more words [but good and precise words] have to be dedicated to faithfully representing the diffuse nature of the topic. If this larger faithful representation is compressed down to fewer words, that can lead to vague phrasing. I would therefore often view vague phrasing as a compression artefact, rather than a necessary outcome of translating certain types of concepts into words.

comment by dirk (abandon) · 2024-07-17T18:15:46.961Z · LW(p) · GW(p)

Proposal: a react for 'took feedback well' or similar, to socially reward people for being receptive to criticism

Replies from: Vladimir_Nesov
comment by Vladimir_Nesov · 2024-07-17T19:29:34.771Z · LW(p) · GW(p)

Something that sounds patronizing is not a social reward. It's not necessarily possible to formulate it in a way that avoids this problem without doing something significantly indirect. Right now this is upvoting for unspecified reasons.

Replies from: Benito, abandon
comment by Ben Pace (Benito) · 2024-07-17T22:59:38.883Z · LW(p) · GW(p)

Something like "I respect this comment"?

(Pretty close to upvote IMO, though it would be deanonymized)

comment by dirk (abandon) · 2024-07-17T19:35:13.653Z · LW(p) · GW(p)

I agree that such a react shouldn't be named in a patronizing fashion, but I didn't suggest doing so.

comment by dirk (abandon) · 2024-04-26T12:34:23.977Z · LW(p) · GW(p)

I'm against intuitive terminology [epistemic status: 60%] because it creates the illusion of transparency; opaque terms make it clear you're missing something, but if you already have an intuitive definition that differs from the author's it's easy to substitute yours in without realizing you've misunderstood.

Replies from: cubefox, CstineSublime
comment by cubefox · 2024-04-26T20:38:44.232Z · LW(p) · GW(p)

I agree. This is unfortunately often done in various fields of research where familiar terms are reused as technical terms.

For example, in ordinary language "organic" means "of biological origin", while in chemistry "organic" describes a type of carbon compound. Those two definitions mostly coincide on Earth (most such compounds are of biological origin), but when astronomers announce they have found "organic" material on an asteroid this leads to confusion.

Replies from: Mitchell_Porter
comment by Mitchell_Porter · 2024-04-27T01:29:40.617Z · LW(p) · GW(p)

Also astronomers: anything heavier than helium is a "metal"

comment by CstineSublime · 2024-04-30T00:38:55.285Z · LW(p) · GW(p)

How often is signalling a high degree of precision, without the reader understanding the meaning of the term, more important than conveying an imprecise but broadly-within-the-subject-matter understanding of the content?

comment by dirk (abandon) · 2024-04-26T12:14:17.157Z · LW(p) · GW(p)

I'm not alexithymic; I directly experience my emotions and additionally have introspective access to my preferences. However, some things manifest directly as preferences which, I have been shocked to realize in my old age, were in fact emotions all along. (In rare cases these are even stronger than the directly-felt ones, despite reliably seeming on initial inspection to be simply neutral metadata.)

Replies from: Viliam
comment by Viliam · 2024-04-26T18:24:14.014Z · LW(p) · GW(p)

Specific examples would be nice. Not sure if I understand correctly, but I imagine something like this:

You always choose A over B. You have been doing it for such a long time that you forgot why. Without reflecting on this directly, it just seems like there is probably a rational reason or something. But recently, either by accident or by experiment, you chose B... and realized that experiencing B (or expecting to experience B) creates unpleasant emotions. So now you know that the emotions were the real cause of choosing A over B all that time.

(This is probably wrong, but hey, people say that the best way to elicit an answer is to provide a wrong one.)

Replies from: jam_brand
comment by jam_brand · 2024-04-27T02:23:21.762Z · LW(p) · GW(p)

Here's an example for you: I used to turn the faucet on while going to the bathroom, thinking it was due simply to a preference for somewhat masking the sound of my elimination habits from my housemates. Then one day I walked into the bathroom listening to something-or-other via earphones and forgot to turn the faucet on, only to realize about halfway through that apparently I didn't actually much care about such masking; previously, being able to hear myself just seemed to trigger some minor anxiety I'd failed to recognize. Its absence, though, was quite recognizable: no aural self-perception, no further problem (except for a brief bit of disorientation from the mental whiplash of being suddenly confronted with the reality that, in a small way, I wasn't actually quite the person I thought I was), not even now on the rare occasion that I do end up thinking about such things mid-elimination anyway.

comment by dirk (abandon) · 2024-04-26T11:12:27.222Z · LW(p) · GW(p)

Classic type of argument-gone-wrong (also, IMO, a way autistic 'hyperliteralism' or 'over-concreteness' can look in practice, though I expect that isn't always what's behind it): Ashton makes a meta-level point X based on Birch's point Y about object-level subject matter Z. Ashton thinks the topic of conversation is Y, and Z is only relevant as the jumping-off point that sparked it; Birch wanted to discuss Z and sees X as relevant only insofar as it pertains to Z. Birch explains that X is incorrect with respect to Z; Ashton, frustrated, reiterates that Y is incorrect with respect to X. This can proceed for quite some time, with each feeling as though the other has dragged a sensible discussion onto their irrelevant pet issue; Ashton sees Birch's continual returns to Z as a gotcha distracting from the meta-level topic XY, whilst Birch in turn sees Ashton's focus on the meta-level point as sophistry to avoid addressing the object-level topic YZ. It feels almost exactly the same to be on either side of this, so misunderstandings like this are difficult to detect or resolve while involved in one.

Replies from: abandon
comment by dirk (abandon) · 2024-04-26T11:37:40.697Z · LW(p) · GW(p)

Meta/object level is one possible mixup but it doesn't need to be that. Alternative example, is/ought: Cedar objects to thing Y. Dusk explains that it happens because Z. Cedar reiterates that it shouldn't happen, Dusk clarifies that in fact it is the natural outcome of Z, and we're off once more.

comment by dirk (abandon) · 2024-05-25T05:57:37.314Z · LW(p) · GW(p)

Necessary-but-not-sufficient condition for a convincing demonstration of LLM consciousness: a prompt which does not allude to LLMs, consciousness, or selfhood in any way.

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2024-05-25T16:58:49.034Z · LW(p) · GW(p)

I'm always much more interested in "conditional on an LLM being conscious, what would we be able to infer about what it's like to be it?" than the process of establishing the basic fact. This is related to me thinking there's a straightforward thing-it's-like-to-be a dog, duck, plant, light bulb, bacteria, internet router, fire, etc... if it interacts, then there's a subjective experience of the interaction in the interacting physical elements. Panpsychism of hard problem, compute dependence of easy problem. If one already holds this belief, then no LLM-specific evidence is needed to establish hard problem, and understanding the flavor of the easy problem is the interesting part.

Replies from: abandon
comment by dirk (abandon) · 2024-05-25T17:50:59.700Z · LW(p) · GW(p)

You also would not be able to infer anything about its experience because the text it outputs is controlled by the prompt.

Replies from: lahwran
comment by the gears to ascension (lahwran) · 2024-05-25T17:51:42.223Z · LW(p) · GW(p)

I am now convinced. In order to investigate, one must have some way besides prompts to do it. Something to do with the Golden Gate Bridge, perhaps? Seems like more stuff like that could be promising. Since I'm starting from the assumption that it's likely, I'd want to check their consent first.