LessWrong 2.0 Reader
I aspire to a kind of honesty that's similar to what's described here. I thought maybe this post was going overboard, but then it kept including caveats that feel similar to the caveats and specifics I go for.
One thing I might add or rephrase:
I think doing a good job with honesty, and having it be actually helpful, includes having a bunch of related soft social skills.
Sometimes the truth hurts people (which might in turn hurt you). One attitude here is "whelp, then either I must not care as much about truth as I thought" (because you aren't willing to inflict or take on that hurt). But another attitude is "learn the goddamn communication skills to present important truths in a way that hurts less."
(while you're still gaining those skills, one solution is various flavors of meta-honesty, which you touch on here. i.e. be clear to people 'hey, I won't directly lie, and I will try to tell you useful, unbiased info, but I won't always go out of my way to do so'. Another is to be like 'nope, I'mma be deeply honest all the times even when I'm too clumsy to do it without causing harm', which comes with upsides and downsides)
There are soft skills for communicating truths to others without hurting them (i.e. "tact"), and there are also soft skills for absorbing information that might otherwise have hurt you (i.e. "thick skin"). Both seem worth investing in, if you want a world with more honesty in it.
wassname on We are headed into an extreme compute overhang
I think this only holds if fine-tunes are composable, which as far as I can tell they aren't.
Anecdotally, a lot of people are using mergekit to combine fine-tunes.
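The kind of composition people attempt with mergekit can be sketched as a weighted average of parameter tensors. This is a toy stand-in for a "linear" weight merge: the parameter names and values below are hypothetical, and real merges operate on full model state dicts of tensors rather than plain floats.

```python
def linear_merge(state_dicts, weights):
    """Weighted average of several models' parameters.

    Toy sketch of a 'linear' weight merge (as in tools like mergekit);
    plain floats stand in for real parameter tensors.
    """
    assert state_dicts and len(state_dicts) == len(weights)
    total = sum(weights)
    return {
        name: sum(w * sd[name] for sd, w in zip(state_dicts, weights)) / total
        for name in state_dicts[0]
    }

# Two hypothetical fine-tunes of the same base model.
ft_a = {"layer.weight": 1.0, "layer.bias": 0.0}
ft_b = {"layer.weight": 3.0, "layer.bias": 2.0}

merged = linear_merge([ft_a, ft_b], weights=[0.5, 0.5])
# merged == {"layer.weight": 2.0, "layer.bias": 1.0}
```

Whether the merged model behaves like a composition of the two fine-tunes (rather than just an average that loses both skills) is exactly the empirical question at issue.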
ghostwheel-1 on How do top AI labs vet architecture/algorithm changes?
Thanks for this answer! Interesting. It sounds like the process may be less systematized than how I imagined it to be.
nevin-wetherill on How do top AI labs vet architecture/algorithm changes?
Thanks! It's no problem :)
Agreed that the interview is worth watching in full for those interested in the topic. I don't think it answers your question in full detail, unless I've forgotten something they said - but it is evidence.
(Edit: Dwarkesh also posts full transcripts of his interviews to his website. They aren't obviously machine-transcribed or anything, more like what you'd expect from a transcribed interview in a news publication. You'll lose some body language/tone details from the video interview, but may be worth it for some people, since most can probably read the whole thing in less time than just watching the interview at normal speed.)
raemon on Raemon's Shortform
I've recently updated on how useful it'd be to have small icons representing users. Previously some people were like "it'll help me scan the comment section for people!" and I was like "...yeah that seems true, but I'm scared of this site feeling like facebook, or worse, LinkedIn."
I'm not sure whether that was the right tradeoff, but, I was recently sold after realizing how space-efficient it is for showing lots of commenters. Like, in slack or facebook, you'll see things like:
This'd be really helpful, esp. in the Quick Takes and Popular comments sections, where you can see which people you know/like have commented on a thing.
ghostwheel-1 on How do top AI labs vet architecture/algorithm changes?
Dwarkesh's interview with Sholto sounds well worth watching in full, but the segments you've highlighted and your analyses are very helpful on their own. Thanks for the time and thought you put into this comment!
algon on Deep Honesty
John Carmack is a famously honest man. To illustrate this, I'll give you two stories. When Carmack was a kid, he desperately wanted the Macs in his school's computer lab. So he and a buddy tried to steal some. They got caught because Carmack's friend was too fat to get through the window. Carmack went to juvie. When the counselor asked whether he would do it again if he knew he wouldn't get caught, Carmack answered yes to the counterfactual.
Later, when working as a young developer, Carmack and his fellow employees would take the company workstations home to code games over the weekend. Their boss eventually noticed this and wondered if they were borrowing company property without permission. He quickly hit on a foolproof plan to catch them: just ask Carmack because he cannot tell a lie. Carmack said yes.
These stories aren't really a response to your point. I just find them to be hilarious examples of the inability to lie. They're also an existence proof of someone being unable to lie but still doing very well.
broken link
oliver-daniels-koch on Oliver Daniels-Koch's Shortform
Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head.)
eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle.
weak to strong learning is an experimental setup for evaluating scalable oversight protocols, and is a class of sandwiching experiments.
weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.
measurement tampering detection is a class of weak to strong generalization problems, where the "weak" supervision consists of multiple measurements which are sufficient for supervision in the absence of "tampering" (where tampering is not yet formally defined)
mechanistic anomaly detection is an approach to ELK, where examples are flagged as anomalous if they cause the model to do things for "different reasons" than on a trusted dataset, where "different reasons" are defined w.r.t. internal model cognition and structure.
mechanistic anomaly detection methods that work for ELK should also probably work for other problems (such as backdoor detection and adversarial example detection)
so when developing benchmarks for mechanistic anomaly detection, we want to test methods against standard machine learning security problems (adversarial examples and trojans) that have similar structure to scalable oversight problems, against other ELK approaches (e.g. CCS), and against other scalable oversight approaches (e.g. debate)
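The weak-to-strong generalization setup above can be illustrated with a deliberately tiny toy. Everything here is hypothetical: a 1-D classification task, a weak supervisor that systematically mislabels some points, and a "strong" learner whose inductive bias (a single decision threshold) lets it generalize past its supervisor's errors.

```python
def true_label(x):
    # Ground truth the experimenter knows but the learner never sees.
    return x >= 5.0

def weak_label(x):
    # Weak supervisor: correct most of the time, but systematically
    # flips every 7th grid point's label.
    i = round(x * 10)
    return (not true_label(x)) if i % 7 == 3 else true_label(x)

class ThresholdModel:
    """'Strong' model whose inductive bias is a single decision threshold."""

    def fit(self, xs, labels):
        # Pick the threshold that best fits the (noisy) weak labels.
        self.threshold = max(
            sorted(xs),
            key=lambda t: sum((x > t) == y for x, y in zip(xs, labels)),
        )
        return self

    def predict(self, x):
        return x > self.threshold

xs = [i / 10 for i in range(100)]
weak = [weak_label(x) for x in xs]
model = ThresholdModel().fit(xs, weak)

weak_acc = sum(w == true_label(x) for x, w in zip(xs, weak)) / len(xs)
strong_acc = sum(model.predict(x) == true_label(x) for x in xs) / len(xs)
# weak_acc == 0.86, strong_acc == 1.0: the strong model's inductive bias
# averages out the supervisor's scattered errors instead of imitating them.
```

The analogy is loose (real W2SG experiments fine-tune large models on weak labels), but it captures the claim in the definition above: generalization beyond the weak signal comes from the strong learner's inductive biases, not from better labels.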
zach-stein-perlman on Questions for labs
Yay @Zac Hatfield-Dodds [LW · GW] of Anthropic for feedback and corrections, including clarifying a couple of Anthropic's policies. Two pieces of not-previously-public information:
I think it's cool that Zac replied (but most of my questions for Anthropic remain).
I have not yet received substantive corrections/clarifications from any other labs.
(I have not yet updated my site to reflect Zac's feedback; hopefully I will soon.)