LessWrong 2.0 Reader
Not that I know of, but I will at least consider periodically pinging him on X (if this post gets enough people’s attention).
ernestscribbler on ACX Covid Origins Post convinced readers
This is a very good point.
daniel-kokotajlo on Shane Legg's necessary properties for every AGI Safety plan
Did Shane leave a way for people who didn't attend the talk, such as myself, to contact him? Do you think he would welcome such a thing?
daniel-kokotajlo on Shane Legg's necessary properties for every AGI Safety plan
Thanks for sharing!
So, it seems like he is proposing something like AutoGPT except with a carefully thought-through prompt laying out what goals the system should pursue and constraints the system should obey. Probably he thinks it shouldn't just use a base model but should include instruction-tuning at least, and maybe fancier training methods as well.
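For concreteness, a minimal sketch of that kind of scaffold as I understand it (my own illustration, not anything Shane specified): an instruction-tuned chat model called in a loop, anchored by an explicit goal-and-constraint prompt. The prompt text and call_model are hypothetical placeholders.

# Sketch of an AutoGPT-style scaffold with an explicit goal/constraint prompt.
# All names and prompt text here are illustrative assumptions.
SYSTEM_PROMPT = (
    "You are an agent pursuing the user's stated goal.\n"
    "Constraints you must obey at all times:\n"
    "- Do not take irreversible actions without explicit user confirmation.\n"
    "- If a constraint conflicts with the goal, follow the constraint.\n"
)

def call_model(messages):
    """Hypothetical call to an instruction-tuned chat model; returns the next action as text."""
    raise NotImplementedError

def run_agent(task, max_steps=10):
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    transcript = []
    for _ in range(max_steps):
        action = call_model(messages)
        transcript.append(action)
        if action.strip().startswith("DONE"):
            break
        # A real scaffold would execute the action (tool call, web query, etc.)
        # and append the observation; here we only record the exchange.
        messages.append({"role": "assistant", "content": action})
        messages.append({"role": "user", "content": "Observation: " + action})
    return transcript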
This is what labs are already doing? This is the standard story for what kind of system will be built, right? Probably 80%+ of the alignment literature is responding at least in part to this proposal. Some problems:
Damn. And I tried the strategy "what if I try to predict it only off the text, without looking at csv" :D
carl-feynman on Ironing Out the Squiggles
An interesting question! I looked in “Towards Deep Learning Models Resistant to Adversarial Attacks” to see what they had to say on the question. If I’m interpreting their Figure 6 correctly, there’s a negligible increase in error rate as epsilon increases, and then at some point the error rate starts swooping up toward 100%. The transition seems to be about where the perturbed images start to be able to fool humans (or perhaps slightly before). So you can’t really blame the model for being fooled in that case. If I had to pick an epsilon to train with, I would pick one just below the transition point, where robustness is maximized without getting into the crazy zone.
All this is the result of a cursory inspection of a couple of papers. There’s about a 30% chance I’ve misunderstood.
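To make the epsilon-picking concrete, here is a rough sketch of how one might sweep the L-infinity budget and look for that transition point, using a standard PGD attack in PyTorch. This is my reconstruction of the usual setup, not the paper's code, and the default step size and iteration count are assumptions.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha=0.01, steps=40):
    """L-infinity PGD: take signed-gradient steps, projecting back into the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()

@torch.no_grad()
def error_rate(model, x, y):
    return (model(x).argmax(dim=1) != y).float().mean().item()

def sweep_epsilon(model, x, y, eps_values):
    """Adversarial error at each eps; the 'transition' is where this swoops up toward 100%."""
    return {eps: error_rate(model, pgd_attack(model, x, y, eps), y) for eps in eps_values}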
sweenesm on Shane Legg's necessary properties for every AGI Safety plan
I basically agree with Shane's take for any AGI that isn't trying to be deceptive with some hidden goal(s).
(Btw, I haven't seen anyone outline exactly how an AGI could gain its own goals independently of goals given to it by humans - if anyone has ideas on this, please share. I'm not saying it won't happen, I'd just like a clear mechanism for it if someone has one. Note: I'm not talking here about instrumental goals such as power seeking.)
What I find a bit surprising is the relative lack of work that seems to be going on to solve condition 3: specification of ethics for an AGI to follow. I have a few ideas on why this may be the case:
Personally, I think we'd better have a consistent system of ethics for an AGI to follow ASAP, because we'll likely be in significant trouble if malicious AGIs come online and go on the offensive before we have at least one ethics-guided AGI to help defend us in a way that minimizes collateral damage.
monte-m on Mechanistically Eliciting Latent Behaviors in Language Models
Congrats Andrew and Alex! These results are really cool, both interesting and exciting.
buck on Introducing AI Lab Watch
One reason that I'm particularly excited for this: AI-x-risk-concerned people are often accused of supporting Anthropic over other labs for reasons that are related to social affiliation rather than substantive differences. I think these accusations have some merit--if you ask AI-x-risk-concerned people for exactly how Anthropic differs from e.g. OpenAI, they often turn out to have a pretty shallow understanding of the difference. This resource makes it easier for these people to have a firmer understanding of concrete differences.
I hope also that this project makes it easier for AI-x-risk-concerned people to better allocate their social pressure on labs.
oliver-daniels-koch on Benchmarks for Detecting Measurement Tampering [Redwood Research]
Is individual measurement prediction AUROC (a) or (b)?
a) mean(AUROC(sensor_i_pred, sensor_i))
b) AUROC(all(sensor_preds), all(sensors))
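For disambiguation, a small sketch of the two readings, assuming per-example sensor predictions are probabilities and ground-truth sensor readings are binary; the aggregation in (b) (taking the min of the per-sensor predictions) is my assumption about how all(sensor_preds) would be scored.

import numpy as np
from sklearn.metrics import roc_auc_score

# sensor_preds: (n_examples, n_sensors) predicted probability each sensor reads positive
# sensors:      (n_examples, n_sensors) binary ground-truth sensor readings

def auroc_a(sensor_preds, sensors):
    """(a) Per-sensor AUROC, averaged over sensors."""
    per_sensor = [roc_auc_score(sensors[:, i], sensor_preds[:, i])
                  for i in range(sensors.shape[1])]
    return float(np.mean(per_sensor))

def auroc_b(sensor_preds, sensors):
    """(b) AUROC for predicting that all sensors read positive at once."""
    all_pred = sensor_preds.min(axis=1)   # assumed aggregation; could also be a product
    all_true = sensors.all(axis=1).astype(int)
    return float(roc_auc_score(all_true, all_pred))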