LessWrong 2.0 Reader

View: New · Old · Top

← previous page (newer posts) · next page (older posts) →

← previous page (newer posts) · next page (older posts) →

Recent comments

jacques-thibodeau on Shane Legg's necessary properties for every AGI Safety plan

Not that I know of, but I will at least consider periodically pinging him on X (if this post gets enough people’s attention).

https://x.com/jacquesthibs/status/1785704284434129386?s=46

ernestscribbler on ACX Covid Origins Post convinced readers

This is a very good point.

daniel-kokotajlo on Shane Legg's necessary properties for every AGI Safety plan

Did Shane leave a way for people who didn't attend the talk, such as myself, to contact him? Do you think he would welcome such a thing?

daniel-kokotajlo on Shane Legg's necessary properties for every AGI Safety plan

Thanks for sharing!

So, it seems like he is proposing something like AutoGPT except with a carefully thought-through prompt laying out what goals the system should pursue and constraints the system should obey. Probably he thinks it shouldn't just use a base model but should include instruction-tuning at least, and maybe fancier training methods as well.

This is what labs are already doing? This is the standard story for what kind of system will be built, right? Probably 80%+ of the alignment literature is responding at least in part to this proposal. Some problems:

How do we get a model that is genuinely robustly trying to obey the instruction text, instead of e.g. choosing actions on the basis of a bunch of shards of desire/drives that were historically reinforced, and which are not well-modelled as "it just tries to obey the instruction text?" Or e.g. trying to play the training game well? Or e.g. nonrobustly trying to obey the instruction text, such that in some future situation it might cease obeying and do something else? (See e.g. Ajeya Cotra's training game report)
Suppose we do get a model that is genuinely robustly trying to obey the instruction text. How do we ensure that its concepts are sufficiently similar to ours, that it interprets the instructions in the ways we would have wanted them to be interpreted? (This part is fairly easy I think but probably not trivial and at least conceptually deserves mention)
Suppose we solve both of the above problems, what exactly should the instruction text be? Normal computers 'obey' their 'instructions' (the code) perfectly, yet unintended catastrophic side-effects (bugs) are typical. Another analogy is to legal systems -- laws generally end up with loopholes, bugs, unintended consequences, etc... (As above, this part seems to me to be probably fairly easy but nontrivial.)

dmitrii-zelenskii-1 on D&D.Sci Evaluation and Ruleset

Damn. And I tried the strategy "what if I try to predict it only off the text, without looking at csv" :D

carl-feynman on Ironing Out the Squiggles

An interesting question! I looked in “Towards Deep Learning Models Resistant to Adversarial Attacks” to see what they had to say on the question. If I’m interpreting their Figure 6 correctly, there’s a negligible increase in error rate as epsilon increases, and then at some point the error rate starts swooping up toward 100%. The transition seems to be about where the perturbed images start to be able to fool humans. (Or perhaps slightly before.). So you can’t really blame the model for being fooled, in that case. If I had to pick an epsilon to train with, I would pick one just below the transition point, where robustness is maximized without getting into the crazy zone.

All this is the result of a cursory inspection of a couple of papers. There’s about a 30% chance I’ve misunderstood.

sweenesm on Shane Legg's necessary properties for every AGI Safety plan

I basically agree with Shane's take for any AGI that isn't trying to be deceptive with some hidden goal(s).

(Btw, I haven't seen anyone outline exactly how an AGI could gain it's own goals independently of goals given to it by humans - if anyone has ideas on this, please share. I'm not saying it won't happen, I'd just like a clear mechanism for it if someone has it. Note: I'm not talking here about instrumental goals such as power seeking.)

What I find a bit surprising is the relative lack of work that seems to be going on to solve condition 3: specification of ethics for an AGI to follow. I have a few ideas on why this may be the case:

Most engineers care about making things work in the real-world, but don't want the responsibility to do this for ethics because: 1) it's not their area of expertise, and 2) they'll likely take on major blame if they get things "wrong" (and it's almost guaranteed that someone won't like their system of ethics and say they got it "wrong")
Most philosophers haven't had to care much about making things work in the real-world, and don't seem excited about possibly having to make engineering-type compromises in their system of ethics to make it work
Most people who've studied philosophy at all probably don't think it's possible to come up with a consistent system of ethics to follow, or at least they don't think people will come up with it anytime soon, but hopefully an AGI might

Personally, I think we better have a consistent system of ethics for an AGI to follow ASAP because we'll likely be in significant trouble if malicious AGI come online and go on the offensive before we have at least one ethics-guided AGI to help defend us in a way that minimizes collateral damage.

monte-m on Mechanistically Eliciting Latent Behaviors in Language Models

Congrats Andrew and Alex! These results are really cool, both interesting and exciting.

buck on Introducing AI Lab Watch

One reason that I'm particularly excited for this: AI-x-risk-concerned people are often accused of supporting Anthropic over other labs for reasons that are related to social affiliation rather than substantive differences. I think these accusations have some merit--if you ask AI-x-risk-concerned people for exactly how Anthropic differs from e.g. OpenAI, they often turn out to have a pretty shallow understanding of the difference. This resource makes it easier for these people to have a firmer understanding of concrete differences.

I hope also that this project makes it easier for AI-x-risk-concerned people to better allocate their social pressure on labs.

oliver-daniels-koch on Benchmarks for Detecting Measurement Tampering [Redwood Research]

is individual measurement prediction AUROC a) or b)
a) mean(AUROC(sensor_i_pred, sensor_i))

b) AUROC(all(sensor_preds), all(sensors))

LessWrong 2.0 Reader

Archive

Recent comments