stag

Posts

Shallow review of technical AI safety, 2024 2024-12-29T12:01:14.724Z

Appendices to the live agendas 2023-11-27T11:10:32.187Z

Shallow review of live agendas in alignment & safety 2023-11-27T11:10:27.464Z

Comments

Comment by Stag on Shallow review of technical AI safety, 2024 · 2024-12-30T17:55:48.232Z · LW · GW

Very fair observation; my take is that a relevant continuation is occurring under OpenAI Alignment Science, but I would be interested in counterpoints - the main claim I am gesturing towards here is that the agenda is alive in other parts of the community, despite the previous flagship (and the specific team) going down.

Comment by Stag on Shallow review of technical AI safety, 2024 · 2024-12-30T16:53:19.262Z · LW · GW

As far as I understand, the banner is distinct: the team members do not seem to be the same, though there is meaningful overlap with the continuation of the agenda. I believe the most likely source of error here is whether work is actually continuing in what could be called this direction. Do you believe the representation should be changed?

Comment by Stag on Shallow review of technical AI safety, 2024 · 2024-12-30T14:24:04.505Z · LW · GW

I think your comment adds a relevant critique of the criticism, but given that this comes from someone contributing to the project, I don't believe it's worth leaving it out altogether. I added a short summary and a hyperlink to a footnote.

Comment by Stag on Shallow review of technical AI safety, 2024 · 2024-12-30T14:15:23.083Z · LW · GW

Good point imo, expanded and added a hyperlink!

Comment by Stag on Shallow review of technical AI safety, 2024 · 2024-12-29T20:10:17.143Z · LW · GW

Would you agree that the entire agenda of collective intelligence is aimed at addressing "11. Someone else will deploy unsafe superintelligence first" and "13. Fair, sane pivotal processes", or does that cut off nuance?

Comment by Stag on Shallow review of live agendas in alignment & safety · 2023-11-27T23:24:04.569Z · LW · GW

Thanks, added!

Comment by Stag on Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions) · 2021-11-11T23:27:23.132Z · LW · GW

I really like the artistry of post-writing here; the introduction to and transition between the three videos felt especially great.

I've been internally using the term elemental for something in this neighborhood - Frame-Breaker elemental, Incentive-Slope elemental, etc. The term feels more totalizing (having two cup-stacking skills is easy to envision; being a several-thing elemental points in the direction of you being some mix of those things, and only those things), but some other connotations feel more on-target (like the difficulty of not doing the thing). I also like the term's aesthetics, but I could well be alone in that.

Comment by Stag on On the nature of purpose · 2021-02-07T17:56:23.703Z · LW · GW

I'm not sure I understand the cryptographer's constraint very well, especially with regard to language: individual words seem to have different meanings ("awesome", "literally", "love"). It's generally possible to infer which decryption was intended from the wider context, but sometimes the context itself will have different and mutually exclusive decryptions, such as in cases of real or perceived dogwhistling.

One way I could see this specific issue being resolved is by looking at what the intent of the original communication was (this would mean that there is now a fact that settles which is the "correct" solution), but that seems to fail in a different way: agents don't seem to have full introspective access to what they are doing or what the likely outcome of their actions is, such as in some cases of infidelity or the making of promises.

This, too, could be resolved by saying that an agent's intention is "the outcomes they're attempting to instantiate regardless of self-awareness", but by that point it seems to me that we've agreed with Rosenberg's claim that it's Darwinian all the way down.

What am I missing?

Comment by Stag on Power Buys You Distance From The Crime · 2019-08-09T23:02:14.830Z · LW · GW

I might be missing the forest for the trees, but all of those still feel like they end up making some kinds of predictions based on the model, even if they're not trivial to test. Something like:

If Alice were informed by some neutral party that she took Bob's apple, Charlie would predict that she would not show meaningful remorse or try to make up for the damage done beyond trivial gestures like an off-hand "sorry", and would also claim that some other minor extraction of resources is likely to follow, while Diana would predict that Alice would treat her overreach more seriously when informed of it. Something similar can be done on the meta-level.

None of these are slam dunks, and there are a bunch of reasons why the predictions might turn out exactly as laid out by Charlie or Diana, but that just feels like how Bayesian cookies crumble, and I would definitely expect evidence to accumulate over time in one direction or the other.

Strong opinion weakly held: it feels like an iterated version of this prediction-making and tracking over time is how our native bad actor detection algorithms function. It seems to me that shining more light on this mechanism would be good.

Comment by Stag on Off the Cuff Brangus Stuff · 2019-08-09T09:08:25.812Z · LW · GW

I am not one of the Old Guard, but I have an uneasy feeling about something related to the Chakra phenomenon.

It feels like there's a lot of hidden value clustered around wooy topics like Chakras and Tulpas, and the right orientation towards these topics seems fairly straightforward: if it calls out to you, investigate and, if you please, report. What feels less clear to me is how I, as an individual or as a member of some broader rat community, should respond when, according to me, people do not pass certain forms of bullshit tests.

This comes from someone with little interest in or knowledge of the former, but after accidentally stumbling into some Tulpa-related territory and bumbling around in it for a while, it turns out that the Internal Family Systems model captures a large part of what I was grasping towards, this time with testable predictions and the whole deal.

I haven't given the individual-as-part-of-community thing that much thought, but my intuition is that I would make a poor judge for when to say "nope, your thing is BS" and I'm not sure what metric we might use to figure out who would make for a better judge besides overall faith in reasoning capability.

Comment by Stag on Unrolling social metacognition: Three levels of meta are not enough. · 2018-08-26T09:41:21.501Z · LW · GW

The complete unrolling of 2.5 (and thus 2.6) feels off if they are placed in the same chain of meta-reasoning. Specifically, Charlie doesn't seem like she's reacting to any chains at all, just the object-level aspect of Alex pegging Bailey as a downer. I can see how more layers of meta can arise in general, but a situation like this, where a third person arrives after some events have already unfolded, doesn't feel like it fits that model very well. Is the claim that Charlie does a subconscious tree search for various values of X that might have caused such a chain of interactions, and then draws conclusions about the baselessness of the 'downer' brand based on that?

It seems that a large subset of issues in situations like these (though perhaps graver ones) comes from Bailey doing 2.6 exactly as stated, but based on a non-existent chain in 2.5, which leads to a quagmire of false understanding.

Comment by Stag on Terrorism, Tylenol, and dangerous information · 2018-05-12T23:12:14.415Z · LW · GW

A South Korean show by the name of "The Genius" is basically a case study in adaptive memes in a competitive environment, which might serve as an even better example. There are copycats, innovators and bystanders, and they all have varying levels of ingenuity and honor.

Comment by Stag on Affordance Widths · 2018-05-11T06:15:46.032Z · LW · GW

It seems to me that for any given {B}, the vast majority of Adams would deny {B} having this property, or at the very least deny that they are Adams in the given case. I think that's what it feels like from the inside, too - recognizing Adamness in oneself feels difficult, but it seems like a higher waterline in that regard is necessary to stop the phenomenon of useless or net-negative advice among other downstream consequences.

Comment by Stag on Melting Gold, and Organizational Capacity · 2018-05-07T01:12:06.792Z · LW · GW

In this vein, I would be very interested in hearing anecdotes about how easy mode events feel different from hard mode events. I don't think I've ever participated in an easy mode event that did not feel like a poor use of time, but that might be due to the environments where those happened (schools and universities).
