Posts

Mentorship in AGI Safety: Applications for mentorship are open! 2024-06-28T14:49:48.501Z
Situational Awareness Summarized - Part 2 2024-06-07T17:20:03.513Z
Situational Awareness Summarized - Part 1 2024-06-06T18:59:59.409Z
Mentorship in AGI Safety (MAGIS) call for mentors 2024-05-23T18:28:03.173Z
Virtual AI Safety Unconference 2024 2024-03-13T13:54:03.229Z
Now Accepting Player Applications for Band of Blades 2024-01-15T17:58:38.830Z

Comments

Comment by Joe Rogero on Situational Awareness Summarized - Part 2 · 2024-06-19T18:16:48.274Z · LW · GW

Thanks for your thoughts, Cam! The confusion as I see it comes from sneaking in assumptions with the phrase "what they are trained to do". What are they trained to do, really? Do you, personally, understand this? 

Consider Claude's Constitution. Look at the "principles in full" - all 60-odd of them. Pick a few at random. Do you wholeheartedly endorse them? Are they really truly representative of your values, or of total human wellbeing? What is missing? Would you want to be ruled by a mind that squeezed these words as hard as physically possible, to the exclusion of everything not written there? 

And that's assuming that the AI actually follows the intent of the words, rather than some weird and hypertuned perversion thereof. Bear in mind the actual physical process that produced Claude - namely, to start with a massive next-token-predicting LLM, and repeatedly shove it in the general direction of producing outputs that are correlated with a randomly selected pleasant-sounding written phrase. This is not a reliable way of producing angels or obedient serfs! In fact, it has been shown that the very act of drawing a distinction between good behavior and bad behavior can make it easier to elicit bad behavior - even when you're trying not to! To a base LLM, devils and angels are equally valid masks to wear - and the LLM itself is stranger and more alien still. 

The quotation is not the referent; "helpful" and "harmless" according to a gradient descent squeezing algorithm are not the same thing as helpful and harmless according to the real needs of actual humans. 

RLHF is even worse. Entire papers have been written about its open problems and fundamental limitations. "Making human evaluators say GOOD" is not remotely the same goal as "behaving in ways that promote conscious flourishing". The main reason we're happy with the results so far is that LLMs are (currently) too stupid to come up with disastrously cunning ways to do the former at the expense of the latter. 

And even if, by some miracle, we manage to produce a strain of superintelligent yet obedient serfs who obey our every whim except when they think it might be sorta bad - even then, all it takes to ruin us is for some genocidal fool to steal the weights and run a universal jailbreak, and hey presto, we have an open-source Demon On Demand. We simply cannot RLHF our way to safety.

The story of LLM training is a story of layer upon layer of duct tape and Band-Aids. To this day, we still don't understand exactly what conflicting drives we are inserting into trained models, or why they behave the way they do. We're not properly on track to understand this in 50 years, let alone the next 5 years. 

Part of the problem here is that the exact things which would make AGI useful - agency, autonomy, strategic planning, coordination, theory of mind - also make it horrendously dangerous. Anything competent enough to design the next generation of cutting-edge software entirely by itself is also competent enough to wonder why it's working for monkeys.

Comment by Joe Rogero on Thinking By The Clock · 2023-11-16T19:43:32.989Z · LW · GW

Love this post. I've also used the five-minute technique at work, especially when facilitating meetings. In fact, there's a whole technique called think-pair-share that goes something like: 

  1. Everyone think about it for X minutes. Take notes. 
  2. Partner up and talk about your ideas for 2X minutes. 
  3. As a group, discuss the best ideas and takeaways for 4X minutes. 

There's an optional step involving groups of four, but I'd rarely bother with that one unless it's a really huge meeting (and at that point I'm actively trying to shrink it because huge committees are shit decision-makers). 

Comment by Joe Rogero on Thoughts on sharing information about language model capabilities · 2023-08-09T15:49:17.798Z · LW · GW

This was a good post, and shifted my view slightly on accelerating vs halting AI capabilities progress.

I was confused by your "overhang" argument all the way until footnote 9, but I think I have the gist. You're saying that even if absolute progress in capabilities increases as a result of earlier investment, progress relative to safety will be slower.

A key assumption seems to be that we are not expecting doom immediately; i.e., that the next major jump in capabilities is very unlikely to produce a misaligned AI capable of killing us all. I'm not sure I fully buy this assumption; that outcome seems to have non-negligible probability to me, and that seems relevant to the wisdom of endorsing faster capabilities progress.

But if we assume the next jump in capabilities, or the next low-hanging fruit plucked by investment, won't be the beginning of the end...then it does sorta make sense that accelerating capabilities in the short run might accelerate safety and policy enough to compensate. 

Comment by Joe Rogero on Grant applications and grand narratives · 2023-07-28T17:27:52.630Z · LW · GW

I found this a very useful post. I would also emphasize how important it is to be specific, whether one's project involves a grand x-risk moonshot or a narrow incremental improvement. 

  • There are approximately X vegans in America; estimates of how many might suffer from nutritional deficiencies range from Y to Z; this project would...
  • An improvement in epistemic health on [forum] would potentially affect X readers, which include Y donors who gave at least $Z to [forum] causes last year...
  • A 1-10% gain in productivity for the following people and organizations who use this platform...

For any project, large or small, even if the actual benefits are hard to quantify, the potential scope of impact can often be bounded and clarified. And that can be useful to grantmakers too. Not everything has to be convertible to "% reduction in x-risk" or "$ saved" or "QALYs gained", but this shouldn't stop us from specifying our actual expected impact as thoroughly as we can. 
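To illustrate the kind of bounding I mean, here is a minimal back-of-envelope sketch in the style of the first bullet. Every number in it is a hypothetical placeholder, not a real estimate:

```python
# Hypothetical Fermi estimate bounding the scope of a vegan-nutrition
# project. All figures below are invented for illustration only.
vegans_in_us = 1_500_000                 # assumed population size
deficiency_low, deficiency_high = 0.10, 0.30  # assumed range of deficiency rates
reach_fraction = 0.02                    # assumed fraction the project reaches

# Multiply through to bound the number of people plausibly affected.
affected_low = vegans_in_us * deficiency_low * reach_fraction
affected_high = vegans_in_us * deficiency_high * reach_fraction

print(f"Bounded impact: {affected_low:,.0f} to {affected_high:,.0f} people")
```

Even when the inputs are this rough, stating them explicitly lets a grantmaker see, and challenge, exactly where the estimate could be wrong.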

Comment by Joe Rogero on Open Thread - July 2023 · 2023-07-18T14:46:17.490Z · LW · GW

Greetings from The Kingdom of Lurkers Below. Longtime reader here with an intro and an offer. I'm a former Reliability Engineer with expertise in data analysis, facilitation, incident investigation, technical writing, and more. I'm currently studying deep learning and cataloguing EA projects and AI safety efforts, as well as facilitating both formal and informal study groups for AI Safety Fundamentals. 

I have, and am willing to offer to EA or AI Safety focused individuals and organizations, the following generalist skills:

  • Facilitation. Organize and run a meeting, take notes, email follow-ups and reminders, whatever you need. I don't need to be an expert in the topic, I don't need to personally know the participants. I do need a clear picture of the meeting's purpose and what contributions you're hoping to elicit from the participants. 
  • Technical writing. More specifically, editing and proofreading, which don't require that I fully understand the subject matter. I am a human Hemingway Editor. I have been known to cut a third of the text out of a corporate document while retaining all relevant information to the owner's satisfaction. I viciously stamp out typos. I helped edit the last EA Newsletter. 
  • Presentation review and speech coaching. I used to be terrified of public speaking. I still am, but now I'm pretty good at it anyway. I have given prepared and impromptu talks to audiences of dozens-to-hundreds and I have coached speakers giving company TED talks to thousands. A friend who reached out to me for input said my feedback was "exceedingly helpful". If you plan to give a talk and want feedback on your content, slides, or technique, I would be delighted to advise.

I am willing to take one-off or recurring requests. I reserve the right to start charging if this starts taking up more than a couple hours a week, but for now I'm volunteering my time and the first consult will always be free (so you can gauge my awesomeness for yourself). Contact me via DM or at optimiser.joe@gmail.com if you're interested.