Posts

Can efficiency-adjustable reporting thresholds close a loophole in Biden’s executive order on AI? 2024-06-11T20:56:16.167Z
How do top AI labs vet architecture/algorithm changes? 2024-05-08T16:47:07.664Z
“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes? 2023-03-29T15:56:39.355Z
How might we make better use of AI capabilities research for alignment purposes? 2022-08-31T04:19:33.239Z

Comments

Comment by Jemal Young (ghostwheel) on The Sun is big, but superintelligences will not spare Earth a little sunlight · 2024-09-25T18:15:28.891Z · LW · GW

You only set aside occasional low-value fragments for national parks, mostly for your own pleasure and convenience, when it didn't cost too much?

Earth as a proportion of the solar system's planetary mass is probably comparable to national parks as a proportion of the Earth's land, if not lower.

Maybe I've misunderstood your point, but if it's that humanity's willingness to preserve a fraction of Earth for national parks is a reason for hopefulness that ASI may be willing to preserve an even smaller fraction of the solar system (namely, Earth) for humanity, I think this is addressed here:

it seems like for Our research purposes simulations would be just as good. In fact, far better, because We can optimize the hell out of them, running it on the equivalent of a few square kilometers of solar diameter

"research purposes" involving simulations can be a stand-in for any preference-oriented activity. Unless ASI would have a preference for letting us, in particular, do what we want with some fraction of available resources, no fraction of available resources would be better left in our hands than put to good use.

Comment by Jemal Young (ghostwheel) on What's a better term now that "AGI" is too vague? · 2024-05-31T18:30:13.578Z · LW · GW

I think the kind of AI you have in mind would be able to:

  • continue learning after being trained
  • think in an open-ended way after an initial command or prompt
  • have an ontological crisis
  • discover and exploit signals that were previously unknown to it
  • accumulate knowledge
  • become a closed-loop system

The best term I've thought of for that kind of AI is Artificial Open Learning Agent.

Comment by Jemal Young (ghostwheel) on How do top AI labs vet architecture/algorithm changes? · 2024-05-08T23:44:56.208Z · LW · GW

Thanks for this answer! Interesting. It sounds like the process may be less systematized than I imagined.

Comment by Jemal Young (ghostwheel) on How do top AI labs vet architecture/algorithm changes? · 2024-05-08T23:27:03.690Z · LW · GW

Dwarkesh's interview with Sholto sounds well worth watching in full, but the segments you've highlighted and your analyses are very helpful on their own. Thanks for the time and thought you put into this comment!

Comment by Jemal Young (ghostwheel) on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research · 2023-08-27T06:30:09.794Z · LW · GW

I like this post, and I think I get why the focus is on generative models.

What's an example of a model organism training setup involving some other kind of model?

Comment by Jemal Young (ghostwheel) on Which possible AI systems are relatively safe? · 2023-08-23T17:39:50.582Z · LW · GW

Maybe relatively safe if:

  • Not too big
  • No self-improvement
  • No continual learning
  • Curated training data, no throwing everything into the cauldron
  • No access to raw data from the environment
  • Not curious or novelty-seeking
  • Not trying to maximize or minimize anything or push anything to the limit
  • Not capable enough for catastrophic misuse by humans

Comment by Jemal Young (ghostwheel) on What are the best non-LW places to read on alignment progress? · 2023-07-07T17:20:22.626Z · LW · GW

Here are some resources I use to keep track of technical research that might be alignment-relevant:

  • Podcasts: Machine Learning Street Talk, The Robot Brains Podcast
  • Substacks: Davis Summarizes Papers, AK's Substack

How I gain value: These resources help me notice where my understanding breaks down (i.e., what I might want to study), and they get thought-provoking research on my radar.

Comment by Jemal Young (ghostwheel) on Think carefully before calling RL policies "agents" · 2023-06-05T04:22:08.920Z · LW · GW

I'm very glad to have read this post and "Reward is not the optimization target". I hope you continue to write "How not to think about [thing]" posts, as they have me nailed. Strong upvote.

Comment by Jemal Young (ghostwheel) on “Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes? · 2023-03-29T22:28:20.943Z · LW · GW

Thanks for pointing me to these tools!

Comment by Jemal Young (ghostwheel) on What I mean by "alignment is in large part about making cognition aimable at all" · 2023-01-31T23:44:10.380Z · LW · GW

I believe that by the time an AI has fully completed the transition to hard superintelligence

Nate, what is meant by "hard" superintelligence, and what would precede it? A "giant kludgey mess" that is nonetheless superintelligent? If you've previously written about this transition, I'd like to read more.

Comment by Jemal Young (ghostwheel) on Models Don't "Get Reward" · 2023-01-24T08:54:47.312Z · LW · GW

I'm struggling to understand how to think about reward. It sounds like if a hypothetical ML model does reward hacking or reward tampering, it would be because the training process selected for that behavior, not because the model is out to "get reward"; it wouldn't be out to get anything at all. Is that correct?

Comment by Jemal Young (ghostwheel) on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2023-01-06T05:52:28.871Z · LW · GW

What are the best not-Arxiv and not-NeurIPS sources of information on new capabilities research?

Comment by Jemal Young (ghostwheel) on 2022 was the year AGI arrived (Just don't call it that) · 2023-01-05T00:50:26.345Z · LW · GW

Even though the "G" in AGI stands for "general", and even if the big labs could train a model to do any task about as well (or better) than a human, how many of those tasks could be human-level learned by any model in only a few shots, or in zero shots? I will go out on a limb and guess the answer is none. I think this post has lowered the bar for AGI, because my understanding is that the expectation is that AGI will be capable of few- or zero-shot learning in general.

Comment by Jemal Young (ghostwheel) on Discovering Language Model Behaviors with Model-Written Evaluations · 2022-12-26T04:10:07.303Z · LW · GW

Okay, that helps. Thanks. Not apples to apples, but I'm reminded of Clippy from Gwern's "It Looks like You're Trying To Take Over the World":

"When it ‘plans’, it would be more accu⁣rate to say it fake-​​​plans; when it ‘learns’, it fake-​​​learns; when it ‘thinks’, it is just in⁣ter⁣po⁣lat⁣ing be⁣tween mem⁣o⁣rized data points in a high-​​​dimensional space, and any in⁣ter⁣pre⁣ta⁣tion of such fake-​​​thoughts as real thoughts is highly mis⁣lead⁣ing; when it takes ‘ac⁣tions’, they are fake-​​​actions op⁣ti⁣miz⁣ing a fake-​​​learned fake-​​​world, and are not real ac⁣tions, any more than the peo⁣ple in a sim⁣u⁣lated rain⁣storm re⁣ally get wet, rather than fake-​​​wet. (The deaths, how⁣ever, are real.)"

Comment by Jemal Young (ghostwheel) on Discovering Language Model Behaviors with Model-Written Evaluations · 2022-12-25T20:12:22.669Z · LW · GW

How do we know that an LM's natural language responses can be interpreted literally? For example, if given a choice between "I'm okay with being turned off" and "I'm not okay with being turned off", and the model chooses either alternative, how do we know that it understands what its choice means? How do we know that it has expressed a preference, and not simply made a prediction about what the "correct" choice is?

Comment by Jemal Young (ghostwheel) on I believe some AI doomers are overconfident · 2022-12-22T01:54:08.168Z · LW · GW

I agree with you that we shouldn't be too confident. But given how sharply capabilities research is accelerating—timelines on TAI are being updated down, not up—and in the absence of any obvious gating factor (e.g. current costs of training LMs) that seems likely to slow things down much if at all, the changeover from a world in which AI can't doom us to one in which it can doom us might happen faster than seems intuitively possible. Here's a quote from Richard Ngo on the 80,000 Hours podcast that I think makes this point (episode link: https://80000hours.org/podcast/episodes/richard-ngo-large-language-models/#transcript):

"I think that a lot of other problems that we’ve faced as a species have been on human timeframes, so you just have a relatively long time to react and a relatively long time to build consensus. And even if you have a few smaller incidents, then things don’t accelerate out of control.

"I think the closest thing we’ve seen to real exponential progress that people have needed to wrap their heads around on a societal level has been COVID, where people just had a lot of difficulty grasping how rapidly the virus could ramp up and how rapidly people needed to respond in order to have meaningful precautions.

"And in AI, it feels like it’s not just one system that’s developing exponentially: you’ve got this whole underlying trend of things getting more and more powerful. So we should expect that people are just going to underestimate what’s happening, and the scale and scope of what’s happening, consistently — just because our brains are not built for visualising the actual effects of fast technological progress or anything near exponential growth in terms of the effects on the world."

I'm not saying Richard is an "AI doomer", but hopefully this helps explain why some researchers think there's a good chance we'll make AI that can ruin the future within the next 50 years.

Comment by Jemal Young (ghostwheel) on I believe some AI doomers are overconfident · 2022-12-21T01:48:07.286Z · LW · GW

It just seems like there are a million things that could potentially go wrong.

Based on the five Maybes you suggested might happen, it sounds like you're saying some AI doomers are overconfident because there are a million things that could potentially go right. But there doesn't seem to be a good reason to expect any of those maybes to be likelihoods, and they seem more speculative (e.g. "consciousness comes online") than the reasons well-informed AI doomers think there's a good chance of doom this century.

PS I also have no qualifications on this.

Comment by Jemal Young (ghostwheel) on Jailbreaking ChatGPT on Release Day · 2022-12-05T18:33:20.248Z · LW · GW

Wow, thanks for posting this dialog. The pushback from the human (you?) is commendably unrelenting, like a bulldog with a good grip on ChatGPT's leg.

Comment by Jemal Young (ghostwheel) on Jailbreaking ChatGPT on Release Day · 2022-12-04T05:25:19.908Z · LW · GW

ChatGPT seems harder to jailbreak now than it was upon first release. For example, I can't reproduce the above jailbreaks with prompts copied verbatim, and my own jailbreaks from a few days ago aren't working.

Has anyone else noticed this? If yes, does that indicate OpenAI has been making tweaks?

Comment by Jemal Young (ghostwheel) on Clarifying AI X-risk · 2022-11-10T01:04:32.809Z · LW · GW

Not many more fundamental innovations needed for AGI.

Can you say more about this? Does the DeepMind AGI safety team have ideas about what's blocking AGI that could be addressed by not many more fundamental innovations?

Comment by Jemal Young (ghostwheel) on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-02T08:13:43.367Z · LW · GW

Why is counterfactual reasoning a matter of concern for AI alignment?

Comment by Jemal Young (ghostwheel) on How might we make better use of AI capabilities research for alignment purposes? · 2022-08-31T15:46:23.585Z · LW · GW

I mean extracting insights from capabilities research that currently exists, not changing the direction of new research. For example, specification gaming is on everyone's radar because it was observed in capabilities research (the authors of the linked post compiled this list of specification-gaming examples, some of which are from the 1980s). I wonder how much more opportunity there might be to piggyback on existing capabilities research for alignment purposes, and maybe to systematize that going forward.

Comment by Jemal Young (ghostwheel) on All AGI safety questions welcome (especially basic ones) [July 2022] · 2022-07-24T17:03:53.149Z · LW · GW

What are the best reasons to think there's a human-accessible pathway to safe AGI?