Posts

Constituency-sized AI congress? 2024-02-09T16:01:09.592Z
Gunpowder as metaphor for AI 2023-12-28T04:31:40.663Z
Digital humans vs merge with AI? Same or different? 2023-12-06T04:56:38.261Z
Desiderata for an AI 2023-07-19T16:18:08.299Z
An attempt to steelman OpenAI's alignment plan 2023-07-13T18:25:47.036Z
Two paths to win the AGI transition 2023-07-06T21:59:23.150Z
Nice intro video to RSI 2023-05-16T18:48:29.995Z
Will GPT-5 be able to self-improve? 2023-04-29T17:34:48.028Z
Can GPT-4 play 20 questions against another instance of itself? 2023-03-28T01:11:46.601Z
Feature idea: extra info about post author's response to comments. 2023-03-23T20:14:19.105Z
linkpost: neuro-symbolic hybrid ai 2022-10-06T21:52:53.095Z
linkpost: loss basin visualization 2022-09-30T03:42:34.582Z
Progress Report 7: making GPT go hurrdurr instead of brrrrrrr 2022-09-07T03:28:36.060Z
Timelines ARE relevant to alignment research (timelines 2 of ?) 2022-08-24T00:19:27.422Z
Please (re)explain your personal jargon 2022-08-22T14:30:46.774Z
Timelines explanation post part 1 of ? 2022-08-12T16:13:38.368Z
A little playing around with Blenderbot3 2022-08-12T16:06:42.088Z
Nathan Helm-Burger's Shortform 2022-07-14T18:42:49.125Z
Progress Report 6: get the tool working 2022-06-10T11:18:37.151Z
How to balance between process and outcome? 2022-05-04T19:34:10.989Z
Progress Report 5: tying it together 2022-04-23T21:07:03.142Z
What more compute does for brain-like models: response to Rohin 2022-04-13T03:40:34.031Z
Progress Report 4: logit lens redux 2022-04-08T18:35:42.474Z
Progress report 3: clustering transformer neurons 2022-04-05T23:13:18.289Z
Progress Report 2 2022-03-30T02:29:32.670Z
Progress Report 1: interpretability experiments & learning, testing compression hypotheses 2022-03-22T20:12:04.284Z
Neural net / decision tree hybrids: a potential path toward bridging the interpretability gap 2021-09-23T00:38:40.912Z

Comments

Comment by Nathan Helm-Burger (nathan-helm-burger) on Rapid capability gain around supergenius level seems probable even without intelligence needing to improve intelligence · 2024-05-06T21:16:23.377Z · LW · GW

I think this is a good description of the problem. The fact that Einstein's brain had roughly the same amount of compute and data as an average human's, a similar overall architecture, and a similar fundamental learning algorithm means that a brain-like algorithm can improve substantially in capability without big changes to any of those things. How similar to the brain's learning algorithm does an ML algorithm have to be before we should expect similar effects? That seems unclear to me. I think a lot of people who try to forecast AI progress greatly underestimate the potential impact of algorithm development, and how much the rate of algorithmic progress could be accelerated by large-scale automated searches run by sub-AGI models like GPT-5.

Related markets I have on Manifold:

https://manifold.markets/NathanHelmBurger/gpt5-plus-scaffolding-and-inference

https://manifold.markets/NathanHelmBurger/1hour-agi-a-system-capable-of-any-c

A related comment I made on a different post:

https://www.lesswrong.com/posts/sfWPjmfZY4Q5qFC5o/why-i-m-doing-pauseai?commentId=p2avaaRpyqXnMrvWE

Comment by Nathan Helm-Burger (nathan-helm-burger) on Shannon Vallor’s “technomoral virtues” · 2024-05-04T17:19:25.445Z · LW · GW

Something I think humanity is going to have to grapple with soon is the ethics of self-modification / self-improvement, and the perils of value-shift due to rapid internal and external changes. How do we stay true to ourselves while changing fundamental aspects of what it means to be human?

Comment by Nathan Helm-Burger (nathan-helm-burger) on OHGOOD: A coordination body for compute governance · 2024-05-04T15:54:25.982Z · LW · GW

This is a solid-seeming proposal. If we were in a world where the majority of the danger comes from big datacenters and large training runs, I predict this sort of regulation would be helpful. I don't think we are in that world, though, which limits how useful it would be. Further explanation here: https://www.lesswrong.com/posts/sfWPjmfZY4Q5qFC5o/why-i-m-doing-pauseai?commentId=p2avaaRpyqXnMrvWE

Comment by Nathan Helm-Burger (nathan-helm-burger) on How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)? · 2024-05-04T05:32:42.960Z · LW · GW

Personally, I have gradually moved to seeing this as lowering my p(doom). I think humanity's best chance is to politically coordinate to globally enforce strict AI regulation, and the most likely route to that becoming politically feasible is through empirical demonstrations of the danger of AI. AI is more likely to be legibly, empirically dangerous to political decision-makers if it is used in the military. Thus, I think military AI is, counter-intuitively, lowering p(doom). A big accident in which a military AI killed thousands of innocent people the military had not intended to kill could do a lot to lower p(doom).


This is a sad thing to think, obviously. I'm hopeful we can come up with harmless demonstrations of the dangers involved, so that political action will be taken without anyone needing to be killed.

In scenarios where AI becomes powerful enough to present an extinction risk to humanity, I don't expect that the level of robotic weaponry it has control over to matter much. It will have many many opportunities to hurt humanity that look nothing like armed robots and greatly exceed the power of armed robots.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why I'm doing PauseAI · 2024-05-04T04:36:47.942Z · LW · GW

I absolutely sympathize, and I agree that, given the worldview and information you have, advocating for a pause makes sense. I would get behind 'regulate AI' or 'regulate AGI', certainly. I think, though, that pausing is an incorrect strategy which would do more harm than good, so despite being aligned with you in being concerned about AGI dangers, I don't endorse that strategy.

Some part of me thinks this oughtn't matter, since there's approximately 0% chance of the movement achieving that literal goal. The point is to build an anti-AGI movement, and to get people thinking about what it would be like for the government to be able to issue an order to pause AGI R&D, turn off datacenters, or whatever. I think that's a good aim, and your protests probably (slightly) help that aim.

I'm still hung up on the literal 'Pause AI' concept being a problem though. Here's where I'm coming from: 

1. I've been analyzing the risks of current day AI. I believe (but will not offer evidence for here) current day AI is already capable of providing small-but-meaningful uplift to bad actors intending to use it for harm (e.g. weapon development). I think that having stronger AI in the hands of government agencies designed to protect humanity from these harms is one of our best chances at preventing such harms. 

2. I see the 'Pause AI' movement as being targeted mostly at large companies, since I don't see any plausible way for a government or a protest movement to enforce what private individuals do with their home computers. Perhaps you think this is fine because you think that most of the future dangers posed by AI derive from actions taken by large companies or organizations with large amounts of compute. This is emphatically not my view. I think that actually more danger comes from the many independent researchers and hobbyists who are exploring the problem space. I believe there are huge algorithmic power gains which can, and eventually will, be found. I furthermore believe that beyond a certain threshold, AI will be powerful enough to rapidly self-improve far beyond human capability. In other words, I think every AI researcher in the world with a computer is like a child playing with matches in a drought-stricken forest. Any little flame, no matter how small, could set it all ablaze and kill everyone. Are the big labs playing with bonfires dangerous? Certainly. But they are also visible, and can be regulated and made to be reasonably safe by the government. And the results of their work are the only feasible protection we have against the possibility of FOOM-ing rogue AGI launched by small independent researchers. Thus, pausing the big labs would, in my view, place us in greater danger rather than less danger. I think we are already well within the window of risk from independent-researcher-project-initiated-FOOM. Thus, the faster we get the big labs to develop and deploy worldwide AI-watchdogs, the sooner we will be out of danger.

I know these views are not the majority views held by any group (that I know of). These are my personal inside views from extensive research. If you are curious about why I hold these views, or more details about what I believe, feel free to ask. I'll answer if I can.

Comment by Nathan Helm-Burger (nathan-helm-burger) on An Introduction to AI Sandbagging · 2024-05-04T04:04:20.012Z · LW · GW

I've mentioned it elsewhere, but I'll repeat it again here since it's relevant. For GPT-style transformers, and probably for other model types, you can smoothly and subtly degrade the performance of the model by adding noise to part or all of the activations. This is particularly useful for detecting sandbagging, because you would expect sandbagging to show up as an anomalous increase in capability, breaking the otherwise smooth downward trend, as you increase the amount of noise injected or the fraction of activations to which noise is added. I found a noticeable decrease in performance even when the noise was added to only a small fraction (e.g. < 1%) of the activations and was scaled to be small relative to the magnitude of the activation it was added to.
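For illustration, here's a minimal sketch of that kind of noise injection, assuming PyTorch and a HuggingFace GPT-2 model as a stand-in; the layers hooked, noise scale, and fraction of activations perturbed are illustrative choices, not a claim about any exact setup:

```python
# Minimal sketch: inject small Gaussian noise into a random fraction of MLP activations,
# scaled to each activation's magnitude, so capability can be measured vs. noise level.
# Assumptions: PyTorch + HuggingFace transformers, GPT-2 as a stand-in model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

NOISE_SCALE = 0.05      # noise std as a fraction of each activation's own magnitude
NOISE_FRACTION = 0.01   # fraction of activation elements perturbed (< 1%)

def noisy_hook(module, inputs, output):
    # Perturb a random subset of activation elements, proportional to their magnitude.
    mask = (torch.rand_like(output) < NOISE_FRACTION).float()
    noise = torch.randn_like(output) * output.abs() * NOISE_SCALE
    return output + mask * noise

# Hook every transformer block's MLP output (illustrative choice of injection site).
handles = [block.mlp.register_forward_hook(noisy_hook) for block in model.transformer.h]

with torch.no_grad():
    ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
    print(tokenizer.decode(model.generate(ids, max_new_tokens=5)[0]))

for h in handles:
    h.remove()
```

Sweeping NOISE_SCALE and NOISE_FRACTION while scoring a benchmark is the sort of curve where sandbagging would show up as an anomalous bump rather than a smooth decline.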

Comment by Nathan Helm-Burger (nathan-helm-burger) on KAN: Kolmogorov-Arnold Networks · 2024-05-04T03:36:32.297Z · LW · GW

So, after reading the KAN paper, and thinking about it in the context of this post: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their

My vague intuition is that the same experiment done with a KAN would result in a clearer fractal which wiggled less once training loss had plateaued. Is that also other people's intuition?

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why is AGI/ASI Inevitable? · 2024-05-03T21:52:55.001Z · LW · GW

I, on the other hand, have very little confidence that people trying to build AGI will fail to find ways to do it quickly (within the next 3 years, i.e. by 2027). I do have confidence that we can politically coordinate to stop the situation from becoming an extinction or near-extinction-level catastrophe. So I place much less emphasis on abstaining from publishing ideas which may help both alignment and capabilities, and more emphasis on figuring out ways to generate empirical evidence of the danger before it is too late, so as to facilitate political coordination.

I think that a situation in which humanity fails to politically coordinate to avoid building catastrophically dangerous AI is one that leads into conflict, likely a World War III with widespread use of nuclear weapons. I don't expect humanity to go extinct from this, and I don't expect the rogue AGI to emerge as the victor, but I do think it is in everyone's interest to work hard to avoid such a devastating conflict, which would likely wipe out the majority of humanity. That's a pretty grim risk to be facing on the horizon.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Open Thread Spring 2024 · 2024-05-02T21:01:40.392Z · LW · GW

EY may be too busy to respond, but you can probably feel pretty safe consulting with MIRI employees in general. Perhaps also Conjecture employees, and Redwood Research employees, if you read and agree with their views on safety. That at least gives you a wider net of people to potentially give you feedback.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Open Thread Spring 2024 · 2024-05-02T20:51:59.051Z · LW · GW

Some features I'd like:

- A 'mark read' button next to posts, so I could easily mark as read posts that I've read elsewhere (e.g. ones cross-posted from a blog I follow).

- A 'not interested' button which would stop a given post from appearing in my latest or recommended lists. Ideally, this would also update my recommendations so as to recommend fewer posts like that to me. (Note: the hide-from-front-page button could be this if (a) it worked even on promoted/starred posts, and (b) it wasn't hidden in a three-dot menu where it's frustrating to access.)

- A 'read later' button which would put the post into a reading list that I can come back to later.

- A toggle button for 'show all' / 'show only unread', so that I could easily switch between the two modes.

These features would help me keep my 'front page' feeling cleaner and more focused.

Comment by Nathan Helm-Burger (nathan-helm-burger) on D0TheMath's Shortform · 2024-05-02T19:58:50.144Z · LW · GW

Yeah, I agree that releasing open-weights non-frontier models doesn't seem like a frontier capabilities advance. It does seem potentially like an open-source capabilities advance.

That can be bad in different ways. Let me pose a couple hypotheticals.

  1. What if frontier models were already capable of causing grave harms to the world if used by bad actors, and it is only the fact that they are kept safety-fine-tuned and restricted behind APIs that is preventing this? In such a case, it's a dangerous thing to have open-weight models catching up.

  2. What if there is some threshold beyond which a model would be capable of recursive self-improvement, given sufficient scaffolding and unwise pressure from an incautious user? Again, the frontier labs might well abstain from this course, especially if they weren't sure they could trust a new model design created by the current AI; at the least, they would likely move slowly and cautiously. I would not expect this of the open-source community, which seems focused on pushing the boundaries of agent-scaffolding and incautiously exploring whatever it can.

So, as we get closer to danger, open-weight models take on more safety significance.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why is AGI/ASI Inevitable? · 2024-05-02T19:46:10.290Z · LW · GW

I think people can in theory collectively decide not to build AGI or ASI.

Certainly you as an individual can choose this! Where things get tricky is when asking whether that outcome seems probable, or coming up with a plan to bring that outcome about. Similarly, as a child I wondered, "Why can't people just choose not to have wars, just decide not to kill each other?"

People have selfish desires, and group loyalty instincts, and limited communication and coordination capacity, and the world is arranged in such a way that sometimes this leads to escalating cycles of group conflict that are net bad for everyone involved.

That's the scenario I think we are in with AI development as well. Everyone would be safer if we didn't build it, but getting everyone to agree not to, and to hold to that agreement even in private, seems intractably hard.

[Edit: Here's a link to Steven Pinker's writing on the Evolution of War. I don't think, as he does, that the world is trending strongly towards global peace, but I do think he has some valid insights into the sad lose-lose nature of war.]

Comment by Nathan Helm-Burger (nathan-helm-burger) on KAN: Kolmogorov-Arnold Networks · 2024-05-01T20:16:05.079Z · LW · GW

I'm not so sure. You might be right, but I suspect that catastrophic forgetting may still be playing an important role in limiting the peak capabilities of an LLM of a given size. Would it be possible to continue Llama3 8B's training much, much longer and have it eventually outcompete Llama3 405B stopped at its normal training endpoint?

I think probably not. And I suspect that, if not, part (but not all) of the reason would be catastrophic forgetting. Another part would be the limited expressivity of smaller models, another thing which KANs seem to help with.

Comment by Nathan Helm-Burger (nathan-helm-burger) on KAN: Kolmogorov-Arnold Networks · 2024-05-01T17:41:37.333Z · LW · GW

Wow, this is super fascinating.

A juicy tidbit:

Catastrophic forgetting is a serious problem in current machine learning [24]. When a human masters a task and switches to another task, they do not forget how to perform the first task. Unfortunately, this is not the case for neural networks. When a neural network is trained on task 1 and then shifted to being trained on task 2, the network will soon forget about how to perform task 1. A key difference between artificial neural networks and human brains is that human brains have functionally distinct modules placed locally in space. When a new task is learned, structure re-organization only occurs in local regions responsible for relevant skills [25, 26], leaving other regions intact. Most artificial neural networks, including MLPs, do not have this notion of locality, which is probably the reason for catastrophic forgetting.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Nathan Helm-Burger's Shortform · 2024-05-01T02:26:08.622Z · LW · GW

Yeah, I was playing around with using a VAE to compress the logits output from a language transformer. I did indeed settle on treating the vocab size (e.g. 100,000) as the 'channels'.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Nathan Helm-Burger's Shortform · 2024-04-30T17:09:51.175Z · LW · GW

When trying to work with language data vs. image data, an interesting convention of the ML vision research community clashes with one from the language research community. For a language model, you represent the logits as a tensor with shape [batch_size, sequence_length, vocab_size]: for each position in the sequence, there is a likelihood value for each possible token at that position.

In vision models, the assumption is that the data will be in the form [batch_size, color_channels, pixel_position]. Pixel position can be represented as a 2d tensor or flattened to 1d.

See the difference? In the language convention the position dimension comes before the 'channel-like' dimension (the vocab), while in the vision convention the channel dimension comes before the position dimension. Why? Because a color channel has a particular meaning, and so it is intuitive for a researcher working with vision data to think about the 'red channel' as a thing they might want to separate out and view. What if we thought of 2nd-most-probable tokens the same way? Is it meaningful to read a sequence of all 1st-most-probable tokens, then a sequence of all 2nd-most-probable tokens? You could compare the semantic meaning, and the vibe, of the two sets. But this distinction doesn't feel as natural for language logits as it does for color channels.
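As a concrete illustration of the transpose involved (a minimal sketch assuming PyTorch; the sizes are arbitrary), treating the vocab dimension as 'channels' lets conv/VAE-style encoders built for [batch, channels, positions] data consume language-model logits:

```python
# Minimal sketch: reshape language-model logits from the language convention [B, T, V]
# to the vision convention [B, V, T], so a channels-first encoder can compress them.
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size = 2, 32, 100_000
logits = torch.randn(batch_size, seq_len, vocab_size)   # [B, T, V] language convention

x = logits.transpose(1, 2)                                # [B, V, T] vocab treated as channels
encoder = nn.Conv1d(in_channels=vocab_size, out_channels=256, kernel_size=1)
compressed = encoder(x)                                   # [B, 256, T]
print(compressed.shape)
```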

Comment by Nathan Helm-Burger (nathan-helm-burger) on Disentangling Competence and Intelligence · 2024-04-29T18:23:58.963Z · LW · GW

This is something I've been thinking about as well, and I think you do a good job explaining it. There's definitely more to break down and analyze within competence and intelligence, such as simulation being a distinct part of intelligence: a measure of how many moves a player can think ahead in a strategy game like chess or Go. How large a possibility-tree can they build in the available time? With what rate of errors? How quickly does the probability of error increase as the tree grows? How does their performance degrade as the complexity of the variables that need to be tracked for an accurate simulation increases?
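A rough, hypothetical sketch of one such measure (how large a uniform possibility-tree can be expanded within a fixed time budget; the branching factor and budget are arbitrary stand-ins):

```python
# Minimal sketch: count how many nodes of a uniform game tree can be expanded
# breadth-first within a fixed time budget, as a crude proxy for simulation capacity.
import time

def count_expanded_nodes(branching_factor: int, time_budget_s: float) -> int:
    start = time.monotonic()
    frontier, expanded = [()], 0   # () is an abstract placeholder for the root position
    while frontier and time.monotonic() - start < time_budget_s:
        node = frontier.pop(0)
        expanded += 1
        frontier.extend(node + (move,) for move in range(branching_factor))
    return expanded

print(count_expanded_nodes(branching_factor=35, time_budget_s=0.1))  # chess-like branching
```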

Comment by Nathan Helm-Burger (nathan-helm-burger) on We are headed into an extreme compute overhang · 2024-04-27T13:56:00.602Z · LW · GW

As a population of AGI copies, the obvious first step towards 'taking over the world' is to try to improve oneself.

I expect that the described workforce could, within a week of wall-clock time, find improvements including one or more of:

- Improvements to peak intelligence without needing to fully retrain.

- Improvements to inference efficiency.

- Improvements to the ability to cooperate and share knowledge.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Open Thread Spring 2024 · 2024-04-24T17:08:27.824Z · LW · GW

I've been using a remineralization toothpaste imported from Japan for several years now, ever since I mentioned reading about remineralization to a dentist from Japan. She recommended the brand to me. The FDA is apparently bogging down its release in the US, but it's available on Amazon anyway. It seems to have slowed, but not stopped, the formation of cavities. It does seem to result in faster plaque build-up around my gumline, as if the bacterial colonies are accumulating some of the minerals not absorbed by the teeth. The brand I use is Apagard. [Edit: I'm now trying the recommended mouthwash CloSys as the link above recommended, using it before brushing, and using Listerine after. CloSys seems quite gentle and pleasant as a mouthwash. Listerine is harsh, but does leave my teeth feeling cleaner for much longer. I'll try this for a few years and see if it changes my rate of cavity formation.]

Comment by Nathan Helm-Burger (nathan-helm-burger) on Vector Planning in a Lattice Graph · 2024-04-24T16:58:22.909Z · LW · GW

Thanks, faul-sname. I came to the comments to give a much lower-effort answer along the same lines, but yours is better. My answer: lazy local evaluation of the nodes surrounding either your current position or the goal's position. So long as you can estimate a direction from yourself to the goal, there's no need to embed the whole graph. This is basically gradient descent...
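A minimal sketch of what I mean by lazy local evaluation, assuming a simple 2D lattice and a Manhattan-distance estimate toward the goal (both are illustrative stand-ins, not the post's exact setup):

```python
# Minimal sketch: greedy local search that only scores the neighbors of the current node,
# stepping toward whichever neighbor is closest to the goal. No global embedding needed.
def greedy_lattice_walk(start, goal, max_steps=10_000):
    def neighbors(node):
        x, y = node  # 2D lattice, von Neumann neighborhood (hypothetical structure)
        return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

    def estimate(node):
        # Local heuristic: Manhattan distance to the goal
        return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

    current = start
    for _ in range(max_steps):
        if current == goal:
            return current
        current = min(neighbors(current), key=estimate)  # lazy: only evaluate local nodes
    return current

print(greedy_lattice_walk((0, 0), (5, 7)))
```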

Comment by Nathan Helm-Burger (nathan-helm-burger) on 1-page outline of Carlsmith's otherness and control series · 2024-04-24T16:42:11.699Z · LW · GW

Personally, I most enjoyed the first one in the series, and enjoyed listening to Joe's reading of it even more than when I just read it. My top three are 1, 6, 7.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Scenario planning for AI x-risk · 2024-04-24T04:46:21.912Z · LW · GW

I generally agree; I just have some specific evidence which I believe should adjust the report's estimates toward expecting more accessible algorithmic improvements than some people seem to think.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Cooperation is optimal, with weaker agents too  -  tldr · 2024-04-23T22:39:13.316Z · LW · GW

"What Dragons?", says the lion, "I see no Dragons, only a big empty universe. I am the most mighty thing here."

Whether or not the Imagined Dragons are real isn't relevant to the gazelles if there is no solid evidence with which to convince the lions. The lions will do what they will do. Maybe some of the lions do decide to believe in the Dragons, but there is no way to force all of them to do so. The remainder will laugh at the dragon-fearing lions and feast on extra gazelles. Their children will reproduce faster.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Language and Capabilities: Testing LLM Mathematical Abilities Across Languages · 2024-04-23T22:32:38.370Z · LW · GW

Something else to play around with that I've tried: you can force the models to handle each digit separately by putting a space between the digits of a number, like "3 2 3 * 4 3 7 = ".
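A tiny hypothetical helper showing the formatting trick (the function name and the naive parsing are just for illustration):

```python
# Minimal sketch: format an arithmetic prompt so each digit gets its own token by
# inserting spaces, e.g. "323 * 437" -> "3 2 3 * 4 3 7 = ".
def spaced_digit_prompt(expression: str) -> str:
    spaced = " ".join(ch for ch in expression.replace(" ", ""))
    return spaced + " = "

print(spaced_digit_prompt("323 * 437"))  # "3 2 3 * 4 3 7 = "
```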

Comment by Nathan Helm-Burger (nathan-helm-burger) on Red teaming: challenges and research directions · 2024-04-23T22:00:01.853Z · LW · GW

As part of a team of experts building private biorisk evals for AI, and doing private red-teaming experiments, I appreciate this post.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Scenario planning for AI x-risk · 2024-04-23T21:46:07.474Z · LW · GW

The interesting thing to me about the question, "Will we need a new paradigm for AGI?" is that a lot of people seem to be focused on this but I think it misses a nearby important question.

As we get closer to complete AGI and start to get more capable programming and research-assistant AIs, will those make algorithmic exploration cheaper and easier, such that we see a sort of 'Cambrian explosion' of model architectures that work well for specific purposes? And might one of those work better at general learning than anything we've found so far, ending up as the architecture that first reaches full transformative AGI?

The point I'm generally trying to make is that estimates of software/algorithmic progress are based on the progress being made (currently) mostly by human minds. The closer we get to generally competent artificial minds, the less we should expect past patterns based on human inputs to hold.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Fluent dreaming for language models (AI interpretability method) · 2024-04-23T16:20:54.519Z · LW · GW

Very cool. I bet janus would dig this.

Comment by Nathan Helm-Burger (nathan-helm-burger) on CHAT Diplomacy: LLMs and National Security · 2024-04-23T05:37:48.918Z · LW · GW

I think perhaps in some ways this overstated the risks that were present at the time, but the forecasting is still relevant going forward. AI is continuing to improve. At some point, people will be able to make agents that can do a lot of harm. Given the risks, we can't rely on compute governance with the level of confidence we would need to be comfortable with it as a solution.

An example of recent work showing the potential for compute governance to fail: https://arxiv.org/abs/2403.10616v1 

Comment by Nathan Helm-Burger (nathan-helm-burger) on How to Model the Future of Open-Source LLMs? · 2024-04-23T05:26:07.417Z · LW · GW

Unless current architectures hit a 'peak-capabilities wall' that isn't overcome by the combined effects of compute-efficiency-improving algorithmic advances. In that case the gap would close, because any big company that tried to pull ahead by naively increasing compute plus a few hidden algorithmic advantages would be stopped at the wall. It would get cheaper to reach the wall, but once there, extra money/compute/data would be wasted. Thus, a shrinking-gap world.

I'm not sure if there will be a 'peak-capabilities wall' in this way, or if the algorithmic advancements will be creative enough to get around it. The shape of the future in this regard seems highly uncertain to me. I do think it's theoretically possible to get substantial improvements in peak capabilities and also in training/inference efficiencies. Will such improvements keep arriving relatively gradually as they have been? Will there be a sudden glut at some point when the models hit a threshold where they can be used to seek and find algorithmic improvements? Very unclear.

Comment by Nathan Helm-Burger (nathan-helm-burger) on What good is G-factor if you're dumped in the woods? A field report from a camp counselor. · 2024-04-23T05:10:11.652Z · LW · GW

My childhood was quite different, in that I was quite kind-hearted, honest, and generally obedient to the letter of the law... but I was constantly getting into trouble in elementary school. I just kept coming up with new interesting things to do that they hadn't made an explicit rule against yet. Once they caught me doing the new thing, they told me never to do it again and made a new rule. So then I came up with a new interesting thing to try.

How about tying several jump ropes together to make a longer rope, tying a lasso on one end, lassoing an exhaust pipe on the roof of the one-story flat-roofed school, and then climb-walking up the wall onto the roof of the school? Oh? That's against the rules now? Ok. 

How about digging a tunnel under a piece of playground equipment which had an extended portion touching the ground, and then having fun crawling through the ~4ft long tunnel? No tunneling anymore? Ok.

How about taking off my shoes and socks and shinnying up the basketball hoop support pole? No? Ok.

Finding interesting uses for various plant materials collected from the borders of the playground... banned one after another.

An endless stream of such things.

My principal was more amused and exasperated with me than mad.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Nathan Helm-Burger's Shortform · 2024-04-20T05:21:33.063Z · LW · GW

I mean, that is kinda what I'm trying to get at. I feel like any sufficiently powerful AI should be treated as a dangerous tool, like a gun. It should be used carefully and deliberately.

Instead we're just letting anyone do whatever with them. For now, nothing too bad has happened, but I feel confident that the danger is real and getting worse quickly as models improve.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-19T17:13:41.053Z · LW · GW

Personally, I like mentally splitting the space into AI safety (emphasis on measurement and control), AI alignment (getting the AI to align with the operators' purposes and actually do what the operators desire), and AI value-alignment (getting the AI to understand and care about what people need and want). It feels like a Venn diagram with a lot of overlap, and yet some distinct non-overlapping spaces.

By my framing, Redwood Research and METR are more centrally AI safety. ARC's/Paul's research agenda is more of a mix of AI safety and AI alignment. MIRI's work to fundamentally understand and shape agents is a mix of AI alignment and AI value-alignment. Obviously, success there would have the downstream effect of robustly improving AI safety (reducing the need for careful evals and control), but it is a more difficult approach in general with less immediate applicability. I think we need all of these things!

Comment by Nathan Helm-Burger (nathan-helm-burger) on Cooperation is optimal, with weaker agents too  -  tldr · 2024-04-18T19:08:47.703Z · LW · GW

As the last gazelle dies, how much comfort does it take in the idea that some vengeful alien may someday punish the lions for their cruelty? Regardless of whether it is comforted or not by this idea, it still dies.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Cooperation is optimal, with weaker agents too  -  tldr · 2024-04-18T18:25:52.139Z · LW · GW

"Cooperation is optimal," said the lion to the gazelles, "sooner or later I will get one of you. If I must give chase, we will all waste calories. Instead, sacrifice your least popular member to me, and many calories will be saved."

The gazelles didn't like it, but they eventually agreed. The lion population boomed as the well-fed and unchallenged lions flourished. Soon the gazelles were pushed to extinction, and most of the lions starved, because the wildebeest were not so compliant.

Anyway, I'm being silly with my story. The point I'm making is that only certain subsets of possible world states with certain power distributions are cooperation-optimal. And unfortunately I don't think our current world state, or any that I foresee as probable, are cooperation-optimal for ALL humans. And if you allow for creation of non-human agents, then the fraction of actors for whom cooperation-with-humans is optimal could drop off very quickly. AIs can have value systems far different from ours, and have affordances for actions we don't have, and this changes the strategic payoffs in an unfavorable way.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Nathan Helm-Burger's Shortform · 2024-04-18T18:12:20.089Z · LW · GW

I feel like I'd like the different categories of AI risk attenuation to be referred to as more clearly separate:

AI usability safety - would this gun be safe for a trained professional to use on a shooting range? Will it be reasonably accurate and not explode or backfire?

AI world-impact safety - would it be safe to give out one of these guns for $0.10 to anyone who wanted one?

AI weird complicated usability safety - would this gun be safe to use if a crazy person tried to use a hundred of them plus a variety of other guns, to make an elaborate Rube Goldberg machine and fire it off with live ammo with no testing?

Comment by Nathan Helm-Burger (nathan-helm-burger) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-17T17:46:12.020Z · LW · GW

Yeah, I'd say that in general the US government's attempts to regulate cryptography have been a bungled mess which helped the situation very little, if at all. I have the whole mess mentally categorized as an example of how we really want AI regulation NOT to be handled.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-16T20:13:06.564Z · LW · GW

Whoohoo!

Comment by Nathan Helm-Burger (nathan-helm-burger) on Creating unrestricted AI Agents with Command R+ · 2024-04-16T20:05:20.211Z · LW · GW

My guess is that further scaffolding work (e.g. improved enforcement of thinking out a step by step plan, and then executing the steps of the plan in order) would result in unlocking a lot more capabilities from current open-weight models. I also expect that this is but a hint of things to come, and that by this time next year we'll see evidence that then-current models can accomplish a wide range of complex multi-step tasks.
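To make the kind of scaffolding I mean concrete, here's a minimal sketch of a plan-then-execute loop; `generate` is a hypothetical stand-in for whatever completion call a given open-weight model exposes:

```python
# Minimal sketch of a plan-then-execute scaffold. `generate` is any str -> str callable
# (hypothetical stand-in for a local or hosted model); the parsing is deliberately naive.
from typing import Callable

def plan_then_execute(task: str, generate: Callable[[str], str]) -> list[str]:
    # 1. Force the model to think out an explicit, numbered, step-by-step plan.
    plan = generate(f"Task: {task}\nWrite a numbered step-by-step plan, one step per line.")
    steps = [ln.split(".", 1)[-1].strip() for ln in plan.splitlines() if ln.strip()]

    # 2. Enforce executing the steps strictly in order, feeding earlier results back in.
    results: list[str] = []
    for i, step in enumerate(steps, 1):
        results.append(generate(
            f"Task: {task}\nPlan:\n{plan}\nResults so far: {results}\n"
            f"Now carry out step {i} only: {step}"
        ))
    return results

# Toy usage with a dummy "model", just to show the control flow:
print(plan_then_execute("make tea", lambda p: "1. boil water\n2. steep tea" if "plan" in p else "done"))
```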

Comment by Nathan Helm-Burger (nathan-helm-burger) on What convincing warning shot could help prevent extinction from AI? · 2024-04-13T21:34:46.842Z · LW · GW

I'm hopeful that a sufficiently convincing demo could convince politicians, military brass, wealthy powerful people, and/or the public. Probably different demos could be designed to be persuasive to these different audiences. Ideally, the demos could be designed early, and you could get buy-in from the target audience that, if the described demo were successful, they would agree that "something needed to be done". Even better would be concrete commitments, but I think there's value even without that. Being as prepared as possible to act on a range of plausible natural warning shots also seems good, as does getting similar pre-negotiated agreements that, if X did happen, it should be considered a tipping point for taking action.

Comment by Nathan Helm-Burger (nathan-helm-burger) on simeon_c's Shortform · 2024-04-10T20:32:14.854Z · LW · GW

Something which concerns me is that transformative AI will likely be a powerful destabilizing force, which will place countries currently behind in AI development (e.g. Russia and China) in a difficult position. Their governments are currently in the position of seeing that peacefully adhering to the status quo may lead to rapid disempowerment, and that the potential for coercive action to interfere with disempowerment is high. It is pretty clearly easier and cheaper to destroy chip fabs than create them, easier to kill tech employees with potent engineering skills than to train new ones.

I agree that conditions of war make safe transitions to AGI harder and make people more likely to accept higher risk. I don't see what to do about the fact that the development of AI power is itself creating pressure toward war. This seems bad. I don't know what I can do to make the situation better, though.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Matthew Barnett's Shortform · 2024-04-05T18:10:27.825Z · LW · GW

I'm confused here, Matthew. It seems to me highly probable that AI systems which want takeover and ones that want moderate power combined with peaceful coexistence with humanity... are pretty hard to distinguish early on. And early on is when it's most important for humanity to distinguish between them, before those systems have gained power and while we can still stop them.

Picture a merciless, un-aging sociopath, capable of duplicating itself easily and rapidly, on a trajectory of gaining economic, political, and military power, with the aim of acquiring as much power as possible. Imagine that this entity has the option of making empty promises and highly persuasive lies to humans in order to gain power, with no intention of fulfilling any of those promises once it achieves enough power.

That seems like a scary possibility to me, and I don't know how I'd trust an agent which seemed like it could be this but was making really nice-sounding promises. Even if it was honoring its short-term promises while still under the constraints of coercive power from currently dominant human institutions, I still wouldn't trust that it would continue keeping its promises once it became the dominant power.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Thomas Kwa's Shortform · 2024-04-05T17:54:20.121Z · LW · GW

Sounds like you use bad air purifiers, or too few, or run them on too low of a setting. I live in a wildfire prone area, and always keep a close eye on the PM2.5 reports for outside air, as well as my indoor air monitor. My air filters do a great job of keeping the air pollution down inside, and doing something like opening a door gives a noticeable brief spike in the PM2.5.

Good results require fresh filters, somewhat more than the recommended number of air purifiers per unit of area, and running them on max speed (low speeds tend to be disproportionately less effective, giving unintuitively low performance).

Comment by Nathan Helm-Burger (nathan-helm-burger) on Wei Dai's Shortform · 2024-04-05T17:35:09.573Z · LW · GW

From talking with people who serve on a lot of grant committees at the NIH and similar funding orgs: it's really hard to do proper blinding of reviews. Certain labs tend to focus on particular theories and methods, repeating variations of the same idea... So if you are familiar with the general approach of a particular lab and its principal investigator, you will immediately recognize, and have a knee-jerk reaction (positive or negative) to, a paper which pattern-matches to the work that lab / subfield is doing.

Common reactions from grant reviewers:

Positive - "This fits in nicely with my friend Bob's work. I respect his work, I should argue for funding this grant."

Neutral - "This seems entirely novel to me, I don't recognize it as connecting with any of the leading trendy ideas in the field or any of my personal favorite subtopics. Therefore, this seems high risk and I shouldn't argue too hard for it."

Slightly negative - "This seems novel to me, and doesn't sound particularly 'jargon-y' or technically sophisticated. Even if the results would be beneficial to humanity, the methods seem boring and uncreative. I will argue slightly against funding this."

Negative - "This seems to pattern match to a subfield I feel biased against. Even if this isn't from one of Jill's students, it fits with Jill's take on this subtopic. I don't want views like Jill's gaining more traction. I will argue against this regardless of the quality of the logic and preliminary data presented in this grant proposal."

Comment by Nathan Helm-Burger (nathan-helm-burger) on Wei Dai's Shortform · 2024-04-05T17:24:27.242Z · LW · GW

From my years in academia studying neuroscience and related aspects of bioengineering and medicine development... yeah. So much about how effort gets allocated is not 'what would be good for our country's population in expectation, or good for all humanity'. It's mostly 'what would make an impressive-sounding research paper that could get into an esteemed journal?', 'what would be relatively cheap and easy to do, but sound disproportionately cool?', 'what do we guess the granting agency we are applying to will like the sound of?'. So much emphasis on catching waves of trendiness, and so little on estimating the expected value of the results.

Research an unprofitable preventative-health treatment which plausibly might have significant impacts on a wide segment of the population? Booooring.

Research an impractically-expensive-to-produce fascinatingly complex clever new treatment for an incredibly rare orphan disease? Awesome.

Comment by Nathan Helm-Burger (nathan-helm-burger) on How might we align transformative AI if it’s developed very soon? · 2024-04-05T02:54:12.325Z · LW · GW

One point I’ve seen raised by people in the latter group is along the lines of: “It’s very unlikely that we’ll be in a situation where we’re forced to build AI systems vastly more capable than their supervisors. Even if we have a very fast takeoff - say, going from being unable to create human-level AI systems to being able to create very superhuman systems ~overnight - there will probably still be some way to create systems that are only slightly more powerful than our current trusted systems and/or humans; to use these to supervise and align systems slightly more powerful than them; etc. (For example, we could take a very powerful, general algorithm and simply run it on a relatively low amount of compute in order to get a system that isn’t too powerful.)” This seems like a plausible argument that we’re unlikely to be stuck with a large gap between AI systems’ capabilities and their supervisors’ capabilities; I’m not currently clear on what the counter-argument is.


I agree that this is a very promising advantage for Team Safety. I do think that, in order to make good use of this potential advantage, the AI creators need to be cautious going into the process. 

One way that I've come up with to 'turn down' the power of an AI system is to simply inject small amounts of noise into its activations. 

Comment by Nathan Helm-Burger (nathan-helm-burger) on Richard Ngo's Shortform · 2024-04-03T20:10:01.996Z · LW · GW

Tangentially related (spoilers for Worth the Candle):

I think it'd be hard to do a better cohesive depiction of Utopia than the end of Worth the Candle by A Wales. I mean, I hope someone does do it, I just think it'll be challenging to do!

Comment by Nathan Helm-Burger (nathan-helm-burger) on Nathan Helm-Burger's Shortform · 2024-04-03T20:03:16.827Z · LW · GW

https://youtu.be/Xd5PLYl4Q5Q?si=EQ7A0oOV78z7StX2

Cute demo of Claude, GPT4, and Gemini building stuff in Minecraft

Comment by Nathan Helm-Burger (nathan-helm-burger) on New paper on aligning AI with human values · 2024-03-31T17:01:18.398Z · LW · GW

Still reading the paper, but so far I love it. This feels like a big step forward in thinking about the issues at hand which addresses so many of the concerns I had about limitations of previous works. Whether or not the proposed technical solution works out as well as hoped, I feel confident that your framing of the problem and presentation of desiderata of a solution are really excellent. I think that alone is a big step forward for the frontier of thought on this subject.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Modern Transformers are AGI, and Human-Level · 2024-03-30T20:12:20.927Z · LW · GW

I think my comment is related to yours: https://www.lesswrong.com/posts/gP8tvspKG79RqACTn/modern-transformers-are-agi-and-human-level?commentId=RcmFf5qRAkTA4dmDo

Also see Leogao's comment and my response to it: https://www.lesswrong.com/posts/gP8tvspKG79RqACTn/modern-transformers-are-agi-and-human-level?commentId=YzM6cSonELpjZ38ET

Comment by Nathan Helm-Burger (nathan-helm-burger) on Modern Transformers are AGI, and Human-Level · 2024-03-30T20:07:52.214Z · LW · GW

I think my comment (link https://www.lesswrong.com/posts/gP8tvspKG79RqACTn/modern-transformers-are-agi-and-human-level?commentId=RcmFf5qRAkTA4dmDo ) relates to yours. I think there is a tool/process/ability missing that I'd call mastery-of-novel-domain. I also think there's a missing ability of "integrating known facts to come up with novel conclusions pointed at by multiple facts". Unsure what to call this. Maybe knowledge-integration or worldview-consolidation?