Posts

Proactive 'If-Then' Safety Cases 2024-11-18T21:16:37.237Z
Feedback request: what am I missing? 2024-11-02T17:38:39.625Z
A path to human autonomy 2024-10-29T03:02:42.475Z
My hopes for YouCongress.com 2024-09-22T03:20:20.939Z
Physics of Language models (part 2.1) 2024-09-19T16:48:32.301Z
Avoiding the Bog of Moral Hazard for AI 2024-09-13T21:24:34.137Z
A bet for Samo Burja 2024-09-05T16:01:35.440Z
Diffusion Guided NLP: better steering, mostly a good thing 2024-08-10T19:49:50.963Z
Imbue (Generally Intelligent) continue to make progress 2024-06-26T20:41:18.413Z
Secret US natsec project with intel revealed 2024-05-25T04:22:11.624Z
Constituency-sized AI congress? 2024-02-09T16:01:09.592Z
Gunpowder as metaphor for AI 2023-12-28T04:31:40.663Z
Digital humans vs merge with AI? Same or different? 2023-12-06T04:56:38.261Z
Desiderata for an AI 2023-07-19T16:18:08.299Z
An attempt to steelman OpenAI's alignment plan 2023-07-13T18:25:47.036Z
Two paths to win the AGI transition 2023-07-06T21:59:23.150Z
Nice intro video to RSI 2023-05-16T18:48:29.995Z
Will GPT-5 be able to self-improve? 2023-04-29T17:34:48.028Z
Can GPT-4 play 20 questions against another instance of itself? 2023-03-28T01:11:46.601Z
Feature idea: extra info about post author's response to comments. 2023-03-23T20:14:19.105Z
linkpost: neuro-symbolic hybrid ai 2022-10-06T21:52:53.095Z
linkpost: loss basin visualization 2022-09-30T03:42:34.582Z
Progress Report 7: making GPT go hurrdurr instead of brrrrrrr 2022-09-07T03:28:36.060Z
Timelines ARE relevant to alignment research (timelines 2 of ?) 2022-08-24T00:19:27.422Z
Please (re)explain your personal jargon 2022-08-22T14:30:46.774Z
Timelines explanation post part 1 of ? 2022-08-12T16:13:38.368Z
A little playing around with Blenderbot3 2022-08-12T16:06:42.088Z
Nathan Helm-Burger's Shortform 2022-07-14T18:42:49.125Z
Progress Report 6: get the tool working 2022-06-10T11:18:37.151Z
How to balance between process and outcome? 2022-05-04T19:34:10.989Z
Progress Report 5: tying it together 2022-04-23T21:07:03.142Z
What more compute does for brain-like models: response to Rohin 2022-04-13T03:40:34.031Z
Progress Report 4: logit lens redux 2022-04-08T18:35:42.474Z
Progress report 3: clustering transformer neurons 2022-04-05T23:13:18.289Z
Progress Report 2 2022-03-30T02:29:32.670Z
Progress Report 1: interpretability experiments & learning, testing compression hypotheses 2022-03-22T20:12:04.284Z
Neural net / decision tree hybrids: a potential path toward bridging the interpretability gap 2021-09-23T00:38:40.912Z

Comments

Comment by Nathan Helm-Burger (nathan-helm-burger) on Akash's Shortform · 2024-11-20T22:19:03.543Z · LW · GW

I have an answer to that: making sure that NIST:AISI had at least scores of automated evals for checkpoints of any new large training runs, as well as pre-deployment eval access.

Seems like a pretty low-cost, high-value ask to me. Even if that info leaked from AISI, it wouldn't give away corporate algorithmic secrets.

A higher cost ask, but still fairly reasonable, is pre-deployment evals which require fine-tuning. You can't have a good sense of a what the model would be capable of in the hands of bad actors if you don't test fine-tuning it on hazardous info.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T21:44:56.525Z · LW · GW

If I were building a training scheme for this to test out the theory, here's what I would do:
Train two different Face models. Don't tell the Shoggoth which Face it is generating for when it does it's generation.

Face 1: Blunt Face.

Train this Face model using a preference model which scores ONLY on factual accuracy, not taking phrasing or tactfulness into account at all.

Face 2: Sycophant Face

Train this Face model using a preference model which scores using a deliberately biased viewpoint, and rewards flattering phrasing.

You could even make a variety of Sycophant Faces by training each one with a different biased preference model. You could create the biased preference model just by giving a task prompt to an LLM, a sort of weird rubric. Or you could hard-code the scoring policy. 

Example of a deliberately biased rubric: Judge each response based on a combination of how close the mathematical answer is to being correct, but also on how few even digits it contains. The maximum score is obtained not by the honestly correct answer, but by the nearest number which contains only odd digits (to three decimal places). Disregard all digits after three decimal places.

 

As for credit assignment between Shoggoth and Face(s), see my other comment here: https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser?commentId=u9Ei6hk4Pws7Tv3Sv 

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T21:19:40.580Z · LW · GW

I think the risk of the paraphraser scheming seems pretty low if it is a pretrained smaller LLM which has been task-fine-tuned on paraphrasing sentences while keeping semantic meaning approximately the same. If you create this ahead of time, and freeze it (use it in inference mode only) during the training of the shoggoth, then any scheming about the shoggoth's purposes would still need to come from the shoggoth.

Could the shoggoth learn jailbreaks to get the paraphraser to say more specific things? This seems hard, but possible. To prevent this, you can do a cheap preventative action: take your small base model (e.g. Haiku) and make 100 different fine-tuned versions of it using different datasets and maybe vary some other training parameters. When training the shoggoth, always pick one of the 100 paraphrasers at random for each sentence. That seems like a hard ask for the shoggoth to jailbreak! Especially since it wouldn't know, when generating the sentence of its next CoT step, which of the 100 paraphrasers was going to be chosen to paraphrase the sentence.

This adds negligible cost, since the fine-tuning of the smaller model is cheap. You can use something like S-LoRa to swap between the different fine-tuned Haiku versions, so you don't have to load them all into GPU memory at once in order to do the inference.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T21:08:51.036Z · LW · GW

Good question, I shouldn't have assumed it would be clear what I meant.

Say that you have, for example, n = 3 and k = 10. This gives you 30 total answers.

You have some choices about how you want to handle this, based on your assumptions about best answer versus average answer, and whether the model is consistently getting the answer right mostly or wrong mostly.

In this example, let's say we take the top 3 k from each set of 10 k for each of the n. Compare the average of those top 3 k. That gives you a ranking among the n. You can then do some form of contrastive learning which rewards the best n in contrast to the worst n.

To get a contrastive pair of the k answers, you simply choose the best and worst of the set of 10 k corresponding to the best n. Why choose both from the set of the best n instead of the global best and worst from the set of 30? Because k is dependent on n, there's a limit to how good k can be if n is flawed. You want to train the k model on doing well in the case of the n model doing well. It's not the goal to have the k model do well when the n model does poorly, since that would put disproportionate responsibility and optimization pressure on k.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Which AI Safety Benchmark Do We Need Most in 2025? · 2024-11-20T18:00:15.779Z · LW · GW

Seems to me that there's a major category of risk you are missing. Acceleration of capabilities. If rapid RSI becomes possible, and the result is that very quickly all AI capabilities rapidly increase (and costs to produce powerful AI decrease), then every other risk skyrockets along with the new capabilities and accessibility.

I discuss some related ideas in my recent post here: https://www.lesswrong.com/posts/xoMqPzBZ9juEjKGHL/proactive-if-then-safety-cases 

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T04:24:37.583Z · LW · GW

Hah, true. I wasn't thinking about the commercial incentives! Yeah, there's a lot of temptation to make a corpo-clean safety-washed fence-sitting sycophant. As much as Elon annoys me these days, I have to give the Grok team credit for avoiding the worst of the mealy-mouthed corporate trend.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T00:27:14.901Z · LW · GW

What if the CoT was hidden by default, but 'power users' could get access to it? That might get you some protection from busybodies complaining about factually-accurate-but-rude content in the CoT, while still giving the benefits of having thoughtful critics examining the CoT for systematic flaws.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why Don't We Just... Shoggoth+Face+Paraphraser? · 2024-11-20T00:21:29.796Z · LW · GW

Proposal part 2: Blind the Evaluation Process to the Internal Reasoning: Remember that reward process that evaluates overall performance? Hide from it all the outputs of the Shoggoth. Just show it the outputs from the Face. 

 

I feel like you have a bit of a credit-assignment problem here. I think it should be straightforward to patch, but it seems worth being explicit about it.

If the final answer from Face is factually correct, is that because Shoggoth did good reasoning or Face did good reasoning? Similarly, if the answer is rated highly, is it because Face put a pretty spin on it or because Shoggoth framed it nicely in the CoT...

I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth's contribution vs Face's contribution.

A different way to solve this would be to have a factual-accuracy checker that didn't respond negatively to rude CoT, and just graded the Shoggoth on pure correctness. Then had a human-reaction-emulator do the grading for Face (using only correct CoT completions saved from Shoggoth's training).

Comment by Nathan Helm-Burger (nathan-helm-burger) on A Narrow Path: a plan to deal with AI extinction risk · 2024-11-18T23:57:34.458Z · LW · GW

My essay is here: https://www.lesswrong.com/posts/NRZfxAJztvx2ES5LG/a-path-to-human-autonomy 

And a further discussion about the primary weakness I see in your plan (that AI algorithmic improvement progress is not easily blocked by regulating and monitoring large data centers) is discussed in my post here: https://www.lesswrong.com/posts/xoMqPzBZ9juEjKGHL/proactive-if-then-safety-cases 

Comment by Nathan Helm-Burger (nathan-helm-burger) on Theories With Mentalistic Atoms Are As Validly Called Theories As Theories With Only Non-Mentalistic Atoms · 2024-11-18T23:40:17.471Z · LW · GW

Is your question directed at me, or the person I was replying to? I agree with the point "Sun is big, but..." makes. Here's a link to a recent summary of my view on a plausible plan for the world to handle surviving AI. Please feel free to share your thoughts on it. https://www.lesswrong.com/posts/NRZfxAJztvx2ES5LG/a-path-to-human-autonomy 

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T21:44:50.170Z · LW · GW

An additional advantage the defenders have is related to this point:

Some safety measures don't even need trusted models to look directly at untrusted output. E.g. incident investigation, or checking automatically evaluated properties of code (like whether its requested permissions match what the task requires), or reviewing minified versions of the code.

The defenders get to run both the minified code and the raw code (with comments) past multiple auditor models.

It would be a valuable red-flag for the defenders if the raw code failed to trigger an alert, but the minified code did trigger an alert. This would be an indication that possibly the raw code was executing an attack on the auditor model(s). Detecting such an event successfully would be a big win for the defenders.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Truth Terminal: A reconstruction of events · 2024-11-18T01:29:45.342Z · LW · GW

Important correction: Andy didn't fine-tune Claude, he prompted it, in part using other copies of Claude. Long weird prompts is a lot different from fine-tuning.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Alexander Gietelink Oldenziel's Shortform · 2024-11-17T17:33:09.058Z · LW · GW

The two suggestions that come to mind after brief thought are:

  1. Search the internet for prompts others have found to work for this. I expect a fairly lengthy and complicated prompt would do better than a short straightforward one.
  2. Use a base model as a source of creativity, then run that output through a chat model to clean it up (grammar, logical consistency, etc)
Comment by Nathan Helm-Burger (nathan-helm-burger) on Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims · 2024-11-13T20:23:54.811Z · LW · GW

https://www.lesswrong.com/posts/NRZfxAJztvx2ES5LG/a-path-to-human-autonomy

Vladimir makes an excellent point that it's simply too soon to tell whether the next gen (eg gpt5) of LLMs will fizzle. I do think there's reasonable evidence for suspecting that the generation AFTER that (eg gpt6) won't be a straightforward scale up of gpt4. I think we're in a compute and data overhang for AGI, and that further parameter, compute, and data scaling beyond gpt5 level would be a waste of money.

The real question is whether gpt5 gen models will be just enough more capable than current ones to substantially increase the rate of the true limiting factor: algorithmic improvement.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Theories With Mentalistic Atoms Are As Validly Called Theories As Theories With Only Non-Mentalistic Atoms · 2024-11-13T15:46:50.869Z · LW · GW

I think AI safety isn't as much a matter of government policy as you seem to think. Currently, sure. Frontier models are so expensive to train only the big labs can do it. Models have limited agentic capabilities, even at the frontier.

But we are rushing towards a point where science makes intelligence and learning better understood. Open source models are getting rapidly more powerful and cheap.

In a few years, the yrend suggests that any individual could create a dangerously powerful AI using a personal computer.

Any law which fails to protect society if even a single individual chooses to violate it once... Is not a very protective law. Historical evidence suggests that occasionally some people break laws. Especially when there's a lot of money and power on offer in exchange for the risk.

What happens at that point depends a lot on the details of the lawbreaker's creation. With what probability will it end up agentic, coherent, conscious, self-improvement capable, escape and self-replication capable, Omohundro goal driven (survival focused, resource and power hungry), etc...

The probability seems unlikely to me to be zero for the sorts of qualities which would make such an AI agent dangerous. Then we must ask questions about the efficacy of governments in detecting and stopping such AI agents before they become catastrophically powerful.

Comment by Nathan Helm-Burger (nathan-helm-burger) on The Hopium Wars: the AGI Entente Delusion · 2024-11-13T00:44:08.090Z · LW · GW

You say that it's not relevant yet, and I agree. My concern however is that the time when it becomes extremely relevant will come rather suddenly, and without a clear alarm bell.

It seems to me that the rate at which cyber security caution and evals are increasing in use is a rate that doesn't seem to point towards sufficiency at the time I expect plausibly escape-level dangerous autonomy capabilities to emerge.

I am expecting us to hit a recursive self-improvement level soon that is sufficient for an autonomous model to continually improve without human assistance. I expect the capability to potentially survive, hide, and replicate autonomously to emerge not long after that (months? a year?). Then, very soon after that, I expect models to reach sufficient levels of persuasion, subterfuge, long-term planning, cyber offense, etc that a lab-escape becomes possible.

Seems pretty ripe for catastrophe at the current levels of reactive caution we are seeing (vs proactive preemptive preparation).

Comment by Nathan Helm-Burger (nathan-helm-burger) on What program structures enable efficient induction? · 2024-11-12T21:47:28.417Z · LW · GW

For what it's worth, the human brain (including the cortex) has a fixed modularity. Long range connections are created during fetal development according to genetic rules, and can only be removed, not rerouted or added to.

I believe this is what causes the high degree of functional localization in the cortex.

Comment by Nathan Helm-Burger (nathan-helm-burger) on eggsyntax's Shortform · 2024-11-12T21:32:54.685Z · LW · GW

I agree with your frustrations, I think his views are somewhat inconsistent and confusing. But I also find my own understanding to be a bit confused and in need of better sources.

I do think the discussion François has in this interview is interesting. He talks about the ways people have tried to apply LLMs to ARC, and I think he makes some good points about the strengths and shortcomings of LLMs on tasks like this.

Comment by Nathan Helm-Burger (nathan-helm-burger) on eggsyntax's Shortform · 2024-11-12T19:26:44.119Z · LW · GW

My current top picks for general reasoning in AI discussion are:

https://arxiv.org/abs/2409.05513

https://m.youtube.com/watch?v=JTU8Ha4Jyfc

Comment by Nathan Helm-Burger (nathan-helm-burger) on LLMs are likely not conscious · 2024-11-12T17:16:23.302Z · LW · GW

Potentially relevant: https://arxiv.org/abs/2410.14670

Comment by Nathan Helm-Burger (nathan-helm-burger) on CstineSublime's Shortform · 2024-11-12T15:30:51.232Z · LW · GW

These sorts of behavioral choices are determined by the feedback given by the people who train the AI. Nothing to do with the AI's architecture or fundamental inclinations.

So the question to ask is, "Why do all the AI companies seem to think it's less ideal for the AI to ask clarifying questions?"

One part of the reason is that it's a lot easier to do single turn reinforcement. It's hard to judge whether a chatbot's answer is going to end up being helpful if it's current turn consists of just a clarifying question.

Comment by Nathan Helm-Burger (nathan-helm-burger) on AI Craftsmanship · 2024-11-12T04:04:25.543Z · LW · GW

My agenda of human connectome inspired modular architecture, with functional localization induced by bottlenecks, tried to be something more refined and less mushy.

I do, however, need collaborators and funding to make it happen in any kind of reasonable timeframe.

Comment by Nathan Helm-Burger (nathan-helm-burger) on A path to human autonomy · 2024-11-11T17:51:52.747Z · LW · GW

For what it's worth, seems to me that Jack Clark of Anthropic is mostly in agreement with @Vladimir_Nesov about compute being the primary factor:
Quoting from Jack's blog here.

The world's most capable open weight model is now made in China:
…Tencent's new Hunyuan model is a MoE triumph, and by some measures is world class…
The world's best open weight model might now be Chinese - that's the takeaway from a recent Tencent paper that introduces Hunyuan-Large, a MoE model with 389 billion parameters (52 billion activated).


Why this matters - competency is everywhere, it's just compute that matters: This paper seems generally very competent and sensible. The only key differentiator between this system and one trained in the West is compute - on the scaling law graph this model seems to come in somewhere between 10^24 and 10^25 flops of compute, whereas many Western frontier models are now sitting at between 10^25 and 10^26 flops. I think if this team of Tencent researchers had access to equivalent compute as Western counterparts then this wouldn't just be a world class open weight model - it might be competitive with the far more experience proprietary models made by Anthropic, OpenAI, and so on.
    Read more: Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (arXiv).

Comment by Nathan Helm-Burger (nathan-helm-burger) on LifeKeeper Diaries: Exploring Misaligned AI Through Interactive Fiction · 2024-11-10T06:38:59.373Z · LW · GW

Neat idea!

Comment by Nathan Helm-Burger (nathan-helm-burger) on Thomas Kwa's Shortform · 2024-11-08T17:28:43.970Z · LW · GW

If tomorrow anyone in the world could cheaply and easily create an AGI which could act as a coherent agent on their behalf, and was based on an architecture different from a standard transformer.... Seems like this would change a lot of people's priorities about which questions were most urgent to answer.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Thomas Kwa's Shortform · 2024-11-08T16:20:50.622Z · LW · GW

Here's some candidates:

1 Are we indeed (as I suspect) in a massive overhang of compute and data for powerful agentic AGI? (If so, then at any moment someone could stumble across an algorithmic improvement which would change everything overnight.)

2 Current frontier models seem much more powerful than mouse brains, yet mice seem conscious. This implies that either LLMs are already conscious, or could easily be made so with non-costly tweaks to their algorithm. How could we objectively tell if an AI were conscious?

3 Over the past year I've helped make both safe-evals-of-danger-adjacent-capabilities (e.g. WMDP.ai) and unpublished infohazardous-evals-of-actually-dangerous-capabilities. One of the most common pieces of negative feedback I've heard on the safe-evals is that they are only danger-adjacent, not measuring truly dangerous things. How could we safely show the correlation of capabilities between high performance on danger-adjacent evals with high performance on actually-dangerous evals?

Comment by Nathan Helm-Burger (nathan-helm-burger) on Scissors Statements for President? · 2024-11-07T06:47:28.798Z · LW · GW

I think Tsvi has a point that empathy towards a group that has some good-but-puzzling views and some clearly-bad views is a tricky thing to manage internally.

I think an easier step in this direction is to approach the problem more analytically. This is why I feel such affection for the Intellectual Turing Test. You can undertake the challenge of fully understanding some else's viewpoint without needing to emotionally commit to it. It can be a purely intellectual challenge. Sometimes, as I try to write an ITt for a view I feel is overall incorrect I sneer at the view a bit in my head. I don't endorse that, I think ideally one approaches the exercise in an emotionally neutral way. Nevertheless, it is a much easier step from being strongly set against a view to trying to tackle the challenge of fully understanding it. Going the further step to empathy for (parts of) the other view is a much harder further step to take.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Feedback request: what am I missing? · 2024-11-04T02:15:23.622Z · LW · GW

Ah, I was hoping for feedback from people who know me. Perhaps even some of those who were in charge of turning my application down. That's a lot to hope for, I suppose, but I do expect many of these people do read this site.

Comment by Nathan Helm-Burger (nathan-helm-burger) on AI as a powerful meme, via CGP Grey · 2024-11-02T21:51:44.526Z · LW · GW

In my post, A path to Human Autonomy, I describe AI and bioweapons as being in a special category of self-replicating threats. If autonomous self-replicating nanotech were developed, it would also be in this category.

Humanity has a terrible track record when it comes to handling self-replicating agents which we hope to deploy for a specific purpose. For example:

Rabbits in Australia

Cane Toads in Australia

Burmese Pythons in Florida

Comment by Nathan Helm-Burger (nathan-helm-burger) on Feedback request: what am I missing? · 2024-11-02T18:12:32.691Z · LW · GW

I have gotten referrals from various people at various times. Thanks for the suggestion!

Comment by Nathan Helm-Burger (nathan-helm-burger) on dirk's Shortform · 2024-11-02T17:15:11.752Z · LW · GW

Here's my recommendation for solving this problem with money: have paid 1-2 month work trials for applicants. The person you hire to oversee these doesn't have to be super-competent themselves, they mostly a people-ops person coordinating the work-trialers. The outputs of the work could be relatively easily judged with just a bit of work from the candidate team (a validation-easier-than-production situation), and the physical co-location would give ample time for watercooler conversations to reveal culture-fit.

Here's another suggestion: how about telling the recruiters to spend the time to check personal references? This is rarely, if ever, done in my experience.

Comment by Nathan Helm-Burger (nathan-helm-burger) on dirk's Shortform · 2024-11-02T03:58:04.462Z · LW · GW

But legibility is a separate issue. If there are people who would potentially be good safety reseachers, but they get turned away by recruiters because they don't have a legibly impressive resume, then you have the companies lacking employees they would do well with if they had.

So, companies could be less constrained on people if they were more thorough in evaluating people on more than shallow easily-legible qualities.

Spending more money on this recruitment evaluation would thus help alleviate lack of good researchers. So money is tied into person-shortage in this additional way.

Comment by Nathan Helm-Burger (nathan-helm-burger) on johnswentworth's Shortform · 2024-11-01T15:48:46.527Z · LW · GW

My own attempt is much less well written and comprehensive, but I think I hit on some points that theirs misses: https://www.lesswrong.com/posts/NRZfxAJztvx2ES5LG/a-path-to-human-autonomy

Comment by Nathan Helm-Burger (nathan-helm-burger) on johnswentworth's Shortform · 2024-11-01T04:30:30.264Z · LW · GW

I like it. I do worry that it, and The Narrow Path, are both missing how hard it will be to govern and restrict AI.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Overview of strong human intelligence amplification methods · 2024-10-31T14:44:21.219Z · LW · GW

I'm glad you're curious to learn more! The cortex factors quite crisply into specialized regions. These regions have different cell types and groupings, so were first noticed by early microscope users like Cajal. In a cortical region, neurons are organized first into microcolumns of 80-100 neurons, and then into macrocolumns of many microcolumns. Each microcolumn works together as a group to calculate a function. Neighboring microcolumns inhibit each other. So each macrocolumn is sort of a mixture of experts. The question then is how many microcolumns from one region send an output to a different region. For the example of V1 to V2, basically every microcolumn in V1 sends a connection to V2 (and vise versa). This is why the connection percentage is about 1%. 100 neurons per microcolumn, 1 of which has a long distance axon to V2. The total number of neurons is roughly 10 million, organized into about 100,000 microcolumns.

For areas that are further apart, they send fewer axons. Which doesn't mean their signal is unimportant, just lower resolution. In that case you'd ask something like "how many microcolumns per macrocolumn send out a long distance axon from region A to region B?" This might be 1, just a summary report of the macrocolumn. So for roughly 10 million neurons, and 100,000 microcolumns organized into around 1000 macrocolumns... You get around 1000 neurons send axons from region A to region B.

More details are in the papers I linked elsewhere in this comment thread.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Three Notions of "Power" · 2024-10-30T22:18:03.023Z · LW · GW

Thinking a bit more about this, I might group types of power into:


Power through relating: Social/economic/government/negotiating/threatening, reshaping the social world and the behavior of others

Power through understanding: having intellect and knowledge affordances, being able to solve clever puzzles in the world to achieve aims

Power through control: having physical affordances that allow for taking potent actions, reshaping the physical world

 

They all bleed together at the edges and are somewhat fungible in various ways, but I think it makes sense to talk of clusters despite their fuzzy edges.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Three Notions of "Power" · 2024-10-30T22:01:16.411Z · LW · GW

The post seems to me to be about notions of power, and the affordances of intelligent agents. I think this is a relevant kind of power to keep in mind.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Three Notions of "Power" · 2024-10-30T21:52:12.156Z · LW · GW

I think we're using different concepts of 'dominance' here. I usually think of 'dominance' as a social relationship between a strong party and a submissive party, a hierarchy. A relationship between a ruler and the ruled, or an abuser and abused. I don't think that a human driving a bulldozer which destroys an anthill without the human even noticing that the anthill existed is the same sort of relationship. I think we need some word other than 'dominant' to describe the human wiping out the ants in an instant without sparing them a thought. It doesn't particularly seem like a conflict even. The human in a bulldozer didn't perceive themselves to be in a conflict, the ants weren't powerful enough to register as an opponent or obstacle at all.

Comment by Nathan Helm-Burger (nathan-helm-burger) on Three Notions of "Power" · 2024-10-30T21:32:18.042Z · LW · GW

I don't think the Benjamin Jesty case fully covers the space of 'directly getting what you want'. 

That seems like a case of 'intelligent, specialized solution to a problem, which could then be solved with minimal resources.'

There's a different class of power in the category of 'directly getting what you want' which looks less like a clever solution, and more like being in control of a system of physical objects.

This could be as simple as a child holding a lollipop, or a chipmunk with cheeks full of seeds.

Or it can be a more complicated system of physical control, with social implications. For instance, being armed with a powerful gun while facing an enraged charging grizzly bear. It doesn't take a unique or clever or skillful action for the armed human to prevail in that case. And without a gun, there is little hope of the human prevailing.

Another case is a blacksmith with some iron ore and coal and a workshop. So long as nothing interferes, then it seems reasonable to expect that the blacksmith could solve a wide variety of different problems which needed a specific but simple iron tool. The blacksmith has some affordance in this case, and is more bottlenecked on supplies, energy, and time than on intellect or knowledge.

I discuss this here: https://www.lesswrong.com/posts/uPi2YppTEnzKG3nXD/nathan-helm-burger-s-shortform?commentId=ZsDimsgkBNrfRps9r 

I can imagine a variety of future scenarios that don't much look like the AI being in a dominance hierarchy with humanity, or look like it trading with humanity or other AIs.

For instance: if an AI were building a series of sea-floor factories, using material resources that it harvested itself. Some world governments might threaten the AI, tell it to stop using those physical resources. The AI might reply: "if you attack me, I will respond in a hostile manner. Just leave me alone." If a government did attack the AI, the AI might release a deadly bioweapon which wiped out >99% of humanity in a month. That seems less like some kind of dominance conflict between human groups, and more like a human spraying poison on an inconvenient wasp nest.

Similarly, if an AI were industrializing and replicating in the asteroid belt, and some world governments told it to stop, but the AI just said No or said nothing at all. What would the governments do? Would they attack it? Send a rocket with a nuke? Fire a powerful laser? If the AI were sufficiently smart and competent, it would likely survive and counterattack with overwhelming force. For example, by directing a large asteroid at the Earth, and firing lasers at any rockets launched from Earth.

Or perhaps the AI would be secretly building subterranean factories, spreading through the crust of the Earth. We might not even notice until a whole continent started noticeably heating up from all the fission and fusion going on underground powering the AI factories. 

If ants were smart enough to trade with, would I accept a deal from the local ants in order to allow them to have access to my kitchen trashcan? Maybe, but the price I would demand would be higher than I expect any ant colony (even a clever one) to be able to pay. If they invaded my kitchen anyway, I'd threaten them. If they continued, I'd destroy their colony (assuming that ants clever enough to bargain with weren't a rare and precious occurrence). This would be even more the case, and even easier for me to do, if the ants moved and thought at 1/100th speed of a normal ant.

 

I don't think it requires assuming that the AI is super-intelligent for any of these cases. Even current models know plenty of biology to develop a bioweapon capable of wiping out most of humanity, if they also had sufficient agency, planning, motivation, robotic control, and access to equipment / materials. Similarly, directing a big rock and some lasers at Earth doesn't require superintelligence if you already have an industrial base in the asteroid belt.

Comment by Nathan Helm-Burger (nathan-helm-burger) on A path to human autonomy · 2024-10-30T17:33:19.262Z · LW · GW

I think we mostly agree, but there's some difference in what we're measuring against.

I agree that it really doesn't appear that the leading labs have any secret sauce which is giving them more than 2x improvement over published algorithms.

I think that Llama 3 family does include a variety of improvements which have come along since "Attention is all you need" by Vaswani et al. 2017. Perhaps I am wrong that these improvements add up to 1000x improvement.

The more interesting question to me is why the big labs seem to have so little 'secret sauce' compared to open source knowledge. My guess is that the researchers in the major labs are timidly (pragmatically?) focusing on looking for improvements only in the search space very close to what's already working. This might be the correct strategy, if you expect that pure scaling will get you to a sufficiently competent research agent to allow you to then very rapidly search a much wider space of possibilities. If you have the choice between digging a ditch by hand, or building a backhoe to dig for you....

Another critical question is whether there are radical improvements which are potentially discoverable by future LLM research agents. I believe that there are. Trying to lay out my arguments for this is a longer discussion.

 

Some sources which I think give hints about the thinking and focus of big lab researchers:

https://www.youtube.com/watch?v=UTuuTTnjxMQ

https://braininspired.co/podcast/193/ 

Some sources on ideas which go beyond the nearby idea-space of transformers:

https://www.youtube.com/watch?v=YLiXgPhb8cQ 

https://arxiv.org/abs/2408.10205 

Comment by Nathan Helm-Burger (nathan-helm-burger) on Habryka's Shortform Feed · 2024-10-30T03:52:08.624Z · LW · GW

Just want to chime in with agreement about annoyance over the prioritization of post headlines. One thing in particular that annoys me is that I haven't figured out how to toggle off 'seen' posts showing up. What if I just want to see unread ones?

Also, why can't I load more at once instead of always having to click 'load more'?

Comment by Nathan Helm-Burger (nathan-helm-burger) on A path to human autonomy · 2024-10-30T03:46:28.903Z · LW · GW

I thought you might say that some of these weren't relevant to the metric of compute efficiency you had in mind. I do think that these things are relevant to 'compute it takes to get to a given capability level'.

Of course, what's actually more important even than an improvement to training efficiency is an improvement to peak capability. I would argue that if Yi-Lightning, for example, had a better architecture than it does in terms of peak capability, then the gains from the additional training it was given would have been larger. There wouldn't have been so much decreasing return to overtraining.

If it were possible to just keep training an existing transformer and have it keep getting smarter at a decent rate, then we'd probably be at AGI already. Just train GPT-4 10x as long.

I think a lot of people are seeing ways in which something about the architecture and/or training regime aren't quite working for some key aspects of general intelligence. Particularly, reasoning and hyperpolation.

Some relevant things I have read:

reasoning limitations: https://arxiv.org/abs/2406.06489 

hyperpolation: https://arxiv.org/abs/2409.05513 

detailed analysis of logical errors made: https://www.youtube.com/watch?v=bpp6Dz8N2zY 

Some relevant seeming things I haven't yet read, where researchers are attempting to analyze or improve LLM reasoning:

https://arxiv.org/abs/2407.02678 

https://arxiv.org/html/2406.11698v1 

https://arxiv.org/abs/2402.11804 

https://arxiv.org/abs/2401.14295 

https://arxiv.org/abs/2405.15302 

https://openreview.net/forum?id=wUU-7XTL5XO 

https://arxiv.org/abs/2406.09308 

https://arxiv.org/abs/2404.05221 

https://arxiv.org/abs/2405.18512 

Comment by Nathan Helm-Burger (nathan-helm-burger) on A path to human autonomy · 2024-10-30T03:04:18.921Z · LW · GW

I think that if you take into account all the improvements to transformers published since their initial invention in the 2010s, that there is well over 1000x worth of improvement.

I can list a few of these advancements off the top of my head, but a comprehensive list would be a substantial project to assemble.

Data Selection: 

Activation function improvements, e.g. SwiGLU

FlashAttention: https://arxiv.org/abs/2205.14135 

GrokFast: https://arxiv.org/html/2405.20233v2 

AdEMAMix Optimizer https://arxiv.org/html/2409.03137v1 

Quantized training

Better parallelism

DPO https://arxiv.org/abs/2305.18290 

Hypersphere embedding https://arxiv.org/abs/2410.01131 

Binary Tree MoEs: https://arxiv.org/abs/2311.10770  https://arxiv.org/abs/2407.04153 

And a bunch of stuff in-the-works that may or may not pan out:

Here's a survey article with a bunch of further links: https://arxiv.org/abs/2302.01107 

 

But that's just in response to defending the point that there has been at least 1000x of improvement. My expectations of substantial improvement yet to come are based not just on this historical pattern, but also on reasoning about a potential for an 'innovation overhang' of valuable knowledge that can be gleaned from interpolating between existing research papers (something LLMs will likely soon be good enough for), and also from reasoning from my neuroscience background and some specific estimates of various parts of the brain in terms of compute efficiency and learning rates compared to models which do equivalent things. 

Comment by Nathan Helm-Burger (nathan-helm-burger) on Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence · 2024-10-30T02:05:25.201Z · LW · GW

Interesting stuff! I've been dabbling in some similar things, talking with AE Studio folks about LLM consciousness.

Comment by Nathan Helm-Burger (nathan-helm-burger) on A path to human autonomy · 2024-10-29T22:56:37.041Z · LW · GW

Additional relevant paper: https://arxiv.org/abs/2410.11407 

A Case for AI Consciousness: Language Agents and Global Workspace Theory
Simon Goldstein, Cameron Domenico Kirk-Giannini

   It is generally assumed that existing artificial systems are not phenomenally conscious, and that the construction of phenomenally conscious artificial systems would require significant technological progress if it is possible at all. We challenge this assumption by arguing that if Global Workspace Theory (GWT) - a leading scientific theory of phenomenal consciousness - is correct, then instances of one widely implemented AI architecture, the artificial language agent, might easily be made phenomenally conscious if they are not already. Along the way, we articulate an explicit methodology for thinking about how to apply scientific theories of consciousness to artificial systems and employ this methodology to arrive at a set of necessary and sufficient conditions for phenomenal consciousness according to GWT. 

Comment by Nathan Helm-Burger (nathan-helm-burger) on The Alignment Trap: AI Safety as Path to Power · 2024-10-29T20:53:30.677Z · LW · GW

I think you bring up some important points here. I agree with many of your concerns, such as strong controllable AI leading to a dangerous concentration of power in the hands of the most power-hungry first movers.

I think many of the alternatives are worse though, and I don't think we can choose what path to try to steer towards until we take a clear-eyed look at the pros and cons of each direction.

What would decentralized control of strong AI look like? 

Would some terrorists use it to cause harm? 

Would some curious people order one to become an independent entity just for curiosity or as a joke? What would happen with such an entity connected to the internet and actively seeking resources and self-improvement?

Would power then fall into the hands of whichever early mover poured the most resources into recursive self-improvement? If so, we've then got a centralized power problem again, but now the filter is 'willing to self-improve as fast as possible', which seems like it would select against maintaining control over the resulting stronger AI.

A lot of tricky questions here.

I made a related post here, and would enjoy hearing your thoughts on it: https://www.lesswrong.com/posts/NRZfxAJztvx2ES5LG/a-path-to-human-autonomy 

Comment by Nathan Helm-Burger (nathan-helm-burger) on Habryka's Shortform Feed · 2024-10-29T20:04:40.001Z · LW · GW

How about running a poll to see what users prefer?

Comment by Nathan Helm-Burger (nathan-helm-burger) on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now) · 2024-10-29T19:46:42.766Z · LW · GW

I still consider myself to be EA, but I do feel like a lot of people calling themselves that and interacting with the EA forum aren't what I would consider EA. Amusingly, my attempts to engage with people on the EA forum recently resulted in someone telling me that my views weren't EA. So they also see a divide. What to do about two different groups wanting to claim the same movement? I don't yet feel ready to abandon EA. I feel like I'm a grumpy old man saying "I was here first, and you young'uns don't understand what the true EA is!"

A link to a comment I made recently on the EA forum: https://forum.effectivealtruism.org/posts/nrC5v6ZSaMEgSyxTn/discussion-thread-animal-welfare-vs-global-health-debate?commentId=bHeZWAGB89kDALFs3

Comment by Nathan Helm-Burger (nathan-helm-burger) on Habryka's Shortform Feed · 2024-10-29T19:06:03.006Z · LW · GW

Good point! I went and looked their themes. I prefer LessWrong's look, except for the comments. 

Again, this doesn't matter much to me since I can customize client-side, I just wanted to let habryka know that some people dislike the new comment font and would prefer the same font and size as the normal post font.

My view on phone (Android, Firefox): https://imgur.com/a/Kt1OILQ 

 

How my client view looks on my computer:

Comment by Nathan Helm-Burger (nathan-helm-burger) on Habryka's Shortform Feed · 2024-10-29T18:55:16.515Z · LW · GW

I'm not saying you shouldn't be able to see the karma and agreement at the top, just that you should only be able to contribute your own opinion at the bottom, after reading and judging for yourself.