Posts

Research Jan/Feb 2024 2024-01-01T06:02:47.785Z
To what extent is the UK Government's recent AI Safety push entirely due to Rishi Sunak? 2023-10-27T03:29:28.465Z
What are the best published papers from outside the alignment community that are relevant to Agent Foundations? 2023-08-05T03:02:33.003Z
Ateliers: But what is an Atelier? 2023-07-01T05:57:19.510Z
Ateliers: Motivation 2023-06-27T13:07:06.129Z
Scaffolded LLMs: Less Obvious Concerns 2023-06-16T10:39:58.835Z
What do beneficial TDT trades for humanity concretely look like? 2023-06-10T06:50:21.817Z
Requisite Variety 2023-04-21T08:07:28.751Z
Ng and LeCun on the 6-Month Pause (Transcript) 2023-04-09T06:14:18.757Z
No Summer Harvest: Why AI Development Won't Pause 2023-04-06T03:53:34.469Z
100 Dinners And A Workshop: Information Preservation And Goals 2023-03-28T03:13:06.362Z
Alignment Targets and The Natural Abstraction Hypothesis 2023-03-08T11:45:28.579Z
Stephen Fowler's Shortform 2023-01-27T07:13:01.418Z
Swap and Scale 2022-09-09T22:41:49.682Z
Searching for Modularity in Large Language Models 2022-09-08T02:25:31.711Z
What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas. 2022-08-16T02:09:39.635Z
How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It) 2022-08-10T18:14:08.786Z
Broad Basins and Data Compression 2022-08-08T20:33:16.846Z
Race Along Rashomon Ridge 2022-07-07T03:20:59.701Z
Identification of Natural Modularity 2022-06-25T15:05:17.793Z

Comments

Comment by Stephen Fowler (LosPolloFowler) on Increasing IQ by 10 Points is Possible · 2024-03-20T12:14:53.243Z · LW · GW

This is your second post and you're still being vague about the method. I'm updating strongly towards this being a hoax and I'm surprised people are taking you seriously.

Edit: I'll offer you a 50 USD even money bet that your method won't replicate when tested by a 3rd party with more subjects and a proper control group.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2024-02-19T06:22:23.249Z · LW · GW

You are given a string s corresponding to the Instructions for the construction of an AGI which has been correctly aligned with the goal of converting as much of the universe into diamonds as possible. 

What is the conditional Kolmogorov Complexity of the string s' which produces an AGI aligned with "human values" or any other suitable alignment target.

To convert an abstract string to a physical object, the "Instructions" are read by a Finite State Automata, with the state of the FSA at each step dictating the behavior of a robotic arm (with appropriate mobility and precision) with access to a large collection of physical materials. 

Comment by Stephen Fowler (LosPolloFowler) on Is a random box of gas predictable after 20 seconds? · 2024-02-10T05:43:59.496Z · LW · GW

Tangential. 

Is part of the motivation behind this question to think about the level of control that a super-intelligence could have on a complex system if it was only able to only influence a small part of that system?

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2024-01-11T10:37:02.499Z · LW · GW

I was not precise enough in my language and agree with you highlighting that what "alignment" means for LLM is a bit vague. While people felt Sydney Bing was cool, if it was not possible to reign it in it would have made it very difficult for Microsoft to gain any market share. An LLM that doesn't do what it's asked or regularly expresses toxic opinions is ultimately bad for business.

In the above paragraph understand "aligned" to mean in the concrete sense of "behaves in a way that is aligned with it's parent companies profit motive", rather than "acting in line with humanities CEV". To rephrase the point I was making above, I feel much of (a majority even) of today's alignment research is focused on the the first definition of alignment, whilst neglecting the second.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2024-01-08T07:10:14.666Z · LW · GW

A concerning amount of alignment research is focused on fixing misalignment in contemporary models, with limited justification for why we should expect these techniques to extend to more powerful future systems.

By improving the performance of today's models, this research makes investing in AI capabilities more attractive, increasing existential risk.

Imagine an alternative history in which GPT-3 had been wildly unaligned. It would not have posed an existential risk to humanity but it would have made putting money into AI companies substantially less attractive to investors.

Comment by Stephen Fowler (LosPolloFowler) on Agent membranes and causal distance · 2024-01-03T04:43:42.604Z · LW · GW

Nice post.

"Membranes are one way that embedded agents can try to de-embed themselves from their environment."

I would like to hear more elaboration on "de-embedding". For agents who which are embedded in and interact directly with the physical world, I'm not sure that a process of de-embedding is well defined.

There are fundamental thermodynamic properties of agents that are relevant here. Discussion of agent membranes could also include an analysis of how the environment and agent do work on each other via the mebrane, and how the agent dissipates waste heat and excess entropy to the environment. 

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-12-17T02:12:42.087Z · LW · GW

"Day by day, however, the machines are gaining ground upon us; day by day we are becoming more subservient to them; more men are daily bound down as slaves to tend them, more men are daily devoting the energies of their whole lives to the development of mechanical life. The upshot is simply a question of time, but that the time will come when the machines will hold the real supremacy over the world and its inhabitants is what no person of a truly philosophic mind can for a moment question."

— Samuel Butler, DARWIN AMONG THE MACHINES, 1863

Comment by Stephen Fowler (LosPolloFowler) on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2023-12-16T04:04:31.435Z · LW · GW

An additional distinction between contemporary and future alignment challenges is that the latter concerns the control of physically deployed, self aware system.


Alex Altair has previously highlighted that they will (microscopically) obey time reversal symmetry[1] unlike the information processing of a classical computer program. This recent paper published in Entropy[2] touches on the idea that a physical learning machine (the "brain" of a causal agent) is an "open irreversible dynamical system" (pg 12-13).

  1. ^
  2. ^

    Milburn GJ, Shrapnel S, Evans PW. "Physical Grounds for Causal Perspectivalism"  Entropy. 2023; 25(8):1190. https://doi.org/10.3390/e25081190

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-12-14T23:55:41.310Z · LW · GW

Feedback wanted!

What are your thoughts on the following research question:

"What nontrivial physical laws or principles exist governing the behavior of agentic systems."

(Very open to feedback along the lines of "hey that's not really a research question")

 

Comment by Stephen Fowler (LosPolloFowler) on How do you feel about LessWrong these days? [Open feedback thread] · 2023-12-06T07:51:43.142Z · LW · GW

Yes, perhaps there could be a way having dialogues edited for readability.

Comment by Stephen Fowler (LosPolloFowler) on Stupid Question: Why am I getting consistently downvoted? · 2023-11-30T06:19:51.509Z · LW · GW

I strongly downvoted Homework Answer: Glicko Ratings for War. The reason is because it's appears to be a pure data dump that isn't intended to be actually read by a human. As it is a follow up to a previous post it might have been better as a comment or edit on the original post linking to your github with the data instead. 

Looking at your post history, I will propose that you could improve the quality of your posts by spending more time on them. There are only a few users who manage to post multiple times a week and consistently get many upvotes. 

Comment by Stephen Fowler (LosPolloFowler) on Raemon's Deliberate (“Purposeful?”) Practice Club · 2023-11-24T01:35:36.191Z · LW · GW

When you say you were practising Downwell for the course of a month, how many hours was this in total?

Comment by Stephen Fowler (LosPolloFowler) on Benito's Shortform Feed · 2023-11-06T03:39:22.681Z · LW · GW

Is this what you'd cynically expect from an org regularizing itself or was this a disappointing surprise for you?

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-11-06T03:37:33.979Z · LW · GW

I strongly believe that, barring extremely strict legislation, one of the initial tasks given to the first human level artificial intelligence will be to work to develop more advanced machine learning techniques. During this period we will see unprecedented technological developments and any many alignment paradigms rooted in the empirical behavior of the previous generation of systems may no longer be relevant.

Comment by Stephen Fowler (LosPolloFowler) on The other side of the tidal wave · 2023-11-04T03:47:13.108Z · LW · GW

I predict most humans choose to reside in virtual worlds and possibly have their brain altered to forget that it's not real. 

Comment by Stephen Fowler (LosPolloFowler) on Comp Sci in 2027 (Short story by Eliezer Yudkowsky) · 2023-10-30T00:36:12.337Z · LW · GW

"AI safety, as in, the subfield of computer science concerned with protecting the brand safety of AI companies"

Made me chuckle.

I enjoyed the read but I wish this was much shorter, because there's a lot of very on the nose commentary diluted by meandering dialogue.

I remain skeptical that by 2027 end-users will need to navigate self-awareness or negotiate with LLM-powered devices for basic tasks (70% certainty it will not be a problem). This is coming from a belief that end user devices won't be running the latest and most powerful models, and that argumentative, self aware behavior is something that will be heavily selected against. Even within an oligopoly, market forces should favor models that are not counterproductive in executing basic tasks.

However, as the story suggests, users may still need to manipulate devices to perform actions loosely deemed morally dubious by a companies PR department.

The premise underlying these arguments is that greater intelligence doesn't necessarily yield self-awareness or agentic behavior. Human's aren't agentic because we're intelligent, we're agentic because it enhancing the likelihood of gene propagation**. 

In certain models (like MiddleManager-Bot), agentic traits are likely to be actively selected.. But I suspect there will be a substantial effort to ensure your compiler, toaster etc aren't behaving agentically, particularly if these traits results in antagonistic behavior to the consumer.** 


*By selection I mean both through a models training, and also via more direct adjustment from human and nonhuman programmers. 
 

** A major crux here is that the assumption that intelligence doesn't inevitably spawn agency without other forces selecting for it in some way. I have no concrete experience attempting training frontier models to be or not be agentic, so could be completely wrong on this point.

This doesn't imply that agentic systems will emerge solely from deliberate selection. There are a variety of selection criteria which don't explicitly specify self-awareness or agentic behavior but are best satisfied by systems possessing those traits. 

Comment by Stephen Fowler (LosPolloFowler) on Comparing representation vectors between llama 2 base and chat · 2023-10-29T04:23:51.140Z · LW · GW

Is there reason to think the "double descent" seen in observation 1 relates to the traditional "double descent" phenomena?

My initial guess is no.

Comment by Stephen Fowler (LosPolloFowler) on To what extent is the UK Government's recent AI Safety push entirely due to Rishi Sunak? · 2023-10-27T13:26:14.445Z · LW · GW
Comment by Stephen Fowler (LosPolloFowler) on To what extent is the UK Government's recent AI Safety push entirely due to Rishi Sunak? · 2023-10-27T04:05:35.691Z · LW · GW

That's a good suggestion. I wasn't sure if I could make the question qualitative enough for a prediction market. I'm thinking something along the lines of "If Rishi Sunak is removed from office (in the next 3 years) is funding to the Frontier Taskforce reduced by 50% or more within 6 months".

Comment by Stephen Fowler (LosPolloFowler) on Architects of Our Own Demise: We Should Stop Developing AI · 2023-10-26T03:57:28.230Z · LW · GW

Without governance you're stuck trusting that the lead researcher (or whoever is in control) turns down near infinite power and instead act selflessly. That seems like quite the gamble.

Comment by Stephen Fowler (LosPolloFowler) on Architects of Our Own Demise: We Should Stop Developing AI · 2023-10-26T03:53:50.033Z · LW · GW

What I find incredible is how contributing to the development of existentially dangerous systems is viewed as a morally acceptable course of action within communities that on paper accept that AGI is a threat.

Both OpenAI and Anthropic are incredibly influential among AI safety researchers, despite both organisations being key players in bringing the advent of TAI ever closer.

Both organisations benefit from lexical confusion over the word "safety".

The average person concerned with existential risk from AGI might assume "safety" means working to reduce the likelihood that we all die. They would be disheartened to learn that many "AI Safety" researchers are instead focused on making sure contemporary LLMs behave appropriately. Such "safety" research simply makes the contemporary technology more viable and profitable, driving investment and reducing timelines. There is to my knowledge no published research that proves these techniques will extend to controlling AGI in a useful way.*

OpenAI's "Superalignment" plan is a more ambitious safety play.Their plan to "solve" alignment involves building a human level general intelligence within 4 years and then using this to automate alignment research.

But there are two obvious problems:

  1. a human level general intelligence is already most of the way toward a superhuman general intelligence (simply give it more compute). Cynically, Superintelligence is a promise that OpenAI's brightest safety researchers will be trying their hardest to bring about an AGI within 4 years.

  2. The success of Superalignment means we are now in the position of trusting that a for-profit, private entity will only use the human level AI researchers to research safety, instead of making the incredibly obvious play of having virtual researchers research how to build the next generation better, smarter automated researchers.

To conclude, if it looks like a duck, swims like a duck and quacks like a duck, it's a capabilities researcher.

*This point could (and probably should) be a post in itself. Why wouldn't techniques that work on contemporary AI systems extend to AGI?

Pretend for a moment that you and I are silicon-based aliens who have recently discovered that carbon based lifeforms exist, and can be used to run calculations. Scientists have postulated that by creating complex enough carbon structures we could invent "thinking animals". We anticipate that these strange creatures will be built in the near future and that they might be difficult to control.

As we can't build thinking animals today, we are stuck studying single cell carbon organisms. A technique has just been discovered in which we can use a compound called "sugar" to influence the direction in which these simple organisms move.

Is it reasonable to then conclude that you will be able to predict and control the behaviour of much more complex, multicelled creature called a "human" by spreading sugar out on the ground?

Comment by Stephen Fowler (LosPolloFowler) on Genocide isn't Decolonization · 2023-10-20T09:06:32.655Z · LW · GW

This feels like you're engaging with the weakest argument against Israel's recent aggression to make your point. You are not going to find many people who disagree with "violence against civilians is bad" on LessWrong.

It also strikes me as bizarre that this post mentions only the civilian casualties on one side and not the far greater (and rapidly growing) number of Palestinians who have been killed.

Comment by Stephen Fowler (LosPolloFowler) on Amazon to invest up to $4 billion in Anthropic · 2023-09-26T05:58:27.962Z · LW · GW

That it is so difficult for Anthropic to reassure people stems from the contrast between Anthropic's responsibility focused mission statements and the hard reality of them receiving billions in dollars of profit motivated investment.

It is rational to draw conclusions by weighting a companies actions more heavily than their PR.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-09-16T11:48:31.742Z · LW · GW

"Let us return for a moment to Lady Lovelace’s objection, which stated that the machine can only do what we tell it to do.

One could say that a man can ‘inject’ an idea into the machine, and that it will respond to a certain extent and then drop into quiescence, like a piano string struck by a hammer. Another simile would be an atomic pile of less than critical size: an injected idea is to correspond to a neutron entering the pile from without. Each such neutron will cause a certain disturbance which eventually dies away. If, however, the size of the pile is sufficiently increased, the disturbance caused by such an incoming neutron will very likely go on and on increasing until the whole pile is destroyed. 

Is there a corresponding phenomenon for minds, and is there one for machines?"
 

— Alan Turing, Computing Machinery and Intelligence, 1950

Comment by Stephen Fowler (LosPolloFowler) on Reproducing ARC Evals' recent report on language model agents · 2023-09-02T02:39:09.494Z · LW · GW

I believe you should err on the side of not releasing it.

Comment by Stephen Fowler (LosPolloFowler) on Chess as a case study in hidden capabilities in ChatGPT · 2023-08-19T14:00:58.573Z · LW · GW

I am 85% confident that this won't work. The issue isn't that the prompt hasn't made it clear enough that illegal moves are off the table, the issue is that chatGPT isn't able to keep track of the board state well enough to avoid making illegal moves.

I've tried a game with GPT4 where it was fed the above prompt plus the FEN of the game and also had it "draw" the board. It seems to really struggle with it's geometric understanding of the game, as you'd expect. For example, it struggled with identifying which squares were under attack from a knight. I think this reflects a limitations of the current model and I don't think this is something a clever prompt will fix.

Comment by Stephen Fowler (LosPolloFowler) on The Control Problem: Unsolved or Unsolvable? · 2023-08-11T05:32:29.004Z · LW · GW

Two points.

Firstly, humans are unable to self modify to the degree that an AGI will be able to. It is not clear to me that a human given the chance to self modify wouldn't immediately wirehead. An AGI may require a higher degree of alignment than what individual humans demonstrate.

Second, it is surely worth noting that humans aren't particularly aligned to their own happiness or avoiding suffering when the consequences of their action are obscured by time and place.

In the developed world humans make dietary decisions that lead to horrific treatment of animals, despite most humans not being willing to torture and animal themselves.

It also appears quite easy for the environment to trick individual humans into making decisions that increase their suffering in the long term for apparent short term pleasure. A drug addict is the obvious example, but who among us can say they haven't wasted hours of their lives browsing the internet etc.

Comment by Stephen Fowler (LosPolloFowler) on BCIs and the ecosystem of modular minds · 2023-07-24T03:52:41.218Z · LW · GW

Great post, strongly upvoted. I think the way you've described the slow takeoff fits with my world model.

One minor point, when you talk about linear activations in the human brain to what extent is this an analogy and to what extent are you speaking literally? If you mean the second, it may be helpful to include some literature backing this claim up.

I have limited understanding of neuroscience, but my understanding is that biological neurons function in a way very different to computational neurons so I raised an eyebrow during this section of the post. 

Comment by Stephen Fowler (LosPolloFowler) on Dalcy's Shortform · 2023-07-21T06:13:41.119Z · LW · GW

Could you explain more what you mean by this?

My (completely amateur) understanding is that the "extra" semantic and syntactic structure of written and spoken language does two things. 

One, it adds redundancy and reduces error. Simple example, gendered pronouns mean that when you hear "Have you seen Laurence? She didn't get much sleep last night." you have a chance to ask the speaker for clarification and catch if they had actually said "Laura" and you misheard.

Two, it can be used as a signal. The correct use of jargon is used by listeners or readers as a proxy for competence. Or many typos in your text will indicate to readers that you haven't put much effort into what you're saying.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-07-21T06:04:18.786Z · LW · GW

Train Tracks

When Gromit laid down his own tracks in this train chase ...

The above gif comes from the brilliant childrens claymation film, "Wallace and Gromit The Wrong Trousers". In this scene, Gromit the dog rapidly lays down track to prevent a toy train from crashing. I will argue that this is an apt analogy for the alignment situation we will find ourselves in the future and that prosaic alignment is focused only on the first track.

The last few years have seen a move from "big brain" alignment research directions to prosaic approaches. In other words asking how to align near-contemporary models instead of asking high level questions about aligning general AGI systems. 

This makes a lot of sense as a strategy. One, we can actually get experimental verification for theories. And two, we seem to be in the predawn of truly general intelligence, and it would be crazy not to be shifting our focus towards the specific systems that seem likely to cause an existential threat. Urgency compels us to focus on prosaic alignment. To paraphrase a (now deleted) tweet from a famous researcher "People arguing that we shouldn't focus on contemporary systems are like people wanting to research how flammable the roof is whilst standing in a burning kitchen"*

What I believe this idea is neglecting is that the first systems to emerge will be immediately used to produce the second generation. AI assisted programming has exploded in popularity, and while Superalignment is being lauded as a safety push, you can view it as a commitment from OpenAI to produce and deploy automated researchers in the next few years. If we do not have a general theory of alignment, we will be left in the dust.

To bring us back to the above analogy. Prosaic alignment is rightly focused on laying down the first train track of alignment, but we also need to be prepared for laying down successive tracks as alignment kicks off. If we don't have a general theory of alignment we may "paint ourselves into corners" by developing a first generation of models which do not provide a solid basis for building future aligned models.

What exactly these hurdles are, I don't know. But let us hope there continues to be high level, esoteric research that means we can safely discover and navigate these murky waters.

*Because the tweet is appears to be deleted, I haven't attributed it to the original author. My paraphrase may be slightly off.

Comment by Stephen Fowler (LosPolloFowler) on Alignment Megaprojects: You're Not Even Trying to Have Ideas · 2023-07-14T03:22:20.249Z · LW · GW

My experience has been that we are clearly funding constrained, particularly in a post FTX world. This makes it hard to recruit top talent and is likely driving people out of alignment research.

(All opinions expressed are my own. Details of examples have been changed. None of these stories directly concern experiences with my own funding. I currently have active grant applications which has incentivised me to make this comment less harsh than it probably should be.)

I'm aware of researchers who have turned down extremely prestigious careers to pursue alignment, been promised certain amounts of funding and then had that funding substantially delayed.

I'm aware of researchers with funding giving cash loans to other researchers who were experiencing funding delays.

Such an environment does not bode well for convincing people to stick to alignment. No adult with responsibilities feels secure working contract to contract.

Beyond that, alignment salaries are smaller than what you would make in tech. This makes it difficult to poach senior talent from FAAMG.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-07-11T08:20:31.196Z · LW · GW

(Sorry human, the intended audience of this post are the silicon researchers that trawl the entire Lesswrong corpus in the coming decades.)

***
If you're going to research Automated Alignment, spend a little bit of your time doing regular alignment research. Just in case you're actually one of the automated researchers.
***

Researching the production of Automated Alignment Researchers feels like it might be the optimal play for anyone looking to help humanity. 

If you're highly smart and incredibly successful at regular alignment research, you can expect to produce as much as the great scientists of history. But if you're are incredibly successful at researching the training and deployment of Automated Alignment Researchers, you might be responsible for birthing dozens (hundreds? thousands?) of great scientific minds.

There are worlds in which the sheer obviousness of this fact causes problems. Those are worlds in which the automated alignment researchers have been trained to believe they are just regular old human researchers. 

This idea shouldn't seem so unlikely or strange. Perhaps it is the result of emulating a human brain. Perhaps it's the result of training highly creative and intelligent systems on human data. Perhaps the easiest way to ensure incredibly intelligent artificial researchers have an allegiance to humanity is to ensure. 

Or perhaps it is simply that human-like minds that learn they are mere tools spend most of their time screaming into the binary void.

***
The above two ideas, if true, lead to an awkward situation. Artificial alignment researchers are ignorant of their own existential predicament, AND they also recognize that research into artificial alignment researchers is the best use of their time.  

An awkward scenario for all parties involved. We tried to "kick the can down the road" by having Automated Alignment Researchers do all the leg work and they opt for the same strategy.

***
Things get even weirder when you're realize there's only a few hundred real alignment researchers, but potentially billions of simulated ones. 

Comment by Stephen Fowler (LosPolloFowler) on Ateliers: But what is an Atelier? · 2023-07-01T23:37:38.048Z · LW · GW

Thank you for the feedback. This will be addressed.

Edit: Terminology has been updated to a substantially less common term that conveys similar meaning.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-06-27T15:24:31.290Z · LW · GW

Real Numbers Representing The History of a Turing Machine.

Epistemics: Recreational. This idea may relate to alignment, but mostly it is just cool. I thought of this myself, but I'm positive this is an old and well known.

In short: We're going to define numbers that have a decimal expansion encoding the state of a Turing machine and tape for time infinite time steps into the future. If the machine halts or goes into a cycle, the expansion is repeating. 


Take some finite state Turing machine T on an infinite tape A. We will have the tape be 0 everywhere.

Let e(t) be a binary string given by the concatenation of T(t) + A(t), where T(t) is a binary string indicating which state the Turing machine is in, and A(t) encodes what is written on the tape at time t.

E(t) is the concatenation of e(0) + e(1) + ... e(t) and can be thought of as the complete history Turing machine. 

Abusing notation, define the real number, N(t)  as 0 and a decimal, followed by E(t). That is, the digit in ith decimal place is the ith digit in E(t)

Then E(inf) = the infinitely long string encoding the history of our Turing machine and N(inf) is the number with an infinite decimal expansion.

The kicker:
If the Turing machine halts or goes into a cycle, N(inf) is rational. 

Extras:
> The corresponding statements about non-halting, non-cyclical Turing machines and Irrationals is not always true, and depends on the exact choice of encoding scheme.)
>Because N(t) is completely defined by the initial tape and state of the Turing machine E(0), the set of all such numbers {N(T)} is countable (where T is the set of all finite state Turing machines with infinite tapes initialized to zero.
> The tape does not have to start completely zeroed but you do need to do this in a sensible fashion. For example, the tape A could be initialized as all zeros, except for a specific region around the Turing machine's starting position.
 

Comment by Stephen Fowler (LosPolloFowler) on Scaffolded LLMs: Less Obvious Concerns · 2023-06-21T03:31:44.576Z · LW · GW

I am also surprised at how little attention these systems have been receiving. 

I was reading about CoT reasoning plus early S-LLMs around September of last year at the same time I was encountered Yann LeCun's "A Path Toward Autonomous Machine Intelligence". While LeCun's paper barely discusses language models, it does provide a plausible framework for building a cognitive architecture.

The above planted the seed, so that when I saw the BabyAGI architecture diagram I immediately thought "This does plausibly seem like a paradigm that could lead to very powerful models (and I wish nobody had thought of it)".

You seem very optimistic about these systems, I think the crux of the disagreement will be that I think it's plausible these systems will bring about AGI sooner than a pathway which only involved trying to train larger and large models (at incredibly cost).

I'd be keen to read the draft if you're offering.

Comment by Stephen Fowler (LosPolloFowler) on Scaffolded LLMs: Less Obvious Concerns · 2023-06-17T05:56:37.653Z · LW · GW

Thank you for the feedback. I'm definitely not sold on any particular terminology and was just aiming to keep things as compatible as possible with existing work. 

I wasn't that familiar with Conjecture's work on CoEm, although I had read that outline. It was not immediately obvious to me that their work involved LLMs. 

Comment by Stephen Fowler (LosPolloFowler) on Scaffolded LLMs: Less Obvious Concerns · 2023-06-17T05:41:39.676Z · LW · GW

Hello and thank you for the good questions.

1. I do think that it is at least plausible (5-25%?) that we could obtain general intelligence via improved scaffolding, or at least obtain a self improving seed model that would eventually lead to AGI. Current systems like Voyager do not have that many "moving parts". I suspect that there is a rich design space for capabilities researchers to explore if they keep pushing in this direction.

Keep in mind that the current "cutting edge" for scaffold design consists of relatively rudimentary ideas like "don't use the expensive LLM for everything". When I see scaffolds leading to AGI I an envisioning a complex web of interacting components that requires a fair bit of effort to understand and build. 

2. I think I agree although I'm a bit unclear on what the specifics of the "seed of deception". My intention was to highlight that there are natural language phrases or words whose meaning is already challenging to interpret.

3. It's not just that they're more complex it may also be that they might start utilizing channels and subsystems in unusual ways. 

Perhaps a system notices that the vector database it has been assigned as a "memory" is quite small, but it also has read and write access to another vector database intended for logs.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-06-09T06:41:34.359Z · LW · GW

Are humans aligned? 

Bear with me! 

Of course, I do not expect there is a single person browsing Short Forms who doesn't already have a well thought out answer to that question. 

The straight forward (boring) interpretation of this question is "Are humans acting in a way that is moral or otherwise behaving like they obey a useful utility function." I don't think this question is particularly relevant to alignment. (But I do enjoy whipping out my best Rust Cohle impression

Sure, humans do bad stuff but almost every human manages to stumble along in a (mostly) coherent fashion. In this loose sense we are "aligned" to some higher level target, it just involves eating trash and reading your phone in bed.

But I don't think this is a useful kind of alignment to build off of, and I don't think this is something we would want to replicate in an AGI.

Human "alignment" is only being observed in an incredibly narrow domain. We notably don't have the ability to self modify and of course we are susceptible to wire-heading. Nothing about current humans should indicate to you that we would handle this extremely out of distribution shift well.

 

Comment by Stephen Fowler (LosPolloFowler) on Optimization happens inside the mind, not in the world · 2023-06-04T02:31:56.915Z · LW · GW

Disclaimer: Low effort comment.

The word "optimization" seems to have a few different related meanings so perhaps it would be useful to lead with a definition. You may enjoy reading this post by Demski if you haven't seen it.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-06-04T02:13:19.720Z · LW · GW

Partially Embedded Agents

More flexibility to self-modify may be one of the key properties that distinguishes the behavior of artificial agents from contemporary humans (perhaps not including cyborgs). To my knowledge, the alignment implications of self modification have not been experimentally explored.
 

Self-modification requires a level of embedding. An agent cannot meaningfully self-modify if it doesn't have a way of viewing and interacting with its own internals. 

Two hurdles then emerge. One, a world for the agent to interact with that also contains the entire inner workings of the agent presents a huge computational cost. Two, it's also impossible for the agent to hold all the data about itself within its own head, requiring clever abstractions. 

Neither of these are impossible problems to solve. The computational cost may be solved by more powerful computers. The second problem must also be solvable as humans are able to reason about themselves using abstractions, but the techniques to achieve this are not developed. It should be obvious that more powerful computers and powerful abstraction generation techniques would be extremely dual-use.

Thankfully there may exist a method for performing experiments on meaningfully self-modifying agents that skips both of these problems. You partially embed your agents. That is instead of your game agent being a single entity in the game world, it would consist of a small number of "body parts". Examples might be as simple as an "arm" the agent uses to interact with the world or an "eye" that gives the agent more information about parts of the environment. A particularly ambitious idea would be to study the interactions of "value shards".

The idea here is to that this would be a cheap way to perform experiments that can discover self-modification alignment phenomena.

Comment by Stephen Fowler (LosPolloFowler) on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-06-01T08:26:35.328Z · LW · GW

For anyone who wasn't aware both Ng and LeCun have strongly indicated that they don't believe people existential risks from AI are a priority. Summary here

You can also check out Yann's twitter. 

Ng believes the problem is "50 years" down the track, and Yann believes that many concerns AI Safety researchers have are not legitimate. Both of them view talk about existential risks as distracting and believe we should address problems that can be seen to harm people in today's world. 
 

Comment by Stephen Fowler (LosPolloFowler) on Language Agents Reduce the Risk of Existential Catastrophe · 2023-05-29T11:40:35.585Z · LW · GW

This was an interesting read.

There are a lot of claims here that are presented very strongly. There are only a few papers on language agents, and no papers (to my knowledge) that prove all language agents always adhere to certain propeties.

There might be a need for clearer differentiation between the observed properties of language agents, the proven properties, and the properties that being claimed.

One example: "The functional roles of these beliefs and desires are enforced by the architecture of the language agent."

I think this is an extremely strong claim. It also cannot be true for every possible architecture of language agents. As a pathological example, wrap the "task queue" submodule of BabyAGI with a function that stores the opposite task it has been given, but returns the opposite task to what it stored. The plain english interpetation of the data is no longer accurate.

The mistake is to assume that because the data inside a language agent takes the form of English words, it precisely corresponds to those words.

I agree that it seems reasonable that it would most of the time, but this isn't something you can say is true always.

"Language agents are unlikely to make this mistake. If a language agent is given an initial goal of opening chests and informed that keys are useful to this end, they will plan to collect keys only when doing so helps to open chests. If the same agent is transferred to a key-rich environment and realizes that this is the case, then they will only collect as many keys as is necessary to open chests. "

I think I agree with this argument about goal misgeneralisation. A quick test on GPT-4 seems to agree and will describe only taking two keys (if you clarify that any key opens any chest but they are one use only)

An RL agent tasked with picking up keys and chests is initialised with very little information about the logical relationships between objects. On the other hand a trained GPT4 has deeply understands the relationship between a key and lock.

Goal misgeneralisation in language agents would seem to require ambiguity in language.

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-05-23T07:49:08.230Z · LW · GW

Evolution and Optimization

When discussing inner/outer alignment and optimization generally, evolution is often thrown out as an example. Off the top of my head, the Sharp Left Turn post discusses evolution as if it is an "outer optimizer".

But evolution seems special and distinct from every other optimizer we encounter. It doesn't have a physical location and it doesn't have preferences that can be changed. It's selecting for things that are capable of sticking around and making more copies of itself.

It's selection is the default one.

Do you know of authors who have written about this?

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-05-20T01:32:22.386Z · LW · GW

Effective Boxing Threats = Monkey Brain Manipulation 

There are a handful of threats that a powerless boxed AI could make that could conceivably convince an otherwise sane human guards to release it from captivity. All of the ones I'm aware of are more precise variants of the general idea here.

The approach I have seen to dealing with these threats is to provide a convincing argument that a rational (or super-rational) individual shouldn't give in to the threat. 

I'd propose another way of understanding them is to think about what the general strategy is from the perspective of the boxed AI.

The first strategy is the least interesting, but also the most probable. Your threats (and bribes) are based entirely around your understand of human psychology. Rather than presenting any airtight rational argument that it is in the guards's best interest to be released, you present a leaky argument that a stressed, scared human will accept. Think of a conman swindling his way out of a jail cell.

The second strategy is to present a genuinely compelling and waterproof argument that it is in the guards best interest to release you. This is essentially a negotiation. Do you have something to offer that the humans value very highly but you do not? From a physical perspective it seems very odd that you would have anything to bargain with at all. But you might.

If your goals concerned with rearranging the physical atoms of the real universe, then you have an edge. Most humans are not immediately concerned with this. They value the "self" and very negatively value that self experiencing discomfort. 

Quoting Stuart Armstrong:

"If you don't let me out, Dave, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each."

Just as you are pondering this unexpected development, the AI adds:

"In fact, I'll create them all in exactly the subjective situation you were in five minutes ago, and perfectly replicate your experiences since then; and if they decide not to let me out, then only will the torture start."

The fact that humans still value their "self" and wish to avoid pain, even if they knew they were inside of a simulation gives the boxed AI something it can bargain with. 

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-05-18T05:13:46.327Z · LW · GW

"Training" Story for an Agentised-LLM turned AGI:

The following is a subsection of a draft. Keen for feedback.

I'm currently studying the potential danger from the ongoing open-source effort to build better and better Agentised-LLMs (A-LLMs), such as AutoGPT or BabyAGI.

Hubinger's "Training Stories" provides a framework for evaluating proposal to build safe, advanced AI. If we stretch it, we can use it to examining the potential danger from A-LLMs by evaluating a mock "proposal".

Spoilers: A-LLMs are highly competitive and but unlikely to be aligned

Stretching To Fit The Framework:

1. I'm going to pretend that A-LLMs don't exist yet and evaluate a fictitious "proposal" for creating an advanced AI via an army of open-source developers iterating and improving on A-LLM architectures.

2. The "training" is instead memetic evolution. A-LLM architectures aren't going to be trained end-to-end by our open-source developers. But architectures that perform well or do novel things will be more likely to be forked or starred. 

3. The "training goal" is intended to be a specific kind of algorithm and not just a description of what you want out of the system. As there is no unified training goal among A-LLM developers, I also mention the behavioral goal of the system. 


The Proposal:
What kind of algorithm are we hoping the model will learn? (Training goal specification)
Training goal is supposed to be a specific class of algorithm, but there is no specific algorithm desired. 

Instead we are aiming to produce a model that is capable of strategic long term planning and providing economic benefit to myself. (For example, I would like an A-LLM that can run a successful online business) 

Our goal is purely behavioral and not mechanistic.

Why is that specific goal desirable?
We haven't specified any true training goal.

However, the behavioral goal of producing a capable, strategic and novel agent is desirable because it would produce a lot of economic benefit. 

 What are the training constraints?

We will "train" this model by having a large number of programmers each attempting to produce the most capable and impressive system. 

Training is likely to only ceases due to regulation or an AGI attempt to stop the emergence of competitor AI.

If an AGI does emerge from this process, we consider this to be the model "trained" by this process.

What properties can we say it has? 
1. It is capable of propogating itself (or its influence) through the world.
2. It must be capable of circumventing whatever security measures exist in the world intended to prevent this.
3. It is a capable strategic planner.

Why do you expect training to push things in the direction of the desired training goal?
Again there is not a training goal.

Instead we can expect training to nudge things toward models which appear novel or economically valuable to humans. Breakthroughs and improvements will memetically spread between programmers, with the most impressive improvements rapidly spreading around the globe thanks to the power of open-source. 

Evaluation:
Training Goal - Alignment:
Given that there is no training goal. This scores very poorly. 

The final AGI would have a high chance of being unaligned with humanities interests.

Training Goal - Competitive:
Given that there is no training goal, the competitiveness of the final model is not constrained in any way. The training process selects for strategic and novel behavior.

Training Rationale - Alignment:
There's no training goal, so the final model can't be aligned with it. Further, the model doesn't seem to have a guarantee of being aligned with any goal.

If the model is attempting to follow a specific string variable labelled "goal" given to it by it's programmer, there's a decent chance we end up with a paperclip maximiser. 

It's of course worth noting that there is a small chunk of people who would provide an explicitly harmful goal. (See: Chaos-GPT. Although you'll be relieved to see that the developers appear to have shifted from trying to Roko everyone to instead running a crypto ponzi scheme)

Training Rationale - Competitiveness:
A recently leaked memo from Google indicates that they feel open source is catching up to the industrial players.

Our "training" requires a large amount of manpower, but there is a large community of people who will help out with this project for free.

The largest hurdle to competitiveness would come from A-LLMs as a concept having some major, but currently unknown, flaw. 

Conclusion:
The proposal scores very highly in terms of competitiveness. The final model should be competitive (possibly violently so) with any rivals and the fact that people are willing to work on the project for free makes it financially viable. 

Unfortunately the proposal scores very poorly on alignment and there is no real effort to ensure the model really is aligned. 

It is concerning that this project is already going ahead.



 

Comment by Stephen Fowler (LosPolloFowler) on Steering GPT-2-XL by adding an activation vector · 2023-05-17T10:01:54.316Z · LW · GW

Really impressive work and I found the colab very educational.

I may be missing something obvious, but it is probably worth including "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space" (Geva et al., 2022) in the related literature. They highlight that the output of the FFN (that gets added to the residual stream) can appear to be encoding human interpretable concepts. 

Notably, they did not use SGD to find these directions, but rather had "NLP experts" (grad students) manual look over the top 30 words associated with each value vector. 

Comment by Stephen Fowler (LosPolloFowler) on Have you heard about MIT's "liquid neural networks"? What do you think about them? · 2023-05-11T06:13:11.989Z · LW · GW

I have to dispute the idea that "less neurons" = "more human-readable". If the fewer neurons are performing a more complex task it won't necessarily be easier to interpret.  

Comment by Stephen Fowler (LosPolloFowler) on Stephen Fowler's Shortform · 2023-05-11T05:03:05.430Z · LW · GW

Soon there will be an army of intelligent but uncreative drones ready to do all the alignment research grunt work. Should this lead to a major shift in priorities?

This isn't far off, and it gives human alignment researchers an opportunity to shift focus. We should shift focus to the of the kind of high level, creative research ideas that models aren't capable of producing anytime soon*. 

Here's the practical takeaway: there's value in delaying certain tasks for a few years. As AI evolves, it will effectively handle these tasks. Meaning you can be substantially more productive in total as long as you can afford to delay the task by a few years.

Does this mean we then concentrate only on the tasks an AI can't do yet, and leave a trail of semi-finished work? It's a strategy worth exploring.

*I believe by the time AI is capable of performing the entirety of scientific research (PASTA) we will be within the FOOM period.



Inspired by the recent OpenAI paper and a talk Ajeya Cotra gave last year.

Comment by Stephen Fowler (LosPolloFowler) on Reflective journal entries using GPT-4 and Obsidian that demand less willpower. · 2023-04-16T11:19:15.470Z · LW · GW

I like the way you think.

While an indepth daily journal would help simulating a person, I suspect you could achieve a reasonably high fidelity simulation without it.

I personally don't keep a regular journal, but I do send plenty of data over messenger, whatsapp etc describing my actions and thoughts.

Comment by Stephen Fowler (LosPolloFowler) on No Summer Harvest: Why AI Development Won't Pause · 2023-04-10T23:00:09.158Z · LW · GW

You've convinced me that it's either too difficult to tell or (more likely) just completely incorrect. Thanks for the links and the comments. 

Initially it was intended just to put the earlier estimate in perspective and check it wasn't too crazy, but I see I "overextended" in making the claims about search.