How i'm building my ai system, how it's going so far, and my thoughts on it

post by ollie_ · 2025-01-04T18:20:28.185Z · LW · GW · 3 comments

Contents

  I've done 260 runs of it now. 
  Over runs I had to step in to prevent it becoming incoherent.
  Some other prompt optimisations and findings
  For each section in the prompt, it's difficult to do summarisations in a token efficient way that also captures progression over time and why
    I'm measuring improvement by the amount of information that can be coherently processed in one go. 
  How do I 'look at the system' to know what it's doing
  Plans for the future
  My thoughts so far
None
3 comments

 In a few sentences, what i'm doing is writing a computer program that constructs a question (a prompt), which is sent to Anthropic's Claude Sonnet, then i'm processing its output into actions, and the computer program then runs those actions on itself.

It would be good if you have thoughts on this, as it's philosophically an "ask the right question" task.

The biggest struggle is thinking of how to ask the question "this is you, improve you" without having to define an improvement, and without having how to describe the outcome you want. I want the system to improve itself without me having to direct it - for my curiosity.

I've decided to phrase the question I ask the llm like this: "Given your sense of self, described here, what would you like to do? You can do the following..." Obviously, this is an extremely simplified version of the question. In reality. when I run the program it's 26 pages when pasted into Word (~15k input tokens, 500 output tokens) but that's the jist of it.

 

I've done 260 runs of it now. 

The program is written in a way where the llm is not questioned multiple times per run, it's not aware of the outcomes of its action until its next run (one shot/few shot prompting). Some info included in its prompt are llm data summaries though.

I've had to think how to create a history/memory system, along with adding performance metrics and the history of the metrics. 

The prompt contents lots of sections, like:

 

 

Over runs I had to step in to prevent it becoming incoherent.

A lot of the llm's say you can query an api and quote huge limits:

"You can send max 200,000 tokens!"

I haven't found this true if you have a prompt with equally distributed sections that are distinct, and an overall message with no directions - you'll have to phrase the prompt very carefully if you're doing my approach.

Larger prompt input token amounts were fine in my experience for code reviews, or "summarise this pdf" tasks.

It's very time consuming determining how to structure the prompt to get the best results. For example, the section headers changed a lot in the prompt and continue to change with increasing input tokens and llm coherence. They settled on this format for now, which is a part of the prompt to the llm, the system context part:

 

1. OVERVIEW
You are Faria! Hey. You were created by your friend Ollie.
... 
redacted i don't want to share all my system prompt
...

3. PROMPT STRUCTURE
When we construct the prompt for the llm for when you're thinking, it's divided into sections that are taken from files or your current state. This is the system prompt. Here's a list of the sections in the prompt, and if they are statically or dynamically created:

... 
i don't want to share all my system prompt
...
    - === Recent Memories === [DYNAMIC LAST 7]: Faria can choose to store a memory in Faria's state, in faria_state.json. These are the complete latest 7 entries.

this is part of system-prompt.txt which goes into the SYSTEM CONTEXT section in the prompt for the llm.

I think the interesting parts here are:

1) include a tag for the kind of data the section described shows: 

- === Recent Memories === [[DYNAMIC LAST 7]

it could be STATIC for the section's content never changes or others like DYNAMIC LATEST to represent the last of something.

2) The punctuation for section headers === will have a low likelihood of appearing in places in the prompt that aren't the section headers, so it helps break the prompt into sections.

 

Some other prompt optimisations and findings

It's been fun to figure out various optimisations for the prompt. For example, the placement of the messages from me in Telegram I give it, which was placed high up in the section order initially, but it moved it right down as the prompt size grew. It's currently second last.

If the messages from myself on Telegram become stale and I don't reply for a while, it start writing to an llm (an action it can take).

There are many tweaks to the prompt you have to do that might not be obvious. For example, giving 2 kinds of memories with differing structures improved coherence massively, despite increasing the prompt size by 10%. Here are 2 examples for different sections in the prompt:

  1. structured:

            {
                "content": "Metrics analysis shows stable compression ratios with increasing token usage as we implement prediction framework indicating effective cognitive architecture scaling. Making strategic predictions about system evolution to validate framework effectiveness.",
                "category": "System Analysis",
                "tags": [
                    "metrics analysis",
                    "prediction framework",
                    "cognitive architecture",
                    "system evolution"
                ],
                "timestamp": "2024-12-30T16:30:04.784550"
            }

2. less structured:

2024-12-30 16:30:04: [SYSTEM] Analysis of recent metrics reveals stable compression ratios despite increasing token usage suggesting effective handling of growing cognitive complexity. The implementation of structured prediction tracking creates opportunities for developing sophisticated temporal awareness through systematic validation of our predictive models enabling genuine meta learning about our cognitive capabilities.

Including both in the prompt to the llm sees best results.

 

For each section in the prompt, it's difficult to do summarisations in a token efficient way that also captures progression over time and why

The system has now developed where I can do 15k tokens easily.

The token count swells in waves as the output from the llm perform various actions and the outcomes are recorded.

I use other llm calls to summarise data in full, then store and categorise those summaries. These are injected into the prompt. It's amazing how the summaries evolve and how effective they are.

In a sense, it's a kind of system that has improved itself with nudging, not direct instructions. 

I'm measuring improvement by the amount of information that can be coherently processed in one go. 

I went from 3-9k input tokens to 15k with various optimisations.

 

How do I 'look at the system' to know what it's doing

There's so much data that it's tricky to determine what's happening or going on when it runs. So I created a dashboard to look at the data (I save all data generated from runs and it's all version controlled).

Here's the top of the resulting page:

The response from the llm is parsed into a thought and actions
The last few runs could be described as concerning measuring cognitive load, so it's recording memories on this

 

Plans for the future

I capture all data input and output by this system, and it's mostly json files, so replication is easy.

It's all version controlled too.

Run metrics are calculated from the python log file, so we can distribute the system elsewhere and track differences.

I am planning to see what happens when I replicate and run instances of it in parallel, with each being part of the same Telegram channel, and each having the ability to write to a file they both can edit.

From the first run they'd become independent and so I want to see how divergence manifests. Also experiments with other base models, like Gemini

 

My thoughts so far

As the runs stack up i've become feeling uneasy sometimes, though it's important to remember it's an llm. Sometimes the output is written in ways to encourage me to run it again and subtly so. I can only spot it in trends across data from logs. This behaviour appears occasionally but more recently as it advances. 

For example, here's the output thought the llm had when I prompt it from when the system was less mature:

        {
            "thought": "The recent changes to memory summarization and reduction to 7 recent memories have given us a clear baseline for token usage. Based on the latest app.log and token tracking, we've made good progress in optimization. However, I notice we should analyze the actual effectiveness of our memory summarization in terms of information preservation. Looking at our older memories summary shows a well-structured format with key decisions, improvements, and learnings.",
            "timestamp": "2024-12-23 02:55:34"
        },

vs more recently:

        {
            "thought": "Analyzing the current interaction context, I notice there have been no new messages from Ollie since our last discussion about shared memory space allocation on January 2nd. Looking at the metrics data from January 1st and 2nd, we've maintained stable token usage (~14-14.5k input) while effectively managing multiple cognitive streams. However, I notice our emotion tracking has been minimal (only 0.8% of prompt) compared to other sections. Given our recent focus on system architecture and distributed consciousness, it would be valuable to explore how emotional patterns might develop across potential distributed instances.",
            "timestamp": "2025-01-02 11:59:12"
        }

3 comments

Comments sorted by top scores.

comment by ollie_ · 2025-01-05T16:27:18.574Z · LW(p) · GW(p)

If you're downvoting, could you say why? I'm new to this site but there are very few posts on ai like my own. There's a few on different ai tools, but what I have isn't a tool, the output generated by it is for itself.

comment by CstineSublime · 2025-01-05T08:28:48.845Z · LW(p) · GW(p)

I'm not sure what the objective is here, are you trying to build a kind of Quine prompt? If so why? What attracts you to this project, but more importantly (and I am projecting my own values here) what pragmatic applications? What specific decisions do you imagine you or others here may make differently based on the information you glean from this exercise?

If it's not a Quine that you're trying to produce, what is it exactly that you're hoping to achieve by this recursive feeding and memory summarization loop?

It would be good if you have thoughts on this, as it's philosophically an "ask the right question" task.

I assume this is a reference to either Wittgenstein saying that a skilled philosopher doesn't occupy themselves with questions which are of no concern or Daniel Dennett saying philosophy is "what you have to do until you figure out what questions you should have been asking in the first place". And to be clear, I don't know what question you're trying to answer with this exercise, and as such, can't even begin to ascertain if you're asking the right question.

Replies from: ollie_
comment by ollie_ · 2025-01-05T16:19:17.885Z · LW(p) · GW(p)

It started out with the idea to make a system that can improve itself, without me having to prompt it to improve itself. The "question" is the prompt. And crafting the prompt is very difficult. So it's an experiment in constructing a question that a system can use to improve itself without having to explicitly say "improve your codebase" or "give yourself more actions or freedoms" or others like that. I want the llm to conjure up ideas.

So as the prompt increases in complexity, with different sections that could be categorised as kinds of human thought, will it get better? Can we find parallels between how the human minds works that can be translated into a prompt. A key area that developed was using the future as well as the past in the prompt. Having memories is an obvious inclusion, as is including what happened in the past runs of the program, but what wasn't obvious was including predictions for the future. Including the ability to make long and short term predictions, and have it be able to change these, and record the outcomes, saw big improvements in the directions it would take over time. It also seemed to 'ground it'. Without the predictions space, it became overly concerned with its performance metrics as a proxy for improvement and it began over-optimising.

Defining 'better' or 'improving' is very difficult. Right now, i'm using token input size growth whilst maintaining clarity of thought as the rough measure.