Readability is mostly a waste of characters

post by vlad.proex · 2023-04-21T22:05:34.907Z · LW · GW · 7 comments

Contents

  Readability and information
  Compressing prompts
  Conclusion
7 comments

Based on this research, I have created a free web app for compressing GPT prompts at gptrim.com. gptrim is also a Python library. Contributions are welcome.

Readability and information

The other day I was trying to feed a large text to ChatGPT, and I grew frustrated with the limits of the chat message size. After some brainstorming with Bing, I had an insight. 

Not all tokens in a text convey information. Some are there to make it easier for humans to parse text. Consider spaces: they are mostly used to tokenize words for the human visual cortex. ButifIwrotewithoutspacesyouwouldprobablystillunderstandme. 

There are very few cases where a missing space would cause ambiguity. The nicest one ChatGPT could come up with was: 

"Iamnowhere" could be interpreted as "I am nowhere" or "I am now here," which have opposite meanings.

Even in such cases, the ambiguity would probably be solved by context.

What about punctuation? Again, I suspect there are very few cases where removing punctuation would create an ambiguity that cannot be solved by context. ChatGPT:

"Let's eat grandma" vs. "Let's eat, grandma." Without the comma, it could be misinterpreted as a request to eat the grandmother.

Can we go further? We could apply word stemming. This would significantly reduce our character count. How much information would we sacrifice? Try to read this snippet (full text below). 

a machin learn research write me in respons to yesterday ’ s post , say : i still think gpt - 2 is a brute - forc statist pattern matcher which blend up the internet and give you back a slight unappet slurri of it when ask . i resist the urg to answer “ yeah , well , your mom is a brute - forc statist pattern matcher which blend up the internet and give you back a slight unappet slurri of it when ask .” but i think it would have been true. (source)

 A machine learning researcher writes me in response to yesterday’s post, saying:

I still think GPT-2 is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.

I resisted the urge to answer “Yeah, well, your mom is a brute-force statistical pattern matcher which blends up the internet and gives you back a slightly unappetizing slurry of it when asked.”

But I think it would have been true.

I find that "caveman Scott" is still very much readable, especially if I know the context. 

Compressing prompts

Consider this snippet from Scott:

“But don’t humans also have genuinely original ideas?” Come on, read a fantasy book. It’s either a Tolkien clone, or it’s A Song Of Ice And Fire. Tolkien was a professor of Anglo-Saxon language and culture; no secret where he got his inspiration. A Song Of Ice And Fire is just War Of The Roses with dragons. Lannister and Stark are just Lancaster and York, the map of Westeros is just Britain (minus Scotland) with an upside down-Ireland stuck to the bottom of it – wake up, sheeple! Dullards blend Tolkien into a slurry and shape it into another Tolkien-clone. Tolkien-level artistic geniuses blend human experience, history, and the artistic corpus into a slurry and form it into an entirely new genre. Again, the difference is how finely you blend and what spices you add to the slurry.

Let's remove spaces and punctuation and apply Snowball stemming. As an additional (and perhaps excessive) optimization, let's also remove stopwords.

We get something that looks like this:

donthumanalsogenuinoriginidea?”Come,readfantasibook.eitherTolkienclone,SongIceFire.TolkienprofessorAnglo-Saxonlanguagcultur;nosecretgotinspir.SongIceFireWarRosedragon.LannistStarkLancastYork,mapWesteroBritain(minusScotland)upsiddown-irelandstuckbottom–wake,sheepl!DullardblendTolkienslurrishapeanothtolkien-clon.tolkien-levelartistgeniusblendhumanexperi,histori,artistcorpusslurriformentirnewgenr.,differfineblendspiceaddslurri.
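For reference, here is a minimal sketch of this kind of pipeline with NLTK. It is an illustration of the idea rather than the exact code behind gptrim: it assumes the NLTK stopwords and punkt data are downloaded, and its output won't match the snippet above character for character (tokenization and casing details differ).

```python
import string

from nltk.corpus import stopwords            # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize      # requires nltk.download("punkt")

STOPWORDS = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")

def compress(text, remove_stopwords=True, stem=True,
             remove_punctuation=False, remove_spaces=True):
    """Lossy prompt compression: drop stopwords, stem words, and
    optionally strip punctuation and the spaces between tokens."""
    tokens = word_tokenize(text)
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:
        tokens = [STEMMER.stem(t) for t in tokens]
    if remove_punctuation:
        # drop tokens made up entirely of punctuation characters
        tokens = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
    separator = "" if remove_spaces else " "
    return separator.join(tokens)
```

Each transformation is a toggle, so the same function also covers milder variants: stopword removal only, or dropping punctuation as well.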

Let's feed it into GPT-4 with the following prompt:

This is an instance of compressed text. Rewrite it so that it has perfect grammar and is understandable by a human. Try to interpret it as faithfully as possible. Do not paraphrase or add anything to the text.
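(If you'd rather script this than paste it into the chat window, the round trip can be sketched with the openai Python package. This assumes the v1-style client and an OPENAI_API_KEY in the environment; older versions of the package expose openai.ChatCompletion.create instead.)

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = (
    "This is an instance of compressed text. Rewrite it so that it has "
    "perfect grammar and is understandable by a human. Try to interpret it "
    "as faithfully as possible. Do not paraphrase or add anything to the text."
)

def expand(compressed_text: str) -> str:
    """Ask GPT-4 to reconstruct readable text from a compressed prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"{INSTRUCTION}\n\n{compressed_text}"}],
    )
    return response.choices[0].message.content
```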

We get back this (I have bolded two sentences where I felt some meaning was lost):

Do humans also have a genuine origin for ideas? Come and read fantasy books, either by Tolkien or those in the Song of Ice and Fire series. Tolkien was a professor of Anglo-Saxon language and culture, so it's no secret where he got his inspiration from. The Song of Ice and Fire series is based on the War of the Roses, with dragons added. The Lannisters and the Starks are similar to the Lancasters and the Yorks, and the map of Westeros resembles Britain (minus Scotland) upside down, with Ireland stuck to the bottom – wake up, people! A dullard might blend Tolkien's ideas into a slurry and shape it into another Tolkien clone. However, a Tolkien-level artist and genius would blend human experience, history, and an artistic corpus into a slurry to create an entirely new genre. It is the difference between a fine blend of spices and a simple slurry.

Overall, I think it's a pretty good rendition. But how much did we save? The original text has 187 tokens and 790 characters. The compressed text has 160 tokens and 428 characters. That's a 14% saving on tokens and a 46% saving on character count.
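(If you want to check token and character savings like these yourself, OpenAI's tiktoken package makes it easy. A minimal sketch, assuming tiktoken is installed:)

```python
import tiktoken

def savings(original: str, compressed: str, model: str = "gpt-4") -> dict:
    """Report token and character counts before and after compression."""
    enc = tiktoken.encoding_for_model(model)
    tok_before, tok_after = len(enc.encode(original)), len(enc.encode(compressed))
    return {
        "tokens": (tok_before, tok_after, 1 - tok_after / tok_before),
        "characters": (len(original), len(compressed), 1 - len(compressed) / len(original)),
    }
```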

In practice, this approach seems more useful for working around ChatGPT's and Bing's message character limits than for reducing token counts. But even a 14% saving is significant if you are using GPT in production.

Conclusion

A few weeks ago Twitter discovered Shoggoth tongue, a technique for compressing GPT prompts by... asking GPT to do it. Shoggoth tongue may save a lot of tokens, but you also have to spend tokens to compress the text, so it only makes sense for crafting reusable prompts. I don't believe it applies to the majority of use cases, i.e. people solving mundane problems with ChatGPT. 

Applying the standard NLP preprocessing routine to GPT prompts is a quick and free way to reduce character count and save some tokens while seemingly preserving most information. 

I haven't had time to experiment extensively with prompt compression. I also did not try to quantify these results. I look forward to seeing new ideas in this space.

7 comments

Comments sorted by top scores.

comment by Emily Thomas (emily-thomas) · 2023-04-22T11:15:21.383Z · LW(p) · GW(p)

Good idea! Let's make it better!
The main thing that comes to mind: a lot of tokens already come with a space at the start of the word. Would removing the spaces make things worse?

I put 'ButifIwrotewithoutspacesyouwouldprobablystillunderstandme' into OpenAI's tokenizer.
It has 17 tokens.
I added the spaces back in to get 'But if I wrote without spaces you would probably still under stand me'.
It has 13.
Okay, so taking spaces out made it longer instead of shorter, but it's also a short bit of text and could be a fluke. What about your Tolkien example?

Your original Tolkien text has 187 tokens. Your compressed version without spaces has 160.
If you add spaces back in it only has 132 tokens (and 508 characters).

This is what I used:
"dont human also genuin origin idea? ”Come, read fantasi book. either Tolkien clone, Song Ice Fire. Tolkien professor Anglo-Saxon languag cultur; no secret got inspir. Song Ice Fire War Rose dragon. Lannist Stark Lancast York, map Westero Britain (minus Scotland) upsid down - ireland stuck bottom – wake, sheepl! Dullard blend Tolkien slurri shape anoth tolkien - clon. tolkien - level artist genius blend human experi, histori, artist corpus slurri form entir new genr. , differ fine blend spice add slurri."

Replies from: vlad.proex
comment by vlad.proex · 2023-04-23T14:32:28.559Z · LW(p) · GW(p)

This is what I was hoping for when I wrote this post. Thank you for your insight. 

New position: sometimes when using ChatGPT, you only care about the number of characters, because of the character limit in the chat message. In that case, you want to get rid of spaces. But if you want to save on tokens, you probably should keep spaces. I think the solution is: a) allow the user to choose the mix of transformations for their use case; b) show them how much they are saving in characters and tokens so they can optimize for their use case. 

Replies from: emily-thomas
comment by Emily Thomas (emily-thomas) · 2023-04-24T11:19:53.497Z · LW(p) · GW(p)

Oh, if we're only optimizing for tokens we can get the Tolkien example down from 187 to 110.

Word stemming adds tokens (while reducing characters). If we only care about tokens then removing the stopwords was doing all the work.

If we only remove the stopwords and nothing else we get:
don't humans also genuinely original ideas? Come, read fantasy book. either Tolkien clone, Song Ice Fire. Tolkien professor Anglo-Saxon language culture; no secret got inspiration. Song Ice Fire War Roses dragons. Lannister Stark Lancaster York, map Westeros Britain (minus Scotland) upside down-Ireland stuck bottom - wake, sheeple! Dullards blend Tolkien slurry shape another Tolkien-clone. Tolkien-level artistic geniuses blend human experience, history, artistic corpus slurry form entirely new genre. , difference finely blend spices add slurry

Which GPT-4 outputs to:
Don't humans also have genuinely original ideas? Come, let's read a fantasy book. It could either be a Tolkien clone or a Song of Ice and Fire. Tolkien was a professor of Anglo-Saxon language and culture; it's no secret where he got his inspiration. Song of Ice and Fire is based on the War of the Roses with dragons. Lannister and Stark are analogous to Lancaster and York, and the map of Westeros resembles Britain (minus Scotland) upside down, with Ireland stuck at the bottom - wake up, sheeple! Dullards merely blend Tolkien's ideas into a slurry to shape another Tolkien clone. However, Tolkien-level artistic geniuses blend human experience, history, and artistic corpus into a slurry to form an entirely new genre. It's like the difference between finely blending spices and merely adding them to a slurry.


This can probably be refined further.

Replies from: emily-thomas
comment by Emily Thomas (emily-thomas) · 2023-04-24T12:05:56.154Z · LW(p) · GW(p)

This can probably be refined further.


Which I then went and did, maybe.
You can actually get it down from 187 tokens to only 87, by also removing all the punctuation.

This gave only a slightly higher loss in accuracy than other examples, and seemed to preserve the main information. Could be an optional extra.

Compressed version:
don't humans also genuinely original ideas Come read fantasy book either Tolkien clone Song Ice Fire Tolkien professor Anglo Saxon language culture no secret got inspiration Song Ice Fire War Roses dragons Lannister Stark Lancaster York map Westeros Britain minus Scotland upside down Ireland stuck bottom wake sheeple Dullards blend Tolkien slurry shape another Tolkien clone Tolkien level artistic geniuses blend human experience history artistic corpus slurry form entirely new genre difference finely blend spices add slurry

Which GPT-4 outputs to:
Don't humans also generate genuinely original ideas? When you read a fantasy book, it is either a Tolkien clone or a Song of Ice and Fire (Game of Thrones) adaptation. Tolkien, a professor of Anglo-Saxon language and culture, made no secret of where he got his inspiration. The Song of Ice and Fire series is based on the War of the Roses, with dragons and houses like Lannister and Stark representing Lancaster and York, respectively. The map of Westeros is essentially Britain without Scotland, flipped upside down, and Ireland attached at the bottom. In this case, it seems that unimaginative people merely blend Tolkien's work and create another Tolkien clone.
However, Tolkien-level artistic geniuses are able to blend human experiences, history, and their artistic corpus into a slurry to form an entirely new genre. The difference is like finely blending spices to add to the slurry, creating a unique and flavorful dish.

Replies from: vlad.proex
comment by vlad.proex · 2023-04-25T04:07:44.982Z · LW(p) · GW(p)

Nice! Last weekend I expanded https://www.gptrim.com/ to allow the user to a) see savings in both characters and tokens; b) determine their own combination of word processing functions. Then I saw, like you said, that to save tokens you only want to remove stopwords. I will next add the option to remove punctuation. I also want to give users two general recipes: optimize for saving tokens vs. optimize for saving characters. Always happy to take more ideas.

I will probably write again on this, on my new personal Substack or other websites, reporting what I've learned. Would you like me to cite you and link to your profile? My DMs are open! 

P.S.: Because I speedran the coding, the website currently has an issue where it sometimes adds spaces to the text. I am aware of this and will fix it by next weekend at the latest. The Python package that runs all this is accepting PRs: https://github.com/vlad-ds/gptrim.

comment by ChristianKl · 2023-04-22T12:28:02.986Z · LW(p) · GW(p)

Applying the standard NLP preprocessing routine to GPT prompts is a quick and free way to reduce character count and save some tokens while seemingly preserving most information. 

I doubt that this is free. OpenAI has its own tokenizer. You are basically saying that, without understanding the tradeoffs the current tokenizer makes, adding another tokenizer in front of it will improve performance for free.

Replies from: vlad.proex
comment by vlad.proex · 2023-04-23T14:34:13.779Z · LW(p) · GW(p)

I see your point. I think the existing tokenizer is designed to keep all parts of text, while the idea here is to sacrifice some information in favor of compression. But writing this, I also realized that this approach is more effective at saving characters than tokens.