Literature Review of Text AutoEncoders

post by NickyP (Nicky) · 2025-02-19

Contents

    Introduction
    Architecture & General Training Strategy for Single-Vector Text Autoencoders
  Some Main Papers
    Smaller-scale experiments:
      1) AUTOBOT: Sentence Bottleneck AutoEncoders from Transformer Language Models (Montero et al., 2021)
      2) Vec2Text: Text Embeddings Reveal (Almost) As Much As Text (Morris et al. 2023)
      3) Semantic Overlap Summarization using Sentence Autoencoders (Bansal et al., 2023)
      4) Contra Bottleneck T5 (Lee, 2023)
    Larger-scale Experiments
      5) Vec2Text with Round-Trip Translations (Cideron et al. 2022) (Google Brain)
      6) SONAR: Sentence-Level Multimodal and Language-Agnostic Representations (Duquenne et al., 2023) (Meta)
    Other work I did not read deeply.
      Bag-of-Vectors Autoencoders (Mai et al., 2021)
      SemFormers: Language Models with Semantic Planning (Yin et al., 2024)
      List of other papers
    Brief Example of Use
  Conclusion

This is a brief literature review of Text AutoEncoders, as I used them in a recent project and did not find a good resource covering them.

TL;DR: There exist models that take some text -> encode it into a single vector -> decode back into approximately the same text. Meta's SONAR models seem to be the best at the moment for this.

Introduction

Text AutoEncoders are a simple approach: you train a model to encode an entire input sequence (e.g. a sentence) into a latent representation, and decode that representation to reconstruct the original text. 

Most of the literature focuses on "text-embed models" or "sentence transformers", which embed a text into a single vector for things like categorization and retrieval-augmented generation. These have their uses and take most of the attention, but they are typically encoder-only models trained for "high similarity between similar texts".

Instead, I try to focus specifically on models which have a decoder. In particular, these compress the entire input into one fixed-size vector (the “bottleneck”). This single-vector representation can still be used for comparison and clustering, but the main benefit is that the decoder can reconstruct (approximately) the original text from it.

Below, I first recap the architecture and training for single-vector text auto-encoders, then highlight the main papers that use them. I then briefly mention some related works that do not rely on a single-vector bottleneck. I also give a short example of a couple of paragraphs encoded and then decoded by these models.


Architecture & General Training Strategy for Single-Vector Text Autoencoders

Here I focus on single-vector Text AutoEncoders, also sometimes called Text Bottleneck AutoEncoders, Sentence Bottleneck AutoEncoders, or just AutoEncoders.

[Figure: Illustration of a typical "Text AutoEncoder" model. The cross-entropy loss on the reconstructed output is the "AutoEncoder loss"; a de-noising AutoEncoder loss additionally masks/randomizes some of the input tokens for more robustness.]
  1. Encoder
    • Typically either a recurrent network (RNN, LSTM) or a Transformer encoder.
    • The entire input sequence is processed into a single hidden or pooled representation, e.g. by taking the final hidden state, or pooling (like [CLS] approach, or average pooling).
    • If you’re using a large pretrained model (e.g. T5, RoBERTa), you might freeze it or partially fine-tune it.
  2. Bottleneck
    • This is literally the single vector. Sometimes it’s the last hidden state. Sometimes a small linear or MLP projection is used to reduce or fix dimension. Typically in the 256–1024 dimension range.
      • There are variations where this can be multiple vectors.
  3. Decoder
    • Autoregressive or seq2seq generation. RNN or Transformer.
    • Takes the single vector and conditions on it at each decoding step. Usually the cross-attention memory is just this single embedding (sometimes repeated), or the vector is used as the decoder's initial hidden state.
  4. Training Loss
    • Negative log-likelihood (cross-entropy) to reconstruct the original text from that single vector. Sometimes a denoising approach or specialized corruption is added; sometimes there are additional objectives, such as an MSE loss between embeddings of the same sentence in different languages, amongst other things. (A minimal sketch of the overall recipe follows this list.)
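
To make the recipe concrete, here is a minimal PyTorch sketch of a single-vector text autoencoder. It is not taken from any of the papers below: the layer sizes, pooling choice, and class names are all illustrative, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class SingleVectorAutoencoder(nn.Module):
    """Minimal sketch of a single-vector text autoencoder (illustrative, not from any paper)."""

    def __init__(self, vocab_size=32000, d_model=512, bottleneck_dim=1024, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.to_bottleneck = nn.Linear(d_model, bottleneck_dim)    # pooled state -> bottleneck
        self.from_bottleneck = nn.Linear(bottleneck_dim, d_model)  # bottleneck -> decoder width
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode(self, token_ids):
        # (Positional embeddings omitted for brevity.)
        hidden = self.encoder(self.embed(token_ids))     # [batch, seq, d_model]
        pooled = hidden.mean(dim=1)                      # mean-pool into one vector
        return self.to_bottleneck(pooled)                # [batch, bottleneck_dim]

    def decode_logits(self, z, decoder_input_ids):
        # The decoder cross-attends to a "memory" containing only the single vector.
        memory = self.from_bottleneck(z).unsqueeze(1)    # [batch, 1, d_model]
        tgt = self.embed(decoder_input_ids)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)                      # [batch, seq, vocab_size]

    def forward(self, token_ids):
        # Plain autoencoder loss: cross-entropy reconstruction of the input tokens,
        # with teacher forcing (predict token t from tokens < t and the bottleneck vector).
        z = self.encode(token_ids)
        logits = self.decode_logits(z, token_ids[:, :-1])
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), token_ids[:, 1:].reshape(-1)
        )
```

For a de-noising variant, one would corrupt `token_ids` (e.g. mask or shuffle some tokens) before calling `encode`, while keeping the uncorrupted tokens as the reconstruction target.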

Below are major examples.


Some Main Papers

Here are some of the main papers that implement text auto-encoders. The first four are smaller-scale experiments, and the latter two are larger-scale ones. The largest training run, and the longest context length, is SONAR's: roughly 100B tokens, with context sizes of up to 512 tokens.

 

Smaller-scale experiments:

1) AUTOBOT: Sentence Bottleneck AutoEncoders from Transformer Language Models (Montero et al., 2021)

Aim: I think this was the first main transformer-based text auto-encoder. It seems to work well as a semantic-embedding model compared to other unsupervised approaches at the time.

Architecture Encoder: Frozen RoBERTa (base: 125M params, large also tested). A learned multi-head attention bottleneck pools final hidden states into a single 768-dim sentence vector.

Architecture Decoder: Single-layer Transformer decoder (hidden size 768, multi-head attention). Uses cross-attention on the single bottleneck vector, and adds an additional gating mechanism before the output projection.
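
As an illustration of the attention-pooling bottleneck, here is a small sketch of pooling per-token encoder states into one sentence vector. This is my own sketch, not the paper's exact parameterization: the single learned query and the head count are assumptions (768 matches the stated sentence-vector dimension).

```python
import torch
import torch.nn as nn

class AttentionPoolBottleneck(nn.Module):
    """Sketch of attention-pooling per-token encoder states into one sentence vector."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # one learned query vector
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden_states):                 # [batch, seq, d_model] from the encoder
        query = self.query.expand(hidden_states.size(0), -1, -1)
        pooled, _ = self.attn(query, hidden_states, hidden_states)
        return pooled.squeeze(1)                      # [batch, d_model] sentence vector
```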

Training: De-noising AutoEncoding (masked token reconstruction, cross-entropy loss). Uses BooksCorpus + Wikipedia (~2.5B tokens).

Models are not available to download, but the fine-tuning code is on GitHub.

 

2) Vec2Text: Text Embeddings Reveal (Almost) As Much As Text (Morris et al. 2023)

Aim: Show that dense text embeddings from a black-box encoder can be iteratively “inverted” (like a model-inversion attack or "feature visualization") to reconstruct the original text.

It achieves good recovery from OpenAI's text-embedding-ada-002 embeddings for 32-token inputs (60% exact match, 83.4 BLEU) and OK performance up to 128 tokens (8% exact match, 55 BLEU), showing that embeddings are quite revealing.

Architectures:

Training: Standard cross-entropy language-modelling loss. They train on MS MARCO, ~350M tokens total (32- or 128-token inputs), embedded with OpenAI’s text-embedding-ada-002.

Models are available on HuggingFace, but should be accessed via their GitHub repo. Not to be confused with the other work below also called Vec2Text.
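
The iterative inversion loop is roughly as follows. This is a conceptual sketch of the method, not the authors' code: `embed`, `hypothesizer`, and `corrector` are hypothetical stand-ins for the black-box embedding API and their two trained inversion models (the real implementations are in their GitHub repo).

```python
def invert_embedding(target_emb, embed, hypothesizer, corrector, n_steps=5):
    """Conceptual sketch of Vec2Text-style iterative inversion (not the authors' code)."""
    # Step 0: an initial text guess conditioned only on the target embedding.
    text = hypothesizer.generate(target_emb)
    for _ in range(n_steps):
        # Re-embed the current guess, then ask the corrector for a new guess,
        # conditioned on (target embedding, current text, embedding of current text).
        current_emb = embed(text)
        text = corrector.generate(target_emb, text, current_emb)
    return text
```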

 

3) Semantic Overlap Summarization using Sentence Autoencoders (Bansal et al., 2023)

Aim: Generate a single sentence capturing common overlapping information from two input sentences, using a small, plug-and-play autoencoder-based model.

Architectures:

Training:

  1. Autoencoder Pre-training: Denoising autoencoding on unlabeled text, optimizing ROUGE-based reconstruction.
  2. SOS Operator: Learns to produce overlap embeddings, aided by an adversarial term ensuring outputs lie within the autoencoder’s manifold.
  3. Data: Synthetic sentence-level pairs derived from CNN/DailyMail, with ChatGPT used to generate partial overlaps. The paper does not say how many tokens.

No models are available to download.

 

4) Contra Bottleneck T5 (Lee, 2023)

Aim: A set of models released only on the HuggingFace Hub and a Google Colab, by Linus Lee (2023). Used to show how T5 can do auto-encoding with a single vector.

Architecture: They modify T5’s standard encoder–decoder by compressing the encoder’s final output states into a single vector (e.g. by average pooling or a special aggregator token). The decoder is the normal T5 decoder, but cross-attention sees just that single vector.
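
A rough sketch of the same idea using Hugging Face's T5. This is not Lee's released code: I assume mean pooling and an off-the-shelf `t5-base` checkpoint here, so without fine-tuning the decoder will not actually reconstruct the input; the sketch only shows the plumbing, and the exact `generate` call may need adjusting across transformers versions.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

# Off-the-shelf T5: Lee's checkpoints are fine-tuned for this bottleneck, a vanilla
# t5-base is not, so this only illustrates the wiring.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

text = "Text autoencoders compress a sentence into a single vector."
inputs = tokenizer(text, return_tensors="pt")

# Run the encoder, then collapse all per-token states into one vector.
with torch.no_grad():
    enc = model.encoder(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)
single_vec = enc.last_hidden_state.mean(dim=1, keepdim=True)   # [1, 1, d_model]

# Decode with cross-attention over just that single "token".
out_ids = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=single_vec),
    max_new_tokens=32,
)
print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```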

Training: Cross-entropy on a subset of Wikipedia; the number of tokens is not stated.

 

Larger-scale Experiments

5) Vec2Text with Round-Trip Translations (Cideron et al. 2022) (Google Brain)

Aim: A single-vector Transformer-based autoencoder for sentence representations, trained to reconstruct text while preserving semantics. Yields a universal single-vector AE that does well on "controllability" and “embedding geometry.”

Architecture Encoder: Various, including one with T5-base (250M params, pretrained). Mean-pooling over encoder hidden states reduces the sequence into a single d-dimensional bottleneck vector (d ∈ {16, 64, 128, …, 512}).

Architecture Decoder: Various, including T5-base decoder, receiving only the single bottleneck vector as cross-attention memory. Includes a gating mechanism to handle the compressed input.

Training: Auto-encoding objective with cross-entropy loss on a C4-derived dataset (10B tokens), plus machine-translated paraphrases. Three loss variants:

No models are available to download. Not to be confused with the other work above also called Vec2Text.

 

6) SONAR: Sentence-Level Multimodal and Language-Agnostic Representations (Duquenne et al., 2023) (Meta)

Aim: A multilingual and multi-modal sentence embedding model that encodes text and speech into a single fixed-size vector for cross-lingual and cross-modal understanding.

Architecture Encoder: 24-layer Transformer encoder (from NLLB 1B), whose per-token output states are pooled into a single fixed-size 1024-dim sentence vector.

Architecture Decoder: 24-layer Transformer decoder (also from NLLB 1B), modified to cross-attend only to the bottleneck vector rather than per-token encoder states. An additional fine-tuning stage optimizes the decoder for better reconstruction and generation quality.

Training: Trained by Meta, so it uses large-scale multilingual data from NLLB (parallel corpora, backtranslations, mined text). It takes the original NLLB-1B model and fine-tunes it on approximately 100B tokens[1], the most out of any model here, and it has the largest text window, 512 tokens. Four training objectives:

Available to download on HuggingFace Hub.
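
For reference, encoding and decoding with their library looks roughly like the following. This is a sketch from my reading of the facebookresearch/SONAR repo's README; the pipeline class names, checkpoint names, and arguments may have changed, so check the repo before relying on it.

```python
# Sketch of SONAR text -> vector -> text, based on the SONAR repo's README
# (class/checkpoint names may have changed; treat as illustrative).
from sonar.inference_pipelines.text import (
    TextToEmbeddingModelPipeline,
    EmbeddingToTextModelPipeline,
)

text_to_vec = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder"
)
vec_to_text = EmbeddingToTextModelPipeline(
    decoder="text_sonar_basic_decoder", tokenizer="text_sonar_basic_encoder"
)

sentences = ["Text autoencoders compress a sentence into a single vector."]
embeddings = text_to_vec.predict(sentences, source_lang="eng_Latn")  # shape [1, 1024]
reconstructed = vec_to_text.predict(embeddings, target_lang="eng_Latn", max_seq_len=512)
print(reconstructed)
```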


Other work I did not read deeply.

Some other sources that are related:

Bag-of-Vectors Autoencoders (Mai et al., 2021)

SemFormers: Language Models with Semantic Planning (Yin et al., 2024)

 

List of other papers

Below are even briefer mentions of some papers that I did not read deeply:[2]

 

Brief Example of Use

SONAR is a model from August 2023, trained as a semantic text auto-encoder, converting text into semantic embed vectors, which can later be then decoded back into text. Additionally, the model is trained such that the semantic embed vectors are to some degree "universal" for different languages, and one can embed in French and decode in English.

I tried it, and SONAR seems to work surprisingly well. For example, the above paragraph and this paragraphs, if each are encoded into two 1024 dimensional vectors (one for each paragraph), the model returns the following decoded outputs:

SONAR is a model from August 2023, which is trained as a semantic text auto-encoder, converting text vectors into semantic embedded vectors, which can then be decoded back to text. In addition, the model is trained in such a way that semantic embedded vectors are somehow "universal" for different languages, and can be decoded into French and encoded into English.

 

I tried it, and it seems SONAR works surprisingly well. For example, the above paragraph and these paragraphs, if each is encoded in two 1024 dimensional vectors (one for each paragraph), the model returns the following decoded outputs.

Here is an example of the same paragraph encoded then decoded by all three models: SONAR, T5 Bottleneck, and Text2Vec respectively. The first two use dimension 1024, while the third uses dimension 1536.

SONAR:

Here is an example of the same paragraph coded and then decoded by all three models: SONAR, T5 bottleneck, and Text2Vec respectively. The first two use dimension 1024, while the third uses dimension 1536.

T5 Bottleneck:

This is an example of a decoded paragraph followed by the same three characters. Then the encoders: SONAR2, TomatoBox, and Text4Loaded use each dimension: the first uses 1025 pixels, while the second uses dimension 536.

Text2Vec (the "reveal as much as text" one, zero steps):

Here is an example of the same three models implemented in the first paragraph: the first one encoded to 1024 bytes, the second to 1532 bytes, and the third to 1536 bytes. Note that the dimension is the same for TENSEX, SON and VEX.

Most of these would be OK for my purposes, but SONAR seems to be the best.

Conclusion

Overall, there is definitely a range of papers implementing various single-vector Text AutoEncoders, but the strongest at the moment seems to be Meta's SONAR model, which also has the largest training run. It may be worth testing the others more thoroughly, but for my purposes the SONAR AutoEncoder model seems fine.

It seems plausible that RNN or LSTM approaches might also work relatively well, since the training objective is well suited to them, but I have not read that literature deeply. There are likely some useful papers that I missed, but if you plan to use text auto-encoders I hope this reference is useful.

  1. ^

    In the SONAR paper, they say "We trained our encoder-decoder model for 100k updates with same learning rate and batch size as NLLB training". I checked the NLLB paper, and I think they use a batch size of about 1M tokens, though they have many numbers floating around. Thus 100k updates × 1M tokens ≈ 100B tokens.

  2. ^

    I avoided going too deep on non-transformer-based text auto-encoders. This is not particularly principled; it might make perfect sense to have RNN- or LSTM-based text auto-encoders, but I did not have the time.
