
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length


This work was done in collaboration with Swiss Federal Institute of Technology Lausanne (EPFL).

Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image’s inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256×256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID < 2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine “visual vocabulary,” and that the number of tokens to generate depends on the complexity of the generation task.

*Equal contribution.
† Jointly affiliated with Apple and Swiss Federal Institute of Technology Lausanne (EPFL).
‡ Swiss Federal Institute of Technology Lausanne (EPFL).
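
To make the nested-dropout mechanism concrete, here is a minimal PyTorch sketch of how a random prefix of a 1D token sequence might be kept during training. The function name, tensor shapes, and uniform sampling of the keep-length are illustrative assumptions, not the paper's actual implementation, which may sample lengths differently:

```python
import torch

def nested_dropout(tokens: torch.Tensor) -> torch.Tensor:
    """Keep a random prefix of each 1D token sequence, zeroing the rest.

    tokens: (batch, seq_len, dim) latent tokens from the encoder.
    A keep-length k is sampled per sample; positions >= k are masked out.
    Training the decoder on such prefixes encourages a coarse-to-fine
    ordering of information along the sequence.
    """
    batch, seq_len, _ = tokens.shape
    # Sample a keep-length k in [1, seq_len] independently for each sample.
    keep = torch.randint(1, seq_len + 1, (batch, 1), device=tokens.device)
    positions = torch.arange(seq_len, device=tokens.device).unsqueeze(0)  # (1, seq_len)
    mask = (positions < keep).unsqueeze(-1).to(tokens.dtype)              # (batch, seq_len, 1)
    return tokens * mask

# Example: 256 latent tokens of dimension 16 for a batch of 4 images.
tokens = torch.randn(4, 256, 16)
truncated = nested_dropout(tokens)
```

Because the decoder must produce a plausible reconstruction from any such prefix, the earliest tokens are pushed to carry the coarsest, most global information about the image, which is exactly the coarse-to-fine behavior the abstract describes.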

