Meta AI recently published a preprint demonstrating a new "Megabyte" framework for building generative pre-trained transformer (GPT) systems. Described as "promising" by OpenAI's Andrej Karpathy, a former head of artificial intelligence at Tesla, the new architecture is designed to process large volumes of data, such as images, novels and video files, without using a process called tokenization.
Tokenization is a lossy process comparable to file compression. To handle large amounts of data, GPT models convert bytes into tokens. The tokens are then processed by the transformer and used to generate output tokens, which are then decoded back into text. Tokenization allows AI systems to represent longer strings of data as sequences of numbers. For example, if the sentence "My favorite color is red" is processed by OpenAI's ChatGPT, it is converted into the token IDs "3666, 4004, 3124, 318, 2266, 13" for processing. Unfortunately, even with tokenization, there are still hard limits on the amount of data that current state-of-the-art systems can process. For GPT-3.5, the limit is a little over 4,000 tokens, or about 3,000 words, while the maximum for GPT-4 is about 32,000 tokens, or about 24,000 words.
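The round trip from text to token IDs and back can be illustrated with a toy lookup table. This is only a sketch: real GPT tokenizers learn tens of thousands of subword pieces from data, while the table `TOY_VOCAB` and the functions `encode` and `decode` below are hypothetical. The IDs are the ones quoted in the example above; which piece of text each ID stands for, including a closing period for the sixth ID, is an assumption made for illustration.

```python
# Toy vocabulary: text pieces mapped to the IDs quoted in the article.
# The piece-to-ID pairing is an assumption for illustration only.
TOY_VOCAB = {
    "My": 3666,
    " favorite": 4004,
    " color": 3124,
    " is": 318,
    " red": 2266,
    ".": 13,
}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def encode(text):
    """Greedily match the longest known piece at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        candidates = [p for p in TOY_VOCAB if text.startswith(p, pos)]
        if not candidates:
            raise ValueError(f"no known piece at position {pos}")
        piece = max(candidates, key=len)
        ids.append(TOY_VOCAB[piece])
        pos += len(piece)
    return ids


def decode(ids):
    """Map each ID back to its piece and join the pieces."""
    return "".join(ID_TO_PIECE[i] for i in ids)


sentence = "My favorite color is red."
token_ids = encode(sentence)
print(token_ids)                       # six token IDs
print(len(sentence.encode("utf-8")))   # 25 raw bytes for the same text
print(decode(token_ids) == sentence)   # the round trip restores the text
```

In this toy case six numbers stand in for 25 bytes, which is why tokenization lets a model fit more text into a fixed-length input.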
Meta's new Megabyte system forgoes tokenization in favor of a novel multi-layer predictive architecture capable of end-to-end modeling of sequences of more than 1 million bytes.
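The preprint describes this multi-layer design in terms of fixed-size patches of bytes: one model works across patches while a smaller one predicts the bytes within each patch. The following is only a minimal sketch of the patching step, not Meta's implementation; the function name `split_into_patches` and the patch size of 8 are hypothetical choices for illustration.

```python
def split_into_patches(data, patch_size):
    """Cut a byte string into consecutive chunks of at most patch_size bytes."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]


raw = "My favorite color is red.".encode("utf-8")
patches = split_into_patches(raw, patch_size=8)
for index, patch in enumerate(patches):
    print(index, patch)
```

Working on a shorter sequence of patches instead of every individual byte at once is what keeps very long inputs tractable in this kind of design.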
Most standard English-language text encodings use 8 bits per character, so each character occupies one byte of data. Thus, an AI system that can process 1 million bytes of data without tokenization can process a text document containing 750,000 words, a 3,025% increase over GPT-4. By comparison, GPT-4 can currently handle about 10 long-form news articles in a single prompt, while Megabyte would be able to parse the entirety of Leo Tolstoy's War and Peace plus two other novels of moderate length. Meta's Megabyte model also performed well on ImageNet tests and on benchmarks for processing audio files, equaling or exceeding existing byte-based transformer models such as DeepMind's Perceiver AR on both:
"Megabyte matches the state-of-the-art performance of PerceiverAR while using half the computation."
The implications of this research could be far-reaching. Tokenization is considered an obstacle in this field because of its hard data limits and the energy and time required to train systems. Without tokenization, it should be possible to train AI models with stronger foundational support for languages other than English, especially those that cannot easily be encoded in standard 8-bit characters. This could lead to further democratization of these technologies and enable everything from cryptocurrency trading bots to decentralized autonomous organization technology to be built with code in local languages anywhere in the world.
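The point about 8-bit characters can be checked directly: in UTF-8, basic Latin letters take one byte each, while characters from many other scripts take two or three. The sketch below uses sample words chosen purely for illustration, and the helper `utf8_profile` is a hypothetical name.

```python
def utf8_profile(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


# Sample words meaning "red"; the selection is illustrative only.
samples = {
    "English": "red",
    "Russian": "красная",
    "Japanese": "赤い",
}

for language, word in samples.items():
    chars, nbytes = utf8_profile(word)
    print(f"{language}: {chars} characters -> {nbytes} bytes")
```

A byte-level model sees all three words as plain byte sequences, with no language-specific vocabulary needed to represent them.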
Removing tokenization could also improve the ability of models like ChatGPT to work with image, video and audio files, allowing them to generate multimedia clips with roughly the same time and energy consumption as text.