Meta's new Megabyte system solves one of GPT's biggest hurdles

Meta AI recently published preprint research demonstrating a new "Megabyte" framework for building generative pre-trained transformer (GPT) systems. Described as "promising" by OpenAI's Andrej Karpathy, a former head of artificial intelligence at Tesla, the new architecture is designed to process large volumes of data, such as images, novels and video files, without a process known as tokenization.

Tokenization is a lossy process comparable to file compression. To handle large amounts of data, GPT models convert bytes into tokens. The tokens are then processed by the transformer and used to generate output tokens, which are then decoded. Tokenization allows an AI system to process larger strings of data as numbers. For example, if the sentence "My favorite color is red" is processed by OpenAI's ChatGPT, it is converted into the token string "3666, 4004, 3124, 318, 2266, 13" for processing.

Unfortunately, even with tokenization, there are still hard limits on the amount of data that current state-of-the-art systems can process. For GPT-3.5, the limit is a little over 4,000 tokens, or about 3,000 words, while the maximum for GPT-4 is about 32,000 tokens, or about 24,000 words.

Meta's new Megabyte system forgoes tokenization in favor of a novel multi-layer predictive architecture capable of end-to-end modeling over 1 million bytes of data.
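To make the contrast concrete, the short Python sketch below compares the token string quoted above with the raw bytes a byte-level model would see for the same sentence. It is only an illustration of the input representation, not of Megabyte itself. The token IDs are copied from the example above, and the trailing period in the sentence is an assumption made here to account for the sixth ID.

```python
# Token IDs quoted in the article for the example sentence (copied verbatim).
article_token_ids = [3666, 4004, 3124, 318, 2266, 13]

# Assumption: the sixth ID stands for a trailing period, so one is included here.
sentence = "My favorite color is red."

# A byte-level model works on the raw UTF-8 bytes, each an integer from 0 to 255.
byte_values = list(sentence.encode("utf-8"))

print(len(article_token_ids))  # 6 tokens
print(len(byte_values))        # 25 bytes
print(byte_values[:2])         # [77, 121], the bytes for "M" and "y"

# Decoding the bytes recovers the original text exactly.
round_trip = bytes(byte_values).decode("utf-8")
print(round_trip == sentence)  # True
```

The byte sequence is roughly four times longer than the token sequence for this sentence, which is why byte-level models need to handle much longer inputs to cover the same text.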

Most standard English-language text encodings are 8-bit, so each character occupies one byte of data. Thus, an AI system that can process 1 million bytes of data without tokenization could work with a text document containing 750,000 words, a 3,025% increase over GPT-4. By comparison, GPT-4 can currently handle about 10 long-form news articles in a single prompt, while Megabyte would be able to parse the entirety of Leo Tolstoy's War and Peace, as well as two other novels of moderate length.

Meta's Megabyte model also performs well on ImageNet tests and on benchmarks related to processing audio files, equaling or exceeding existing byte-based transformer models such as DeepMind's Perceiver AR in both.
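The 3,025% figure given for the context-size comparison can be reproduced with a few lines of arithmetic. The sketch below takes the article's own word counts at face value (they are not independently verified here); the variable names are illustrative.

```python
# Figures as stated in the article (not independently verified).
gpt4_words = 24_000        # approximate GPT-4 limit, in words
megabyte_words = 750_000   # word count the article attributes to 1 million bytes

# Percentage increase = (new - old) / old * 100
increase_pct = (megabyte_words - gpt4_words) / gpt4_words * 100
ratio = megabyte_words / gpt4_words

print(f"{ratio:.2f}x")         # 31.25x
print(f"{increase_pct:.0f}%")  # 3025%
```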

Per the preprint, "Megabyte matches the state-of-the-art performance of PerceiverAR while using half the computation."

The implications of this research could be far-reaching. Tokenization is considered an obstacle in the field because of its hard data limits and the energy and time required to train systems. Without tokenization, it should be possible to train AI models with stronger underlying support for languages other than English, especially those that cannot easily be encoded in standard 8-bit characters. This could lead to further democratization of these technologies and enable everything from cryptocurrency trading bots to decentralized autonomous organization technology to be built in native-language code anywhere in the world.
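The point about languages that do not fit in 8-bit characters can be seen directly in UTF-8, where many non-Latin scripts need two or three bytes per character. The sketch below is a minimal illustration; the sample words (each meaning "red") and the helper name `utf8_byte_count` are chosen here for the example and do not come from the research.

```python
def utf8_byte_count(text: str) -> int:
    """Return how many bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# Hypothetical sample words, one per language, each meaning "red".
samples = {
    "English": "red",
    "Russian": "красный",
    "Chinese": "红色",
    "Hindi": "लाल",
}

byte_counts = {}
for language, word in samples.items():
    byte_counts[language] = utf8_byte_count(word)
    # Print the language, the number of code points, and the number of bytes.
    print(language, len(word), byte_counts[language])
```

A model that reads bytes directly handles all four words with the same 256-value vocabulary, although the non-English words occupy more positions in the input sequence.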

Removing tokenization could also improve the ability of models like ChatGPT to work with images, video and audio files, generating multimedia clips with roughly the same time and energy consumption as text.
