The rapid growth in the training data consumed by Large Language Models (LLMs), together with their remarkable capabilities, has driven major advances in language understanding and generation. The efficiency of LLM training has therefore become a central concern, since scaling up sharply increases computing costs. Despite ongoing work on efficient LLMs, lowering training costs without degrading model performance remains very difficult. The standard way to train LLMs is next token prediction, i.e., predicting the next token in a sequence. Although this method has been highly effective, it may be an inefficient way for LLMs to learn, since the full model must process every token individually. The proposed patch-level training method offers a potential answer to this challenge, promising to reduce training costs and improve efficiency without compromising model performance.
To make LLM training more efficient, researchers from the Pattern Recognition Center, WeChat AI, Tencent Inc. introduce patch-level training in their study. The core idea is to compress multiple tokens into a single patch, thereby shortening the sequence. The approach is grounded in transfer learning: it reduces training costs by transferring knowledge from a model that is cheaper to train (patch level) to one that is more expensive to train (token level). Unlike other efforts that rely on patch-level functionality, the proposed method does not require the final model to operate at the patch level; instead, it uses patches only as a means of acquiring information efficiently during training.
The proposed approach has two stages: patch-level training followed by token-level training. During patch-level training, the language model is trained to predict the next patch from shorter sequences of patches, which allows the bulk of the training data to be processed at a significantly lower computing cost. The token-level model is then initialized with the parameters obtained from patch-level training and continues training on the remaining data, carrying the knowledge learned at the patch level over to the token level. In essence, patch-level training predicts groups of tokens, or 'patches', while token-level training predicts individual tokens.
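To picture how tokens are grouped into patches, here is a minimal, illustrative PyTorch sketch (not the authors' code) that forms each patch embedding by averaging every K consecutive token embeddings. The patch size K = 4, the averaging scheme, and the class name `PatchEmbedder` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

K = 4  # assumed patch size: number of consecutive tokens per patch

class PatchEmbedder(nn.Module):
    """Turn a token sequence into a K-times-shorter patch sequence by
    averaging the embeddings of every K consecutive tokens (illustrative)."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len), with seq_len assumed divisible by K
        emb = self.embed(token_ids)                   # (batch, seq_len, d_model)
        b, t, d = emb.shape
        return emb.view(b, t // K, K, d).mean(dim=2)  # (batch, seq_len // K, d_model)

# Example: a 2048-token sequence becomes a 512-patch sequence,
# so the transformer backbone processes 4x fewer positions per step.
patches = PatchEmbedder(vocab_size=32000, d_model=768)(torch.randint(0, 32000, (1, 2048)))
print(patches.shape)  # torch.Size([1, 512, 768])
```

Because the same backbone later consumes ordinary token sequences, the weights it learns on patches can simply be copied over to initialize token-level training.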
The new method increases training efficiency by predicting all tokens in the upcoming patch at the same time. The key difference from prior multi-token prediction work is that the researchers use multi-token prediction to shorten the sequence length during training, and they do so with a single output head, adding no further parameters to the model.
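As a rough illustration of how a single, unmodified LM head can supervise every token of the next patch, the sketch below computes a possible patch-level loss. The tensor layout, the patch size K, and the way the K targets are averaged into one cross-entropy term are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def patch_level_loss(hidden: torch.Tensor, lm_head: torch.nn.Linear,
                     token_ids: torch.Tensor, K: int = 4) -> torch.Tensor:
    """Illustrative patch-level objective: each patch position predicts all K
    tokens of the *next* patch with the ordinary LM head, reused as-is.

    hidden:    (batch, num_patches, d_model) transformer outputs over patches
    lm_head:   nn.Linear(d_model, vocab_size), the standard output head
    token_ids: (batch, num_patches * K) token labels, grouped into patches
    """
    b, p, d = hidden.shape
    logits = lm_head(hidden[:, :-1])                    # (b, p-1, vocab): predict from all but last patch
    targets = token_ids.view(b, p, K)[:, 1:]            # (b, p-1, K): tokens of each next patch
    # The same logits supervise all K tokens of the next patch (no extra head).
    logits = logits.unsqueeze(2).expand(-1, -1, K, -1)  # (b, p-1, K, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```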
Because it requires neither tailored model architectures nor meticulously designed model mapping techniques, patch-level training is more adaptable and broadly applicable than model growth. It is also orthogonal to model growth, which suggests the two techniques could be combined.
After initialization from patch-level training, the model's loss drops quickly as it continues token-level training on the remaining data. It ultimately reaches an even lower loss while cutting training expenditure roughly in half compared with training from scratch. Even higher acceleration rates can be achieved by adjusting the hyperparameter settings, at a minor cost in model performance.
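A quick back-of-the-envelope calculation shows where a halved cost can come from. With a patch size K and a fraction λ of the training data consumed at the patch level, the relative compute is roughly λ/K + (1 − λ); the specific values below are illustrative assumptions, not confirmed settings from the paper.

```python
K = 4          # assumed patch size: each patch-level step covers 4 tokens
lambda_ = 2/3  # assumed fraction of training data consumed at the patch level

# Patch-level passes touch K times fewer positions, so they cost roughly 1/K as much.
relative_cost = lambda_ / K + (1 - lambda_)
print(f"Relative training cost vs. training from scratch: {relative_cost:.2f}")  # 0.50
```

Raising λ or K pushes the relative cost lower, which is the hyperparameter trade-off against model performance mentioned above.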
According to the experimental results, this technique can reduce LLM training costs by 50% while keeping performance comparable. This is only the beginning of the exploration of patch-level training: deriving an empirical scaling rule and testing the method's scalability on larger models and datasets could further improve the approach and offer even greater benefits.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies across the financial, cards & payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.