In natural language processing (NLP), a central question is how well the probabilities generated by language models (LMs) align with human linguistic behavior. This alignment is often assessed by comparing LM scores with human acceptability judgments, which evaluate how natural a sentence feels. Previous studies, such as those using SLOR (Syntactic Log-Odds Ratio), have attempted to bridge this gap, but significant issues remain. SLOR subtracts a sentence's unigram log probability from its LM log probability and divides the result by sentence length, so it applies the same fixed correction for unigram frequency and sequence length to every model, which can lead to inaccuracies. A more dynamic method is needed, one that can better adapt to differences between models and the complexities of human language processing.
MORCELA: A New Linking Theory
A team of researchers from NYU and CMU proposes MORCELA (Magnitude-Optimized Regression for Controlling Effects on Linguistic Acceptability), a new linking theory that addresses these challenges. Unlike SLOR, which applies static adjustments for length and unigram frequency, MORCELA estimates the optimal level of adjustment from data, using learned parameters specific to these effects. With two such parameters (β for unigram frequency and γ for sentence length), MORCELA adjusts the LM scores, resulting in improved correlation with human judgments. This approach better accounts for how an LM's treatment of rare words and long sentences differs from human expectations. The core idea behind MORCELA is that not all language models should receive the same correction, as models differ in how well they predict linguistic acceptability.
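To make the contrast concrete, here is a minimal Python sketch of the two scoring rules. The SLOR formula is the standard one. The MORCELA-style formula is an assumption about how β and γ enter the score (β scales the unigram term, γ is added before dividing by length); the article does not spell out the exact form, so treat the function name `morcela_style` and its parameterization as illustrative. All numbers are made up.

```python
def slor(log_p_lm, log_p_unigram, length):
    """SLOR: LM log-probability minus unigram log-probability, per token."""
    return (log_p_lm - log_p_unigram) / length


def morcela_style(log_p_lm, log_p_unigram, length, beta, gamma):
    """Assumed MORCELA-style score (hypothetical parameterization).

    beta scales the unigram-frequency correction; gamma is a length-related
    offset added before dividing by sentence length.
    """
    return (log_p_lm - beta * log_p_unigram + gamma) / length


# Made-up values for one 10-token sentence.
log_p_lm = -42.0        # sentence log-probability under the LM
log_p_unigram = -70.0   # sum of unigram log-probabilities of its tokens
length = 10

print("SLOR:", slor(log_p_lm, log_p_unigram, length))
# With beta = 1 and gamma = 0 the assumed form reduces to SLOR.
print("beta=1, gamma=0:", morcela_style(log_p_lm, log_p_unigram, length, 1.0, 0.0))
# A smaller beta applies less unigram-frequency correction.
print("beta=0.5, gamma=5:", morcela_style(log_p_lm, log_p_unigram, length, 0.5, 5.0))
```

Under this reading, SLOR is the special case in which every model gets β = 1 and γ = 0, which is the "uniform correction" the article criticizes.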
Technical Overview
MORCELA works by fitting its parameters to human acceptability judgments. These parameters control how much correction is applied to LM log probabilities, making MORCELA more adaptable than predecessors such as SLOR. Specifically, the learned parameter β adjusts the impact of unigram frequency, while γ controls the correction for sentence length. This flexibility allows MORCELA to better match human acceptability ratings, especially for larger models. For example, larger models often require less adjustment for unigram frequency because they are better at predicting less common words in context.
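One way such parameters could be estimated is by ordinary least squares, sketched below. This is not necessarily the authors' procedure. It assumes the score has the form (log p_LM − β · log p_unigram + γ) / length and that human ratings are a linear function of that score; both are assumptions made for illustration. Under them, the rating is linear in three per-sentence features, and β and γ can be read off as ratios of the fitted weights. The function name `fit_beta_gamma` and the synthetic data are hypothetical.

```python
import numpy as np


def fit_beta_gamma(log_p_lm, log_p_unigram, lengths, ratings):
    """Estimate beta and gamma by least squares (illustrative sketch).

    Assumes rating ~ a * (log_p_lm - beta * log_p_unigram + gamma) / n + b,
    which is linear in the features log_p_lm/n, log_p_unigram/n and 1/n.
    """
    log_p_lm = np.asarray(log_p_lm, dtype=float)
    log_p_unigram = np.asarray(log_p_unigram, dtype=float)
    n = np.asarray(lengths, dtype=float)
    ratings = np.asarray(ratings, dtype=float)

    # Design matrix: one column per feature plus an intercept column.
    X = np.column_stack([log_p_lm / n, log_p_unigram / n, 1.0 / n, np.ones_like(n)])
    w, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    w_lm, w_uni, w_len, _intercept = w

    beta = -w_uni / w_lm   # weight on the unigram term relative to the LM term
    gamma = w_len / w_lm   # weight on 1/n relative to the LM term
    return beta, gamma


# Synthetic, noise-free data generated from known parameters.
rng = np.random.default_rng(0)
n = rng.integers(5, 20, size=200).astype(float)
log_p_unigram = -rng.uniform(5.0, 9.0, size=200) * n
log_p_lm = -rng.uniform(2.0, 5.0, size=200) * n
true_beta, true_gamma = 0.6, 3.0
ratings = 0.5 * (log_p_lm - true_beta * log_p_unigram + true_gamma) / n + 4.0

beta_hat, gamma_hat = fit_beta_gamma(log_p_lm, log_p_unigram, n, ratings)
print("estimated beta:", beta_hat)
print("estimated gamma:", gamma_hat)
```

Because the synthetic ratings contain no noise, the fit recovers the generating values; with real judgments the estimates would carry uncertainty, and comparing them across models is what reveals how much correction each model needs.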
Performance and Significance
The significance of MORCELA becomes evident when considering its performance across different LM sizes. MORCELA outperformed SLOR in predicting human acceptability judgments for models from two well-known families: Pythia and OPT. Results showed that as models grew larger, MORCELA’s correlation with human judgments improved. The optimal parameter values estimated by MORCELA revealed that larger LMs are more robust to frequency and length effects, requiring less correction. This suggests that larger LMs have a better understanding of linguistic context, allowing them to predict the acceptability of rare words more accurately, thereby reducing the impact of unigram frequency as a confounding factor. MORCELA improved the correlation between LM-generated scores and human judgments by up to 46% compared to SLOR, demonstrating its ability to fine-tune corrections more precisely.
This advancement is important for several reasons. First, it suggests that current LMs may be more capable of reflecting human language processing than previously thought, provided the right corrections are applied. Second, the insights from MORCELA can be valuable in psycholinguistic studies that utilize LMs as proxies for human language comprehension. By providing a more accurate linking theory, MORCELA ensures that LMs are evaluated in a way that aligns more closely with human linguistic intuition. For instance, a key result from MORCELA’s implementation showed that larger LMs had a lower reliance on unigram frequency corrections, indicating that these models have a better grasp of less frequent, context-specific words. This characteristic could significantly impact how we interpret LMs in tasks involving rare or domain-specific language.
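The comparison described in this section comes down to correlating each linking theory's scores with human ratings and then comparing the two correlations. A minimal sketch follows, using Pearson correlation and a relative-gain measure. Both choices are assumptions: the article does not state which correlation statistic was used or whether the 46% figure is a relative or an absolute improvement. The data are made up and do not reproduce any reported result.

```python
import numpy as np


def pearson(scores, ratings):
    """Pearson correlation between model scores and human ratings."""
    return float(np.corrcoef(scores, ratings)[0, 1])


def relative_gain(r_baseline, r_new):
    """Relative improvement of r_new over r_baseline (one possible reading)."""
    return (r_new - r_baseline) / r_baseline


# Made-up ratings and scores for five sentences.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
baseline_scores = [0.2, 0.1, 0.5, 0.4, 0.6]   # e.g. a fixed-correction score
adjusted_scores = [0.1, 0.2, 0.3, 0.5, 0.6]   # e.g. a fitted-correction score

r_base = pearson(baseline_scores, human)
r_adj = pearson(adjusted_scores, human)
print("baseline r:", round(r_base, 3))
print("adjusted r:", round(r_adj, 3))
print("relative gain:", round(relative_gain(r_base, r_adj), 3))
```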
Conclusion
MORCELA represents an important development in aligning language models with human acceptability judgments. Using learned parameters to adjust dynamically for length and frequency addresses critical flaws in previous approaches like SLOR. The results show that, with proper adjustment, LMs can better reflect human linguistic intuition, particularly as the models scale in size. Future work could explore further adjustments or new parameters that could bring LMs even closer to human-like language understanding. MORCELA not only enhances the evaluation process for LMs but also provides valuable insights into how these models process language, bridging the gap between machine-generated probabilities and human language behavior.