Table manners at scale: introducing Lora Lens for efficient analysis & detoxification of language models
Pettyjohn, Jordan
Date Issued
2025
Embargo Expires
2026-11-11
Abstract
Transformer-based language models have driven significant advances across numerous natural language processing (NLP) tasks. However, as these models grow in scale and complexity, ensuring interpretability and mitigating toxic outputs become increasingly critical challenges. This thesis addresses these issues by first analyzing the role attention heads play in propagating toxicity within models, leveraging a recently proposed interpretability tool known as Attention Lens. By decoding attention head outputs into human-interpretable tokens, we identify specific attention heads contributing disproportionately to toxic content generation and demonstrate that targeted interventions at the head level significantly reduce toxicity without requiring complete model retraining.
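For illustration only, the sketch below shows the kind of head-level decoding the paragraph describes: capturing one attention head's output and projecting it into vocabulary space, logit-lens style. The model choice (gpt2), the layer/head indices, and the use of a slice of the unembedding matrix as a stand-in for a trained lens are all assumptions, not the thesis's actual Attention Lens implementation.

```python
# Hypothetical sketch: decode one attention head's output into top vocabulary
# tokens. Model, hook target, and lens weights are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed small model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer, head = 6, 3                       # hypothetical head to inspect
d_model = model.config.n_embd
d_head = d_model // model.config.n_head

captured = {}

def grab_head_output(module, inputs, output):
    # The input to c_proj is the concatenation of all head outputs
    # (batch, seq, d_model); slice out the head we care about.
    hidden = inputs[0]
    captured["head"] = hidden[..., head * d_head:(head + 1) * d_head]

hook = model.transformer.h[layer].attn.c_proj.register_forward_hook(grab_head_output)
with torch.no_grad():
    model(**tok("The protest turned", return_tensors="pt"))
hook.remove()

# Project the head's output through the matching slice of the unembedding
# matrix (a stand-in for a trained per-head lens) and read off top tokens.
W_U = model.lm_head.weight               # (vocab, d_model)
logits = captured["head"][0, -1] @ W_U[:, head * d_head:(head + 1) * d_head].T
print(tok.convert_ids_to_tokens(logits.topk(5).indices.tolist()))
```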
Despite its effectiveness, Attention Lens faces severe limitations in scalability and computational efficiency. To overcome these limitations, we propose and implement the Lora Lens, an innovative adaptation of Attention Lens employing Low-Rank Adaptation (LoRA) to drastically reduce memory footprint and computational cost. Specifically designed for compatibility with large-scale models such as Llama 3 8B, Lora Lens integrates seamlessly with HuggingFace's transformers library and Microsoft's DeepSpeed framework, enabling efficient distributed training. Our results demonstrate that Lora Lens maintains the interpretative capabilities of Attention Lens while significantly enhancing its efficiency and scalability, allowing practical deployment on models with billions of parameters.
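Again purely illustrative, a minimal sketch of the low-rank idea attributed to Lora Lens in the paragraph above: rather than learning a full d_model-by-vocabulary projection per attention head, factor it into two small matrices. The class name, rank, input dimension, and Llama-3-8B-like sizes are assumptions for the sake of the example, not the thesis's implementation.

```python
# Hypothetical sketch of a low-rank lens: a full (d_model x vocab) projection
# is replaced by two factors (d_model x r) and (r x vocab), cutting trainable
# parameters roughly by a factor of d_model*vocab / (r*(d_model + vocab)).
import torch
import torch.nn as nn

class LowRankLens(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)    # d_model -> r
        self.up = nn.Linear(rank, vocab_size, bias=False)   # r -> vocab

    def forward(self, head_output: torch.Tensor) -> torch.Tensor:
        # head_output: (batch, seq, d_model); returns vocabulary logits.
        return self.up(self.down(head_output))

# Parameter comparison for Llama-3-8B-like dimensions (assumed values).
d_model, vocab, rank = 4096, 128_256, 16
full = d_model * vocab
low_rank = rank * (d_model + vocab)
print(f"full lens: {full:,} params, low-rank lens: {low_rank:,} params")
```

Under these assumed dimensions the factorized lens needs roughly two million parameters per head instead of over half a billion, which is the kind of memory saving that makes per-head lenses practical at the scale of models like Llama 3 8B.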
Ultimately, this work contributes a practical, scalable interpretability technique, enabling researchers and practitioners to better understand, evaluate, and safely deploy large transformer models.
Rights
Copyright of the original work is retained by the author.
