Interpreting and Controlling Linguistic Features in Neural Networks’ Representations
Speaker:
Tomasz Limisiewicz (ÚFAL MFF UK)
Abstract:
Neural networks have achieved state-of-the-art results in a variety of natural language processing tasks. Nevertheless, neural models are black boxes; we do not understand the mechanisms behind their successes. I will present tools and methodologies used to interpret black-box models. The talk will primarily focus on the representations of Transformer-based language models and on our novel method, the orthogonal probe, which offers good insight into the network's hidden states. The results show that specific linguistic signals are encoded distinctly in the Transformer, so we can effectively separate their representations. Additionally, we demonstrate that our findings generalize to a diverse set of languages. Identifying the specific information encoded in the network allows us to remove unwanted biases from the representation. Such an intervention increases system reliability in high-stakes applications.
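
For readers unfamiliar with probing, the sketch below illustrates the general idea behind an orthogonal structural probe, assuming PyTorch: hidden states are rotated by a learned orthogonal matrix and scaled per dimension, and squared distances in the resulting space are trained to approximate syntactic-tree distances. The class name, dimensions, and loss are illustrative assumptions, not the exact formulation presented in the talk.

```python
import torch
import torch.nn as nn

class OrthogonalProbe(nn.Module):
    """Minimal sketch of an orthogonal structural probe (illustrative only).

    Hidden states are rotated by an orthogonal matrix and then scaled
    dimension-wise; squared L2 distances in the transformed space are
    trained to match syntactic-tree distances between words.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Rotation kept orthogonal by PyTorch's built-in parametrization.
        self.rotation = nn.Linear(hidden_dim, hidden_dim, bias=False)
        torch.nn.utils.parametrizations.orthogonal(self.rotation)
        # Per-dimension scaling: near-zero entries mark dimensions that
        # carry no signal for the probed linguistic property.
        self.scale = nn.Parameter(torch.ones(hidden_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (seq_len, hidden_dim) hidden states for one sentence.
        return self.rotation(h) * self.scale

    def pairwise_sq_distances(self, h: torch.Tensor) -> torch.Tensor:
        z = self.forward(h)                     # (seq_len, dim)
        diff = z.unsqueeze(0) - z.unsqueeze(1)  # (seq_len, seq_len, dim)
        return (diff ** 2).sum(-1)              # predicted tree distances

if __name__ == "__main__":
    probe = OrthogonalProbe(hidden_dim=768)
    h = torch.randn(12, 768)                       # fake hidden states, 12 tokens
    gold = torch.randint(1, 6, (12, 12)).float()   # fake gold tree distances
    gold.fill_diagonal_(0)
    loss = (probe.pairwise_sq_distances(h) - gold).abs().mean()
    loss.backward()
    print(loss.item())
```

Because the rotation is orthogonal, it cannot mix up or collapse information; the diagonal scaling alone selects which dimensions matter, which is what makes it possible to localize a linguistic signal and, in principle, separate or remove it from the representation.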