Evaluating Attribution Methods using White-Box LSTMs

2020 
Interpretability methods for neural networks are difficult to evaluate because we do not understand the black-box models typically used to test them. This paper proposes a framework in which interpretability methods are evaluated using manually constructed networks, which we call white-box networks, whose behavior is understood a priori. We evaluate five methods for producing attribution heatmaps by applying them to white-box LSTM classifiers for tasks based on formal languages. Although our white-box classifiers solve their tasks perfectly and transparently, we find that all five attribution methods fail to produce the expected model explanations.
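The abstract does not name the five attribution methods. As a concrete illustration of the kind of attribution heatmap being evaluated, below is a minimal sketch of one common gradient-based method (gradient × input) applied to a toy LSTM classifier over one-hot token encodings. Everything here (model sizes, names, and the choice of method) is an illustrative assumption, not the paper's setup.

```python
# Minimal sketch: gradient x input attribution for a toy LSTM classifier.
# All sizes and names are illustrative assumptions, not the paper's models.
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN, NUM_CLASSES = 4, 8, 2

class ToyLSTMClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(VOCAB_SIZE, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, NUM_CLASSES)

    def forward(self, x_onehot):
        # x_onehot: (batch, seq_len, VOCAB_SIZE)
        _, (h_n, _) = self.lstm(x_onehot)
        return self.head(h_n[-1])  # logits from the final hidden state

def gradient_x_input_heatmap(model, tokens):
    """One attribution score per token position:
    d(logit of predicted class)/d(input) * input, summed over the vocab axis."""
    x = nn.functional.one_hot(tokens, VOCAB_SIZE).float().unsqueeze(0)
    x.requires_grad_(True)
    logits = model(x)
    pred = logits.argmax(dim=-1).item()
    logits[0, pred].backward()
    return (x.grad * x).sum(dim=-1).squeeze(0)

model = ToyLSTMClassifier()
tokens = torch.tensor([0, 1, 2, 3, 1])  # a toy input string over the vocabulary
print(gradient_x_input_heatmap(model, tokens))
```

A white-box evaluation in the paper's spirit would compare heatmaps like this against the token positions known a priori to drive the hand-constructed classifier's decision, rather than trusting the heatmap at face value.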