Data Loss Prevention Based on Text Classification in Controlled Environments

Kyrre Wahl Kongsgård, Nils Agne Nordbotten, Federico Mancini, Paal E. Engelstad

2016

Loss of sensitive data is a common problem with potentially severe consequences. When documents are categorized according to their sensitivity, security controls can be applied based on this classification. However, errors in the classification process may effectively result in information leakage. While automated classification techniques can be used to mitigate this risk, little work has been done to evaluate the effectiveness of such techniques when sensitive content has been transformed (e.g., when a document has been summarized or rewritten, or has had paragraphs copy-pasted into a new document). To better handle these more difficult data leaks, this paper proposes the use of controlled environments to detect misclassification. By monitoring the incoming information flow, the documents imported into a controlled environment can be used to better determine the sensitivity of the document(s) created within the same environment. Our evaluation results show that this approach, using techniques from machine learning and information retrieval, provides improved detection of incorrectly classified documents that have been subject to more complex data transformations.
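
The idea of using imported documents to check the label of a newly created document can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the features, similarity measure, or decision rule, so the TF-IDF weighting, the cosine similarity, the integer sensitivity levels, the `threshold` value, and the function name `flag_misclassification` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document, with smoothed IDF."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1.0 for t in doc_freq}
    return [{t: c * idf[t] for t, c in counts.items()} for counts in counts_per_doc]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_misclassification(imported, created_text, assigned_level, threshold=0.3):
    """Hypothetical decision rule (an assumption, not taken from the paper).

    imported: list of (text, level) pairs for documents brought into the
    controlled environment; level is an int, higher meaning more sensitive.
    The created document inherits the highest level among imported documents
    whose similarity to it reaches the threshold, and is flagged when the
    label assigned by the user is lower than that inferred level.
    """
    texts = [text for text, _ in imported] + [created_text]
    vectors = tfidf_vectors(texts)
    created_vec = vectors[-1]
    inferred = 0
    for (_, level), vec in zip(imported, vectors):
        if cosine(vec, created_vec) >= threshold:
            inferred = max(inferred, level)
    return inferred > assigned_level, inferred
```

In this sketch, a document that reuses much of the wording of a high-sensitivity import but is labelled at a lower level would be flagged for review. Bag-of-words similarity of this kind is likely to miss heavily rewritten or summarized content, which is the harder case the abstract is concerned with.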
