Identifying textual personal information using bidirectional LSTM networks

2018 
Data-driven approaches based on data collected from individuals are improving everyday life as a result of developments in big data studies. Before such an approach is developed, personal information must be removed from the data, since personal information contained in the data would jeopardize people's privacy and may harm the individuals concerned. Especially in the health sciences, identifying personal information in collected data is a difficult task because most of the data collected in hospitals is in plain-text format. In this work, a method for automatically identifying words that contain personal information is proposed. The proposed method uses natural language processing techniques and bidirectional long short-term memory (LSTM) networks. The method was developed using a de-identification challenge dataset composed of the discharge summaries of 889 patients. It identifies words that contain personal information from their surrounding words, without using dictionaries such as name lists or city lists. Tests at the end of the study show that the proposed method can identify words containing personal information with an accuracy of 99.43%.
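To illustrate how a bidirectional LSTM can score a word from its surrounding words, here is a minimal sketch in plain Python. It is not the authors' implementation. The weights are random and untrained, so the scores carry no meaning. The names `LSTMCell`, `run`, and `bilstm_tag`, the embedding scheme, and the layer sizes are all assumptions made for illustration. The sketch only shows the data flow: one LSTM reads the sentence left to right, another reads it right to left, and the two hidden states at each position are concatenated before a per-word score is computed.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(w, v):
    return sum(a * b for a, b in zip(w, v))


class LSTMCell:
    """A single LSTM cell with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and bias per gate:
        # i = input, f = forget, o = output, c = candidate cell state.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifoc"}
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state

        def gate(g, act):
            return [act(dot(row, z) + b) for row, b in zip(self.W[g], self.b[g])]

        i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
        c_hat = gate("c", math.tanh)
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, c_hat)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def run(cell, inputs):
    """Run the cell over a sequence and return the hidden state at each step."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    outputs = []
    for x in inputs:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs


def embed(token, size):
    # Hypothetical stand-in for learned word embeddings: a fixed
    # pseudo-random vector derived from the token string itself.
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


def bilstm_tag(tokens, embed_size=8, hidden_size=4, seed=0):
    """Return one score in (0, 1) per token, computed from both contexts."""
    rng = random.Random(seed)
    xs = [embed(t.lower(), embed_size) for t in tokens]
    fwd_cell = LSTMCell(embed_size, hidden_size, rng)
    bwd_cell = LSTMCell(embed_size, hidden_size, rng)
    fwd = run(fwd_cell, xs)              # left-to-right pass
    bwd = run(bwd_cell, xs[::-1])[::-1]  # right-to-left pass, re-aligned
    w_out = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_size)]
    # Each word is scored from the concatenated forward and backward states.
    return [sigmoid(dot(w_out, f + b)) for f, b in zip(fwd, bwd)]


if __name__ == "__main__":
    sentence = "Mr. Smith was admitted to General Hospital".split()
    for token, score in zip(sentence, bilstm_tag(sentence)):
        print(f"{token:10s} {score:.4f}")
```

In a trained model, the weights and embeddings would be learned from labeled text such as the discharge summaries, and a score above a chosen threshold would mark the word as containing personal information. Because the backward pass reaches each position from the end of the sentence, changing a later word changes the score of an earlier one. This is what allows context-based identification without name or city dictionaries.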