Tags: confidential information detection, deep learning and word embedding
Abstract:
Confidential information firewalling with text classifiers aims to recognize text containing confidential information whose publication might threaten national security, business trade, or personal life. Word embedding is an important component of such a detector. Existing word embeddings, e.g., Word2Vec, fail to learn a clear classification boundary for this task: words with opposite confidentiality polarities can have embedding vectors that lie close to each other. We propose CES2Vec, a confidentiality-oriented word embedding for confidential information detection. CES2Vec embeds confidentiality into semantics so that both are captured together, yielding word embeddings with a clear task classification boundary. We compare CES2Vec with popular methods on real-world data from WikiLeaks. The experimental results show that the proposed method outperforms previously reported methods in detecting confidential information.
CES2Vec: A Confidentiality-Oriented Word Embedding for Confidential Information Detection