Optical Character Recognition for a Redaction System Using Machine Learning Techniques.

EasyChair Preprint no. 3495

12 pages. Date: May 28, 2020


This paper presents the use of Optical Character Recognition (OCR) in an automatic redaction system. A redactor is a system that takes an electronic document as input from the user and identifies sensitive information, mainly nouns, such as person names, country names, gender, credit card information, phone numbers, email IDs, and any other confidential information that should not be shown to the recipient of the document. Initially, the user inputs a document, typically an image. This image is pre-processed and passed to the OCR engine, which extracts the text from the image. Extracting the text is therefore the very first step towards identifying the sensitive information. Redaction is thus a major application of OCR: the information present in documents can be read with the help of an OCR engine.

Keyphrases: machine learning, Named Entity Recognition, Natural Language Processing, Optical Character Recognition
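
The redaction step that follows OCR can be illustrated with a minimal sketch. It assumes the text has already been extracted from the image, and it uses simple regular expressions as a stand-in for the Named Entity Recognition approach named in the keyphrases; the pattern set, the labels, and the `redact` function are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical patterns for illustration only. The paper relies on
# machine learning / NER; regexes are a simplified stand-in here.
# Order matters: card numbers are masked before the looser phone pattern runs.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d -]{8,}\d"),
}


def redact(text: str) -> str:
    """Replace each match of a sensitive pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    # Stands in for the output of the OCR step.
    ocr_text = (
        "Contact john.doe@example.com or call +1 555 123 4567. "
        "Card: 4111 1111 1111 1111."
    )
    print(redact(ocr_text))
```

Entities such as person or country names cannot be captured by fixed patterns like these, which is why a trained NER model would be needed for them.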

