An aviation manufacturer wanted an online repository to manage documents in PDF and image formats. The core function of the application was to extract data from the documents using OCR (Optical Character Recognition) and store it in the database.
The major requirements were:
- The application should be able to save the document in image format
- Users should be able to find documents by searching their full text
- Each document should have a watermark of the company
- Document categories should be determined and prioritized by OCR
- OCR should be at least 80% accurate
The Challenge
One of the main challenges was running Tesseract OCR on documents that were of poor quality. The images had background noise, low resolution and alignment issues, all of which affected OCR accuracy.
As a result, Tesseract OCR initially achieved an accuracy of only 30% to 40%.
Solution
After thorough research, vteams engineers Rizwan Mehdi and Faran Ahmed first opted to preprocess the images with ImageMagick before running OCR. ImageMagick was used to remove background noise and improve the overall quality of the images, which in turn improved the accuracy of the extracted data.
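The case study does not list the exact ImageMagick operations the team applied, so the following Python sketch only illustrates what such a preprocessing step might look like. It assumes the `convert` and `tesseract` command-line tools are installed (ImageMagick 7 installs the entry point as `magick`). The chosen options (grayscale, upscaling, deskew, despeckle, normalize), the function names and the file names are all assumptions, not the vteams configuration.

```python
import subprocess


def build_cleanup_command(src, dst, scale_percent=200):
    """Build an ImageMagick command that cleans a scan before OCR.

    The option set is an assumed, typical choice; it is not taken
    from the case study.
    """
    return [
        "convert", src,
        "-colorspace", "Gray",           # drop colour information
        "-resize", f"{scale_percent}%",  # upscale low-resolution scans
        "-deskew", "40%",                # straighten slightly rotated pages
        "-despeckle",                    # reduce speckle noise
        "-normalize",                    # stretch contrast
        dst,
    ]


def build_ocr_command(image, output_base, lang="eng"):
    """Build a Tesseract command; the text is written to <output_base>.txt."""
    return ["tesseract", image, output_base, "-l", lang]


def preprocess_and_ocr(src, cleaned, output_base):
    """Run both steps; raises CalledProcessError if either tool fails."""
    subprocess.run(build_cleanup_command(src, cleaned), check=True)
    subprocess.run(build_ocr_command(cleaned, output_base), check=True)
```

Keeping the command construction separate from the `subprocess.run` calls makes it easy to try different option sets on a few sample scans and compare the OCR output before settling on one.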
Next, keeping the quality of the images in mind, the team trained Tesseract on custom training data. jTessBoxEditor and VietOCR3 were used to sample the data and train the OCR engine on the available images.
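Tesseract training of this kind typically works with box files, which pair each character with its bounding box on the page, and jTessBoxEditor is a tool for correcting them. As a rough illustration, the sketch below parses lines in that format (symbol, left, bottom, right, top, page, with coordinates measured from the bottom-left corner of the image). `BoxEntry`, `parse_box_line` and `parse_box_text` are hypothetical names introduced here; the case study does not describe the team's actual training files.

```python
from typing import NamedTuple


class BoxEntry(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line such as 'A 10 20 30 40 0'."""
    # Split from the right so the symbol itself is kept intact.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return BoxEntry(symbol, left, bottom, right, top, page)


def parse_box_text(text):
    """Parse a whole box file, skipping blank lines."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]
```

A parser like this could be used to check a corrected box file, for example to count how many samples of each character the training set contains.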
Conclusion
The main objective of the application was to achieve an accuracy of at least 80% with OCR. With the image preprocessing and training steps described above, vteams exceeded this target: OCR extracted data from images with 95% accuracy.
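The case study does not say how accuracy was measured. One common approach, shown here purely as an assumption, is character-level accuracy: compare the OCR output with a hand-typed reference transcript and count the edits needed to turn one into the other. The function names and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_accuracy(reference, ocr_output):
    """Fraction of reference characters recovered, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    distance = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - distance / len(reference))


def meets_target(reference, ocr_output, target=0.80):
    """Check the output against an accuracy target such as the 80% requirement."""
    return char_accuracy(reference, ocr_output) >= target
```

For instance, if the reference is `PART-1234` and the OCR output is `PART-1Z34`, one character out of nine is wrong, which gives roughly 89% accuracy under this metric and would pass an 80% target.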