
Digitize Your Sources (DIY)

A library guide to DIY tools and resources for digitizing sources in preparation for digital scholarship.

Make PDFs Searchable with OCR

OCR (Optical Character Recognition) is a technology that converts different types of text documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data.

Why is this helpful?

When OCR converts a scanned image into digital text, the document becomes searchable, which makes navigation much easier.

For example, you could quickly find the page number of a favorite quote or locate every use of a certain keyword within a large document.

You can also copy, paste, and edit passages of text within the document.
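
To illustrate the kind of keyword lookup a text layer makes possible, here is a minimal Python sketch. It assumes you have already exported the OCR text of each page as a string; the function name `find_keyword_pages` and the sample pages are hypothetical and are not part of any OCR product.

```python
import re


def find_keyword_pages(pages, keyword):
    """Return {page_number: match_count} for pages that contain the keyword.

    `pages` is a list of strings, one per page, in page order. This is an
    assumed input format: the text an OCR tool produced for each page.
    Matching is case-insensitive and whole-word only.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = {}
    for number, text in enumerate(pages, start=1):  # page numbers start at 1
        count = len(pattern.findall(text))
        if count:
            hits[number] = count
    return hits


# Hypothetical OCR output for a three-page document.
sample_pages = [
    "The archive opened in 1902.",
    "Each archive box was labeled by hand. The archive index survives.",
    "No relevant terms appear here.",
]

print(find_keyword_pages(sample_pages, "archive"))  # {1: 1, 2: 2}
```

A search like this only works on digital text. Run against an image-only scan, it would find nothing, because there is no text to match.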

Documents can be converted to digital text using OCR software such as Adobe Acrobat Pro DC.

Best Practices and Limitations

Scan Setting Recommendations for OCR:

  • 300 dpi resolution is generally recommended for accuracy. If the font size is below 10 pt, 400 dpi is recommended.
  • Grayscale is recommended over black-and-white (B/W) because it preserves more detail. If your document has color images, scan in color mode.
  • PDF, TIFF, and PNG are recommended because they can store scans without lossy compression. JPEGs lose quality with each edit and save.
  • A medium brightness of 50% is suitable for most scans. Brightness that is too high or low can negatively affect accuracy.
  • Straightness of the initial scan can affect OCR quality. Skewed pages can lead to inaccurate recognition.
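
The resolution and color recommendations above can be sketched as a small Python helper, together with the arithmetic behind dpi. The function names are hypothetical. The only facts used beyond this guide are that dpi means pixels per inch and that one point is 1/72 of an inch. Font size is a nominal measure, so the pixel heights are approximate.

```python
def recommend_scan_settings(font_size_pt, has_color_images=False):
    """Pick scan settings following the recommendations in this guide."""
    dpi = 400 if font_size_pt < 10 else 300  # small type needs more resolution
    mode = "color" if has_color_images else "grayscale"
    return {"dpi": dpi, "mode": mode, "brightness_percent": 50}


def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a scanned page: inches times dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)


def text_height_px(font_size_pt, dpi):
    """Approximate nominal text height in pixels (1 point = 1/72 inch)."""
    return font_size_pt * dpi / 72


print(recommend_scan_settings(8))
# {'dpi': 400, 'mode': 'grayscale', 'brightness_percent': 50}
print(page_pixels(8.5, 11, 300))  # (2550, 3300) for a US Letter page
print(text_height_px(12, 300))    # 50.0
print(text_height_px(9, 400))     # 50.0
```

The last two lines suggest one way to read the 400 dpi recommendation: 9 pt text scanned at 400 dpi covers about as many pixels as 12 pt text scanned at 300 dpi.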

Some Limitations of OCR:

  • OCR may not convert characters with very large or very small font sizes.
  • OCR works best with good quality typed documents. Handwritten documents cannot be easily read by OCR software.
  • Language: texts published before 1850 may be less compatible with OCR software.
  • Noise: speckles, streaks, watermarks, stamps, and other marks that are not part of the text can interfere with OCR. This can include handwritten notes, circled text, and other notations that are sometimes added to a document before scanning.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.