Free · Tesseract OCR · 100% In-Browser

Extract Text from PDF with OCR
Scanned Documents, Unlocked

Recognise and extract text from scanned PDFs and image-based documents using Tesseract OCR -- the world's leading open-source OCR engine. Supports 100+ languages. Nothing is uploaded to any server.

100+ Languages
TXT / DOCX Output Formats
0 KB Data Sent to Server
Free Always

PDF OCR Text Extractor

Upload a scanned PDF -- choose language -- extract text instantly

Drop your scanned PDF here

or click to browse from your device

Scanned PDFs · Image PDFs · Handwritten Docs · Forms · 100+ Languages

Extracted Text

Simple Process

PDF OCR in Three Steps

1

Upload Your PDF

Drop any scanned PDF, image-based PDF or digital PDF with locked text. PDF.js renders each page to a canvas at your chosen quality, preparing it for Tesseract OCR processing.

2

Choose Language and Range

Select the OCR language matching your document -- 20 languages available including English, Chinese, Arabic, Russian and Hindi. Choose render quality and which pages to process.
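As a rough illustration of the page-range choice, a custom range could be parsed along these lines. This is only a sketch: the `1-3,5` input syntax and the `parsePageRange` name are assumptions made for the example, not a description of how this tool is implemented.

```typescript
// Hypothetical helper: turn a range string such as "1-3,5" into page numbers.
// The input syntax is an assumption made for this example.
function parsePageRange(input: string, pageCount: number): number[] {
  const pages: number[] = [];
  for (const part of input.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:\s*-\s*(\d+))?$/.exec(token);
    if (match === null) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    const start = parseInt(match[1], 10);
    const end = match[2] !== undefined ? parseInt(match[2], 10) : start;
    if (start < 1 || end < start || end > pageCount) {
      throw new Error(`Page range out of bounds: "${token}"`);
    }
    for (let p = start; p <= end; p++) {
      if (pages.indexOf(p) === -1) pages.push(p); // skip duplicates
    }
  }
  return pages.sort((a, b) => a - b);
}
```

For a 10-page document, `parsePageRange("1-3, 5", 10)` yields pages 1, 2, 3 and 5, while a range that runs past the last page throws an error instead of being silently clipped.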

3

Extract and Download

Tesseract OCR processes each page individually. A live preview shows extracted text as pages complete. Download as plain .txt or structured Word .docx when done.

Why Choose Us

OCR That Runs in Your Browser

Powered by Tesseract.js -- the JavaScript port of the industry-leading Tesseract OCR engine, originally developed at HP Labs, later sponsored by Google and now maintained by the open-source community. Runs entirely in your browser.

100% Private

Tesseract.js runs the entire OCR neural network inside your browser. Your PDF pages -- whether they contain confidential contracts, medical records, financial statements or personal documents -- never leave your device.

Tesseract OCR Engine

Tesseract is the world's most widely used open-source OCR engine. It was developed at HP Labs between 1985 and 1994, released as open source by HP in 2005, developed further with Google's sponsorship from 2006, and is now maintained by the open-source community. Tesseract.js brings this battle-tested engine to the browser via WebAssembly.

20+ Languages

Recognise text in English, Spanish, French, German, Italian, Portuguese, Russian, Arabic, Chinese (Simplified and Traditional), Japanese, Korean, Hindi, Turkish, Polish, Dutch and more. Language data downloads automatically on first use.
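Under the hood, Tesseract identifies each language by a short code that matches its trained data file, and several codes can be joined with `+` for mixed-language documents. The sketch below maps the languages listed above to those codes. The display names and the `toTesseractLangs` helper are assumptions for illustration; the codes themselves are Tesseract's standard identifiers.

```typescript
// Tesseract language codes for the languages named above.
// The display-name keys are illustrative; the codes are Tesseract's own.
const LANGUAGE_CODES: Record<string, string> = {
  "English": "eng",
  "Spanish": "spa",
  "French": "fra",
  "German": "deu",
  "Italian": "ita",
  "Portuguese": "por",
  "Russian": "rus",
  "Arabic": "ara",
  "Chinese (Simplified)": "chi_sim",
  "Chinese (Traditional)": "chi_tra",
  "Japanese": "jpn",
  "Korean": "kor",
  "Hindi": "hin",
  "Turkish": "tur",
  "Polish": "pol",
  "Dutch": "nld",
};

// Hypothetical helper: build the "+"-joined language string Tesseract expects.
function toTesseractLangs(names: string[]): string {
  const codes = names.map((name) => {
    if (!Object.prototype.hasOwnProperty.call(LANGUAGE_CODES, name)) {
      throw new Error(`Unsupported language: ${name}`);
    }
    return LANGUAGE_CODES[name];
  });
  return codes.join("+");
}
```

A document mixing English and French would use `toTesseractLangs(["English", "French"])`, which produces `eng+fra`.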

Per-Page Progress

Each page shows as a status chip -- pending, active (currently processing) or complete. The progress bar and status message update in real time so you know exactly which page Tesseract is processing and how much remains.
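The three chip states and the progress figure can be modelled with a few small functions. This is a minimal sketch of one way to track per-page status; the type and function names are hypothetical and are not taken from the tool's source.

```typescript
type PageStatus = "pending" | "active" | "complete";

interface PageChip {
  page: number;
  status: PageStatus;
}

// Every selected page starts out as "pending".
function createChips(pages: number[]): PageChip[] {
  return pages.map((page): PageChip => ({ page, status: "pending" }));
}

// Return a new chip list with one page moved to the given status.
function setStatus(chips: PageChip[], page: number, status: PageStatus): PageChip[] {
  return chips.map((chip): PageChip =>
    chip.page === page ? { page: chip.page, status } : chip
  );
}

// Percentage of pages that have finished, for the progress bar.
function progressPercent(chips: PageChip[]): number {
  if (chips.length === 0) return 0;
  const done = chips.filter((c) => c.status === "complete").length;
  return Math.round((done / chips.length) * 100);
}

// Human-readable status line describing the page currently being processed.
function statusMessage(chips: PageChip[]): string {
  const done = chips.filter((c) => c.status === "complete").length;
  const active = chips.filter((c) => c.status === "active")[0];
  if (active !== undefined) {
    return `Processing page ${active.page} (${done} of ${chips.length} complete)`;
  }
  return done === chips.length ? "All pages complete" : "Waiting to start";
}
```

Keeping the chip list immutable (each update returns a new array) makes it straightforward to re-render the chips and the progress bar from a single source of truth after every page.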

TXT and DOCX Output

Download extracted text as .txt for maximum compatibility with any text editor, database or content system. Or download as a structured .docx Word document with each page's text in a separate section, ready to edit in Microsoft Word or Google Docs.
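For the plain-text output, the per-page results only need to be ordered and joined. The sketch below shows one possible layout; the `--- Page N ---` header format and the `buildTxt` name are assumptions for illustration, not the tool's actual file format.

```typescript
interface PageText {
  page: number;
  text: string;
}

// Join per-page OCR text into one plain-text document.
// Pages are sorted by number and each gets a header line (format is illustrative).
function buildTxt(pages: PageText[]): string {
  const body = pages
    .slice() // do not mutate the caller's array
    .sort((a, b) => a.page - b.page)
    .map((p) => `--- Page ${p.page} ---\n${p.text.trim()}`)
    .join("\n\n");
  return body + "\n";
}
```

In a browser, the resulting string could then be wrapped in a `Blob` with type `text/plain` and offered as a download, so the text still never has to leave the device.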

Adjustable Render Quality

OCR accuracy depends heavily on image resolution. PDF pages are measured in points (72 per inch), so a 1.5x render is roughly 108 DPI (faster, suited to quick extraction of clean printed text) and a 3x render is roughly 216 DPI (slower, for maximum accuracy on faded, small-font or handwritten documents).
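The DPI figures follow from the PDF convention of 72 points per inch: the effective resolution is simply 72 multiplied by the render scale. A short sketch of that arithmetic, with hypothetical function names, also shows how large the rendered canvas becomes:

```typescript
// PDF user space is measured in points: 1 point = 1/72 inch.
const PDF_POINTS_PER_INCH = 72;

// Effective DPI of a page rendered at the given scale factor.
function scaleToDpi(scale: number): number {
  return PDF_POINTS_PER_INCH * scale;
}

// Canvas size in pixels for a page of the given size in points,
// rounded up so no part of the page is clipped.
function canvasSize(
  widthPt: number,
  heightPt: number,
  scale: number
): { width: number; height: number } {
  return {
    width: Math.ceil(widthPt * scale),
    height: Math.ceil(heightPt * scale),
  };
}
```

A US Letter page (612 x 792 points) rendered at 3x becomes a 1836 x 2376 pixel canvas, four times the pixel count of the same page at 1.5x, which is why the higher setting is noticeably slower.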

What Is OCR and Why Does It Matter?

OCR (Optical Character Recognition) is the process of recognising and converting text in images into machine-readable characters. When you scan a physical document with a flatbed scanner, multifunction printer or smartphone camera, the result is a raster image -- a grid of pixels that looks like text to the human eye but contains no actual character codes that a computer can read, search or edit. OCR analyses the pixel patterns and identifies which characters they represent, producing a text string that can be stored, searched, copied and processed computationally.

PDF documents fall into two fundamental categories that determine whether OCR is needed. Text-based PDFs (created digitally by exporting from Word, LaTeX, accounting software or any application that generates PDF directly) contain actual text objects with Unicode character codes -- you can select, copy and search the text natively without OCR. Image-based PDFs (created by scanning physical documents, or by combining images into a PDF container) contain page images with no underlying text objects -- they appear visually identical to text-based PDFs but are actually photographs of text pages. OCR is required to extract text from image-based PDFs.
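One simple way to tell the two categories apart is to check how much text a direct extraction returns for each page: a page with almost none is probably an image and needs OCR. The sketch below assumes the text strings have already been pulled from each page (with PDF.js, for example, they would come from the `str` field of the items returned by `page.getTextContent()`). The function names and the 20-character threshold are arbitrary assumptions chosen for illustration.

```typescript
// Decide whether a page needs OCR, given the text strings extracted directly from it.
// minChars is an arbitrary illustrative threshold, not a tuned value.
function pageNeedsOcr(textItems: string[], minChars: number = 20): boolean {
  const visibleChars = textItems.join("").replace(/\s+/g, "").length;
  return visibleChars < minChars;
}

// Return the 1-based numbers of the pages that look image-based.
function pagesNeedingOcr(pagesText: string[][], minChars: number = 20): number[] {
  const result: number[] = [];
  pagesText.forEach((items, index) => {
    if (pageNeedsOcr(items, minChars)) result.push(index + 1);
  });
  return result;
}
```

A heuristic like this will misclassify some pages, such as a scanned page carrying only a small digital page number, so it is best treated as a hint rather than a definitive test.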

"A scanned PDF is a photograph of text. OCR is the process of teaching a computer to read that photograph -- a task that modern neural networks perform with remarkable accuracy on standard printed documents."

When to Use PDF OCR

Scanned Physical Documents

Physical documents that exist only on paper -- historical records, old contracts, handwritten notes, paper invoices, physical forms, printed reports -- must be scanned to create a digital copy. The scanned result is an image-based PDF with no searchable text. OCR converts these image PDFs into searchable, editable text documents, enabling full-text search, content analysis, database integration and programmatic processing.

Locked or Secured PDFs

Some PDFs have text content but restrict text extraction through PDF security settings (content restrictions, DRM, or encryption flags). When these restrictions prevent copying or selecting text, rendering the PDF pages to images and running OCR on those images provides an alternative path to accessing the text content for legitimate purposes such as accessibility, translation and archival.

Legacy Document Digitisation

Organisations digitising large archives of paper records -- legal files, patient records, HR documents, engineering drawings, government correspondence -- use OCR as the critical step that transforms scanned image PDFs into searchable, indexable content that can be integrated with document management systems (SharePoint, Documentum, OpenText), eDiscovery platforms and enterprise search systems.

Academic and Research Text Extraction

Researchers conducting systematic literature reviews, historians analysing archival documents, linguists building text corpora and data scientists creating training datasets frequently need to extract text from large volumes of scanned PDFs. Running OCR on these documents -- academic papers scanned from print journals, historical newspapers, legal case files, government reports -- produces the machine-readable text required for quantitative analysis, natural language processing and computational research methods.

Understanding Tesseract OCR

Tesseract is the world's most deployed open-source OCR engine, with a history spanning four decades. It was originally developed at HP Labs Bristol between 1985 and 1994, commercialised briefly, then released as open-source software by HP in 2005. Google sponsored its development from 2006, and the project is now maintained by the open-source community. Version 4 (2018) added LSTM (Long Short-Term Memory) neural network-based recognition, which version 5 (2021) refined, dramatically improving accuracy on challenging documents.

Tesseract.js (used by our converter) is a pure JavaScript/WebAssembly port of the Tesseract 5 engine. It compiles the entire Tesseract C++ codebase to WebAssembly, allowing the same OCR model that powers enterprise document processing pipelines to run directly in your browser without any server-side processing. The accuracy of Tesseract.js on clean, well-scanned documents in major languages (English, Western European languages, Chinese, Japanese, Korean, Arabic) typically exceeds 98%.

Factors That Affect OCR Accuracy

Several factors influence how accurately Tesseract can recognise text from your scanned PDF:

  • Scan resolution: The most important factor. Documents scanned at 300 DPI or higher produce significantly better OCR results than 150 DPI scans. Our 3x render scale (approximately 216 DPI from the canvas) produces the best accuracy for challenging documents.
  • Image quality: Clean, high-contrast scans with sharp text dramatically outperform faded, low-contrast or blurry scans. Scanning in greyscale rather than colour can improve recognition of light text on patterned backgrounds.
  • Font type: Printed serif and sans-serif fonts are recognised with very high accuracy. Handwriting, decorative display fonts and scripts used in historical documents produce lower accuracy and may require manual correction after OCR.
  • Page orientation: Tesseract requires pages to be correctly oriented (text reading left to right for Latin scripts, right to left for Arabic/Hebrew). Rotated scans -- a common problem with physical scanning -- significantly reduce accuracy. Deskewing the scan before OCR processing improves results.
  • Language selection: Selecting the correct language for your document is critical. Tesseract uses language-specific character sets, linguistic models and word frequency tables to improve recognition accuracy. Using "English" for a French document will produce recognisable but lower-accuracy results.
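The greyscale point above can also be applied after the fact, to an already rendered page image. The sketch below converts RGBA pixel data, the layout used by a canvas `ImageData` buffer, to greyscale using the common Rec. 601 luma weights. It is an optional preprocessing idea shown for illustration; whether this tool applies any such step is not stated here, and `toGreyscale` is a hypothetical name.

```typescript
// Convert RGBA pixel data to greyscale in place using Rec. 601 luma weights.
// The buffer layout (4 bytes per pixel: R, G, B, A) matches canvas ImageData.
function toGreyscale(rgba: Uint8ClampedArray): Uint8ClampedArray {
  for (let i = 0; i + 3 < rgba.length; i += 4) {
    const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    // Uint8ClampedArray rounds and clamps the value to 0..255 on assignment.
    rgba[i] = luma;
    rgba[i + 1] = luma;
    rgba[i + 2] = luma;
    // The alpha channel at i + 3 is left unchanged.
  }
  return rgba;
}
```

Weighting green most heavily mirrors how the eye perceives brightness, which tends to preserve the contrast between ink and paper better than a plain average of the three channels.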

Tips for Best OCR Results

  • Use 3x scale for difficult documents: Faded ink, small fonts (below 10pt in the original scan), handwriting and low-quality scans all benefit significantly from maximum render resolution. The extra processing time is worth the accuracy improvement.
  • Select the correct language: Each language model downloads from the Tesseract project CDN on first use (typically 3 to 30 MB depending on the language). English is the default and one of the smaller downloads. Chinese, Japanese and Korean language models are larger (20 to 30+ MB) and take longer to download on first use.
  • Use custom range for large documents: OCR is computationally intensive -- processing a 200-page document at 3x scale can take 20 to 40 minutes on a typical laptop. Extract only the pages you need using the custom range option to save significant time.
  • Always proof-read the output: Even at 98% character accuracy, a 500-word page (roughly 2,500 characters) can contain around 50 misrecognised characters. Common OCR errors include confusing lowercase l and the number 1, confusing O and 0, and missing spaces between words. The extracted text typically needs light editing before use in professional contexts.
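Two of the tips above come down to simple arithmetic: how long a batch is likely to take, and how many errors to expect at a given accuracy. The sketch below makes both estimates explicit. The function names are hypothetical, and the figure of about five characters per word is an assumption used only to turn a word count into a character count.

```typescript
// Expected number of misrecognised characters at a given character accuracy (0..1).
function expectedCharErrors(charCount: number, accuracy: number): number {
  return Math.round(charCount * (1 - accuracy));
}

// Rough processing time in minutes for a batch of pages.
function estimateMinutes(pages: number, secondsPerPage: number): number {
  return (pages * secondsPerPage) / 60;
}

// Assumption for illustration: an average of about 5 characters per word.
const ASSUMED_CHARS_PER_WORD = 5;
const charsOn500WordPage = 500 * ASSUMED_CHARS_PER_WORD; // 2500 characters
```

With these assumptions, `expectedCharErrors(charsOn500WordPage, 0.98)` gives 50 characters to correct on a 500-word page, and `estimateMinutes(200, 6)` to `estimateMinutes(200, 12)` gives 20 to 40 minutes for a 200-page document.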
Got Questions?

Frequently Asked Questions

Is my PDF uploaded to your server?
No. PDF.js renders your PDF pages in your browser and Tesseract.js performs OCR recognition entirely inside your browser using WebAssembly. No page images, no PDF content and no extracted text is ever transmitted to any server. Your document never leaves your device.
Why is OCR slow? How long will it take?
OCR is computationally intensive -- Tesseract.js runs a full LSTM neural network on each page image inside your browser using WebAssembly. A typical page takes 3 to 15 seconds depending on your device's CPU speed, the render scale and the language. On the first run, Tesseract also downloads the language data file (3 to 30 MB) which adds additional time. Use custom page range to process only the pages you need rather than the entire document.
Does it work on text-based PDFs as well as scanned ones?
Yes, but for text-based PDFs (where text is already selectable), our PDF to Word converter is faster and more accurate because it reads text directly from the PDF content stream rather than running OCR on a rendered image. Use this OCR tool for scanned PDFs, image-based PDFs and documents where direct text extraction fails or produces garbled results.
How accurate is the OCR?
Tesseract 5 (used by Tesseract.js) achieves over 98% character accuracy on clean, well-scanned documents in major languages at adequate resolution (300 DPI or equivalent). Accuracy decreases for historical typefaces, faded or damaged documents, complex layouts with multiple columns and non-standard fonts, and drops most sharply for handwriting, because Tesseract's models are trained mainly on printed text. Always proof-read OCR output before using it in professional contexts.
What render scale should I use?
2x (144 DPI equivalent) is the recommended default for most standard scanned documents with legible printed text. Use 3x (216 DPI) for documents with small font sizes below 10pt, faded ink, handwriting, or any document where 2x produces unsatisfactory results. Use 1.5x only when processing speed is critical and document quality is high.
Why does my first OCR run take longer?
On the first run for each language, Tesseract.js downloads the language training data from the Tesseract CDN. English is about 4 MB, major Western European languages are 3 to 10 MB each, and Chinese/Japanese/Korean are 20 to 30+ MB. After the first download, the language data is cached in your browser and subsequent runs start immediately without downloading again.
Can I OCR a multi-column document?
Tesseract has automatic page segmentation that attempts to detect columns, headers, footers and paragraphs. It generally handles two-column academic paper layouts well. Complex magazine layouts with many columns, text wrapped around images and tables may produce text that reads across columns rather than down each column. For multi-column documents, manual reordering of the extracted text may be needed.
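When text from a multi-column page does come out reading across the columns, the reordering can sometimes be automated if the OCR result includes a bounding box for each text block. The sketch below assumes equal-width columns and a simplified, hypothetical `TextBlock` shape; it is not part of this tool and would need adjusting for uneven or irregular layouts.

```typescript
// Simplified, hypothetical block shape: text plus the left, right and top edges in pixels.
interface TextBlock {
  text: string;
  x0: number; // left edge
  x1: number; // right edge
  y0: number; // top edge
}

// Reorder blocks so each column is read top to bottom, leftmost column first.
// Assumes the page is split into equal-width columns.
function reorderColumns(
  blocks: TextBlock[],
  pageWidth: number,
  columns: number = 2
): TextBlock[] {
  const columnWidth = pageWidth / columns;
  const columnOf = (b: TextBlock): number => {
    const centre = (b.x0 + b.x1) / 2;
    return Math.min(columns - 1, Math.max(0, Math.floor(centre / columnWidth)));
  };
  return blocks.slice().sort((a, b) => {
    const ca = columnOf(a);
    const cb = columnOf(b);
    return ca !== cb ? ca - cb : a.y0 - b.y0;
  });
}
```

Blocks are assigned to a column by their horizontal centre, so a block that slightly overhangs the gutter still lands in the column that contains most of it.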

Ready to Extract Text from Your PDF?

Drop your scanned PDF above. Tesseract OCR runs entirely in your browser -- free and private forever.

Start OCR Now