Automated Text Extraction from Scanned PDFs with the Mistral OCR API

Apr 11, 2025

4 min read

Automated Text Extraction from Scanned PDFs with the Mistral OCR API — Source: Timon Schneider / Alamy

When working with scanned documents, having direct access to a reliable OCR API can make text extraction faster, more consistent, and easier to integrate into existing workflows. In a previous post, we explored a prompt-based solution using Mistral’s Le Chat interface—which produced clean results but missed certain elements such as footnotes. This post shows how the Mistral OCR API can extract structured text from the same 10–12 page scanned PDF using a few lines of Python code. This approach gives full control over the process, from file handling to saving the final .txt output—and is ideal for anyone looking to automate document processing at scale or with greater precision.

Input file

We used the same input file to compare results with the prompt-based approach: a 10–12 page scanned PDF containing academic content with section headings, body text, and footnotes. This time, however, we processed the file using Mistral’s official OCR API via Python.

Mistral_OCR_sample

Mistral_OCR_sample.pdf

2 MB

The Mistral OCR API does not accept local file uploads directly. Instead, it expects the document to be accessible via a publicly reachable URL, such as one hosted on:

Google Drive (shared with a direct download link),
Dropbox,
Amazon S3,
or any web server providing a .pdf file with the correct MIME type (application/pdf).

For this test, we hosted the file on Google Drive and converted the standard share link into a direct download link using this format:

https://drive.google.com/uc?export=download&id=YOUR_FILE_ID

Installation and Setup

You’ll need an API key that authenticates your requests to use the Mistral OCR API. Here’s how to obtain one:

Create an account on the Mistral platform
Visit https://mistral.ai and sign up using your email address. You’ll need to verify your email before proceeding.
Access the API key section
Once logged in, navigate to your dashboard and look for the API keys section. This is usually under “Settings” or “Developers”.
Generate a new key
Click on “Generate API Key”, and a new private key will be created. You can copy it directly from the interface.
Use it in your code
We recommend storing your API key securely using environment variables (as shown in the code example). Avoid hardcoding it directly into scripts, especially if you share code publicly.

The following script shows the complete workflow for processing a scanned PDF using the Mistral OCR API. It includes package installation, client initialisation, file handling, OCR processing, and saving the result to a plain text file.

The full OCR script shown above is available for download below.

Mistral_OCR_script

Mistral_OCR_script.ipynb

5 KB

To use it, you only need to:

Paste your Mistral API key in the appropriate line, and
Replace the document URL with the direct link to the scanned PDF you wish to process.

Once configured, the script will extract the text from your file and save it as a .txt document with structured output.

Output

The output produced by the Mistral OCR API was accurate and structurally consistent. The main body of the scanned text was successfully extracted, and the overall layout of the document was well preserved. Section headers such as # CHAPTER 2 and ## NORMATIVE METHODOLOGY were clearly identified, while paragraph breaks and indentation were maintained in a readable format.

ocr_output

ocr_output.txt

37 KB

Notably, the API handled footnotes and inline references more effectively than the GenAI-based extraction we previously tested in Mistral Le Chat. Superscript markers (e.g. ${ }^{1}$) were retained, and the corresponding notes appeared in the appropriate locations—addressing a key limitation observed in the earlier prompt-driven approach. The markdown-style formatting used in the output is useful for further processing or conversion to other formats such as HTML, PDF, or structured data.

Recommendation

For projects that require accurate and complete extraction of structured text from scanned documents, the Mistral OCR API offers a clear advantage. Unlike the GenAI prompt-based method tested in Mistral Le Chat—which is fast and accessible but may miss secondary elements such as footnotes—the API consistently preserved both the main body of the text and supporting content like annotations and references.

It also provides greater control over file handling, output format, and integration into automated workflows. While the initial setup requires an API key and a publicly accessible file URL, the results justify the extra effort for users who prioritise completeness and consistency in their OCR processes. In short, if your goal is to capture the full content of a scanned PDF—including fine details such as footnotes—the Mistral OCR API is a robust and reliable choice.

The authors used Mistral Le Chat [Mistral (2025) Mistral Le Chat (accessed on 11 April 2025), Large language model (LLM), available at: https://www.mistral.ai] to generate the output.