Your first PDF → Markdown pipeline
By the end of this tutorial you will have installed turboocr, pointed it
at a running TurboOCR server, OCR'd a sample image, converted a real PDF
invoice to Markdown, and saved the result to disk. It takes about ten
minutes if the server is already running.
What you need
- Python 3.12 or newer.
- A running TurboOCR server. Anywhere reachable over HTTP works; the
  default is http://localhost:8000.
- The bundled acme_invoice.pdf and acme_invoice.png fixtures (clone the
  repo or download them from examples/sample/).
This tutorial assumes the server is configured for Latin OCR, which is the bundled default.
Step 1: Install the package
Installing turboocr pulls in the HTTP client, the CLI, and the searchable-PDF generator.
You do not need the [grpc] extra unless you specifically want gRPC; for
this tutorial, plain HTTP is enough.
Verify the install:
Step 2: Connect to the server
Create a Client. With no arguments it reads
TURBO_OCR_BASE_URL from the environment, falling back to
http://localhost:8000:
If your server lives somewhere else, pass it explicitly:
Quick sanity check — ask the server if it is healthy:
A True and a 200 mean you are ready to OCR something.
Step 3: OCR your first image
Point the client at the sample PNG. The default response gives you a flat list of text items, each with a bounding box and a confidence score:
from pathlib import Path
from turboocr import Client
IMAGE = Path("examples/sample/acme_invoice.png")
client = Client()
response = client.recognize_image(IMAGE)
print(f"recognized {len(response.results)} text items")
for item in response.results[:3]:
    print(f"  {item.text!r} (conf={item.confidence:.2f})")
You should see roughly 71 items, starting with 'ACME Corporation'.
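Because every item carries a confidence score, a common next step is to drop weak detections before using the text. The sketch below uses a stand-in `TextItem` dataclass with only the two attributes shown above, so it runs without a server. The class, the 0.80 threshold, and the sample values are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:
    """Stand-in for one recognized item; not the SDK's own class."""
    text: str
    confidence: float


def confident_items(items, threshold=0.80):
    """Keep items whose confidence is at or above the threshold."""
    return [item for item in items if item.confidence >= threshold]


# Invented sample data, including one low-confidence misread.
sample = [
    TextItem("ACME Corporation", 0.99),
    TextItem("lnvoice #", 0.55),
    TextItem("Total", 0.93),
]
kept = confident_items(sample)
print([item.text for item in kept])  # ['ACME Corporation', 'Total']
```

The same list comprehension should work on `response.results`, since it only reads `.confidence`.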
Step 4: Move up to a PDF
PDFs are multi-page, so the response shape is slightly different — you get
a PdfResponse with a .pages list. Pass
layout=True, reading_order=True, and include_blocks=True so the
server returns the geometric structure we will need in the next step:
PDF = Path("examples/sample/acme_invoice.pdf")
response = client.recognize_pdf(
    PDF,
    dpi=150,
    layout=True,
    reading_order=True,
    include_blocks=True,
)
print(f"pages={len(response.pages)}")
On a modern GPU deployment, a 2-page invoice comes back in well under a second, since the server sustains roughly 200 image-pages per second. The SDK's default 30-second timeout is plenty; raise it only for very large PDFs running to hundreds of pages.
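To decide whether a given document needs a longer timeout, a back-of-the-envelope estimate is usually enough. The sketch below divides the page count by the throughput figure quoted above and pads the result. The helper name and the safety factor of 10 are arbitrary assumptions, and the estimate ignores upload time and PDF rasterization.

```python
def suggested_timeout(
    pages: int,
    pages_per_second: float = 200.0,
    safety_factor: float = 10.0,
    floor: float = 30.0,
) -> float:
    """Estimate a request timeout in seconds, never below the 30-second default."""
    if pages < 0 or pages_per_second <= 0:
        raise ValueError("pages must be >= 0 and pages_per_second must be > 0")
    estimate = pages / pages_per_second  # ideal processing time in seconds
    return max(floor, estimate * safety_factor)


print(suggested_timeout(2))    # 30.0: the default already covers a 2-page invoice
print(suggested_timeout(800))  # 40.0: 4 s of processing, padded tenfold
```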
Step 5: Render to Markdown
render_to_markdown walks the reading-order
blocks and turns them into a MarkdownDocument:
from turboocr import render_to_markdown
doc = render_to_markdown(response)
print(f"chars={len(doc.markdown)}")
print(doc.markdown[:200])
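Conceptually, rendering means sorting blocks into reading order and emitting one Markdown element per block. The toy version below illustrates that idea only; it is not how `render_to_markdown` is implemented. The `Block` dataclass, its field names, and the two block kinds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a layout block; not the SDK's own type."""
    order: int  # position in reading order
    kind: str   # "heading" or "paragraph" in this sketch
    text: str


def blocks_to_markdown(blocks):
    """Emit blocks in reading order, separated by blank lines."""
    parts = []
    for block in sorted(blocks, key=lambda b: b.order):
        if block.kind == "heading":
            parts.append(f"# {block.text}")
        else:
            parts.append(block.text)
    return "\n\n".join(parts) + "\n"


# Blocks arrive out of order here to show that sorting restores the reading order.
page = [
    Block(1, "paragraph", "Invoice for services rendered."),
    Block(0, "heading", "ACME Corporation"),
]
print(blocks_to_markdown(page))
```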
Step 6: Save it
MarkdownDocument.markdown is a plain string, so writing it to disk is
one line:
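A small wrapper around `Path.write_text` is one way to do it. In this sketch, `save_markdown` is a hypothetical helper; the single `write_text` call is the part that matters, and creating missing parent folders is an added convenience.

```python
from pathlib import Path


def save_markdown(markdown: str, destination: Path) -> Path:
    """Write Markdown text to disk as UTF-8, creating parent folders if needed."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_text(markdown, encoding="utf-8")
    return destination
```

With the document from step 5 this becomes `save_markdown(doc.markdown, Path("acme_invoice.md"))`, where the output filename is only an example. Passing the encoding explicitly avoids platform-dependent defaults when the invoice contains non-ASCII characters.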
Open the file in any Markdown viewer — headings, paragraphs, and tables all come through as readable Markdown.
Where to go next
- To adjust how failed requests are retried, see Configure retries.
- To scale this from one file to a whole folder, see the Folder pipeline recipe.
- For background on what blocks and reading_order actually mean, see
  Layout & reading order.