How to extract data from PDFs into a spreadsheet - Data Research Analysis Collection

TL;DR

You can pull structured data out of a PDF and into a spreadsheet using free tools like Tabula, Python’s pdfplumber library, or Adobe Acrobat’s export feature, depending on whether the PDF is text-based or scanned. A simple single-table PDF takes under ten minutes. A scanned or multi-page report with messy formatting takes an hour or more and usually requires an OCR step first.

What You Need Before You Start

The PDF file you want to extract from (download a sample invoice or financial report if you’re practicing)
Python 3.10 or later installed locally, with pip available
A spreadsheet app: Excel 365, Google Sheets, or LibreOffice Calc
Java 8 or later if you plan to use Tabula’s desktop app (check with java -version)
An Adobe Acrobat subscription if you want the no-code export route (the free Reader does not export)
pip packages: pdfplumber, pandas, camelot-py[cv], and openpyxl for the Python path
For scanned PDFs: Tesseract OCR installed on your system (brew install tesseract on Mac, sudo apt install tesseract-ocr on Ubuntu)
Optional: a Google Cloud account with the Vision API enabled if you want cloud-based OCR on high-volume batches

Step 1: Identify Whether Your PDF Is Text-Based or Scanned

This step determines every tool choice that follows, so don’t skip it.

Open your PDF in any PDF viewer and try to click and highlight a word. If the text highlights and you can copy it, you have a text-based (or “native”) PDF. The characters are stored as actual text in the file. If nothing highlights, or if you copy garbled characters, you have a scanned image masquerading as a PDF.

A quick command-line check works too. Run:

pdfinfo your-file.pdf | grep "Pages"
python -c "import pdfplumber; p = pdfplumber.open('your-file.pdf'); print(p.pages[0].extract_text()[:200])"

If the second command returns readable text, you are dealing with a native PDF and can skip the OCR steps. If it returns None or empty string, you need OCR.

You should now see either readable text in your terminal output or a blank result, giving you a clear signal on which path to take.

Step 2: Use Copy-Paste for One-Off Simple Tables

If you only need to do this once and the PDF is text-based with a single clean table, the fastest route is manual copy-paste into Google Sheets.

Open the PDF in your browser (Chrome or Firefox). Select all the table text with your cursor. Copy it. Go to Google Sheets and paste it into cell A1. Google Sheets often recognises columns separated by spaces or tabs and splits them automatically.

For Excel, use Data > From PDF (Excel 365 on Windows). Navigate to your file, and Excel’s Power Query interface will detect tables. Select the table preview it shows and click Load. This works surprisingly well on clean financial statements and invoices.

The limitation is obvious: if you have 50 PDFs or if the table spans multiple pages, doing this by hand is not realistic. But for a quick one-off, you do not need to write a single line of code.

You should now see your data sitting in rows and columns in your spreadsheet, ready to clean.

Step 3: Extract Tables from Text-Based PDFs with Tabula

Tabula is a free, open-source tool built specifically for pulling tables out of text-based PDFs. It has both a desktop GUI and a Python wrapper called tabula-py.

Install the Python wrapper:

pip install tabula-py

Then run a basic extraction:

import tabula
import pandas as pd

# Extract all tables from all pages
tables = tabula.read_pdf("your-file.pdf", pages="all", multiple_tables=True)

# tables is a list of DataFrames
for i, df in enumerate(tables):
    print(f"Table {i}: {df.shape}")
    df.to_csv(f"table_{i}.csv", index=False)

Tabula works best when your table has clear borders or consistent column alignment. If you have a multi-page table that flows across pages, add lattice=True for bordered tables or stream=True for borderless ones.

You should now see CSV files in your working directory, one per detected table, that open cleanly in Excel or Google Sheets.

Step 4: Use pdfplumber for More Control Over Text Extraction

When Tabula struggles with irregular layouts, pdfplumber gives you finer control. It lets you define a bounding box around specific parts of the page, which is useful for PDFs where the table is always in the same position.

import pdfplumber
import pandas as pd

rows = []

with pdfplumber.open("your-file.pdf") as pdf:
    for page in pdf.pages:
        # Crop to just the table area (x0, top, x1, bottom in points)
        cropped = page.crop((30, 100, 580, 700))
        table = cropped.extract_table()
        if table:
            rows.extend(table)

df = pd.DataFrame(rows[1:], columns=rows[0])
df.to_excel("output.xlsx", index=False)

The extract_table() method uses whitespace and font changes to detect cell boundaries. For tables without visible lines, this often produces cleaner output than Tabula’s stream mode.

The bounding box values are in PDF points (1 point = 1/72 inch). Use Tabula’s desktop GUI to find the correct coordinates visually if you are not sure.

You should now see an Excel file with your table data, with the header row correctly identified.

Step 5: Run OCR on Scanned PDFs with Tesseract

If your PDF contains scanned images, you need to convert pages to images first and then run optical character recognition. Tesseract OCR is the standard open-source engine.

Install dependencies:

pip install pdf2image pytesseract pillow pandas

Then extract text page by page:

from pdf2image import convert_from_path
import pytesseract
import pandas as pd

pages = convert_from_path("scanned-report.pdf", dpi=300)
all_text = []

for i, page_img in enumerate(pages):
    text = pytesseract.image_to_string(page_img, lang="eng")
    all_text.append({"page": i + 1, "text": text})

df = pd.DataFrame(all_text)
df.to_csv("raw_ocr_output.csv", index=False)

Set DPI to at least 300 for reliable character recognition. Lower resolution produces more errors, especially with small fonts or low-contrast scans. After this step, you will have raw text that still needs parsing into columns.

You should now see a CSV with one row per page and the extracted text in the second column.

Step 6: Parse OCR Text into Structured Columns

Raw OCR output is a wall of text. You need to split it into rows and columns using patterns in the data. Regular expressions are your main tool here.

For example, if each line of your invoice contains an item, quantity, and price in a predictable format:

import re
import pandas as pd

raw_text = open("raw_ocr_output.csv").read()

pattern = r"([A-Za-z ]+)\s+(\d+)\s+\$?([\d,.]+)"
matches = re.findall(pattern, raw_text)

df = pd.DataFrame(matches, columns=["item", "quantity", "price"])
df["price"] = df["price"].str.replace(",", "").astype(float)
df.to_excel("structured_invoice.xlsx", index=False)

If the structure is too inconsistent for regex, consider using an LLM API (GPT-4o or Claude) to parse each page’s text into JSON. This is more expensive but handles messy real-world layouts better. Check out our guide on automating document parsing with Python and LLMs for a full walkthrough.

You should now see a spreadsheet with clean item, quantity, and price columns instead of a blob of text.

Step 7: Clean and Validate the Data in Pandas Before Exporting

Extracted data is almost never clean straight out of the PDF. Column headers come through with extra spaces. Numbers arrive as strings. Rows that were page headers get mixed into the data.

Run a quick cleaning pass:

import pandas as pd

df = pd.read_csv("table_0.csv")

# Strip whitespace from column names
df.columns = df.columns.str.strip()

# Drop rows where all values are NaN (blank separator rows)
df.dropna(how="all", inplace=True)

# Drop rows that are repeated header rows
df = df[df["Item"] != "Item"]

# Convert numeric columns
df["Amount"] = pd.to_numeric(df["Amount"].str.replace("[$,]", "", regex=True), errors="coerce")

# Sanity check
print(df.dtypes)
print(df.head())

df.to_excel("clean_output.xlsx", index=False, engine="openpyxl")

Always validate totals where you can. If your PDF shows a grand total, check that df["Amount"].sum() matches it. A mismatch usually means a row got dropped or a value parsed incorrectly.

You should now see a clean DataFrame with correct data types and no ghost rows, ready to export.

Step 8: Automate the Pipeline for Recurring PDFs

If the same type of PDF arrives weekly or monthly, wrap the whole thing in a script that processes an entire folder in one run.

import pdfplumber
import pandas as pd
from pathlib import Path

INPUT_DIR = Path("pdfs/")
OUTPUT_DIR = Path("outputs/")
OUTPUT_DIR.mkdir(exist_ok=True)

all_frames = []

for pdf_path in INPUT_DIR.glob("*.pdf"):
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table()
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                df["source_file"] = pdf_path.name
                all_frames.append(df)

combined = pd.concat(all_frames, ignore_index=True)
combined.to_excel(OUTPUT_DIR / "combined_output.xlsx", index=False)
print(f"Processed {len(all_frames)} tables from {len(list(INPUT_DIR.glob('*.pdf')))} files.")

Schedule this with a cron job or a task scheduler. Drop new PDFs into the input folder and the output file updates automatically. For a more robust setup with error logging and email alerts, see our article on building automated data pipelines for analysts.

You should now see a single combined Excel file with all tables merged and a source_file column tracking which PDF each row came from.

Common Mistakes To Avoid

Assuming all PDFs are the same type. A batch of 200 invoices might contain 180 native PDFs and 20 scans. Always check before running one tool across the whole batch.
Skipping the DPI setting for OCR. Tesseract at 72 DPI (screen resolution) produces roughly 70-80% accuracy. At 300 DPI it jumps to 95%+. Always convert at 300 DPI minimum.
Trusting the column count from Tabula without checking. Merged cells and multi-line headers confuse table detection. Always print df.shape and spot-check three or four rows manually.
Not stripping whitespace from column names. A column named " Amount" and one named "Amount" look identical in Excel but break any code that references them by name.
Processing the whole file when you only need one section. If you need data from page 3 of a 400-page annual report, crop to that page first with pdf.pages[2]. Extracting everything and filtering later wastes time and memory.
Using Adobe Acrobat export for scanned PDFs. Acrobat’s export works well for native PDFs but produces inconsistent results on scans unless you have the full Document Cloud OCR pipeline enabled. Know which tool fits which file type.

When To Level Up

The manual and Python approaches in this guide work well for structured PDFs with predictable layouts and volumes under a few hundred files per month. Past that point, the cracks show up fast.

Irregular layouts break regex patterns. Handwritten fields defeat Tesseract. PDFs that encrypt content or use non-standard fonts cause extraction to silently fail. When you hit any of these walls, a dedicated document extraction platform is worth the cost.

Tools like Docsumo, Rossum, or Amazon Textract handle scanned documents, handwriting, and multi-format layouts with trained models. They also give you confidence scores per field so you know which extractions need human review. If your workflow involves high-value data like financial statements or legal documents where errors have real consequences, the accuracy guarantees justify the subscription.

Browse purpose-built extraction and automation tools at /category/automation/ to compare options based on your volume and accuracy requirements.

Frequently Asked Questions

Can I extract data from a password-protected PDF?
If you know the password, pass it to pdfplumber with pdfplumber.open("file.pdf", password="yourpassword"). If you do not have the password, extraction is not possible without the owner’s permission.

Why does my extracted table have random extra columns?
PDF tables do not have a strict grid structure the way HTML tables do. Extra columns usually come from whitespace in the original that the parser treats as a column separator. Try using Tabula with lattice=True if the table has visible borders, or adjust the column separator threshold in pdfplumber.

Does Google Sheets have a built-in PDF import?
Not directly. You can use Google Drive’s built-in OCR by uploading the PDF and opening it with Google Docs, then copying the text. It works for simple layouts but produces messy results on complex tables.

What is the difference between Tabula and Camelot?
Both extract tables from text-based PDFs. Camelot tends to handle tables with no visible borders better and gives you accuracy scores for each detected table. Tabula is easier to get started with and has a GUI. For production pipelines, Camelot’s reporting is more useful for catching bad extractions.

Is there a free cloud option for scanned PDFs?
Google Cloud Vision API has a free tier of 1,000 pages per month. Google Cloud Vision handles multi-language documents and returns bounding box coordinates for each word, which makes it easier to reconstruct table structure from scanned images.

Bottom Line

Extracting data from PDFs into a spreadsheet comes down to two questions: is the PDF text-based or scanned, and is this a one-off or a recurring job. For text-based single files, Tabula or Excel’s built-in Power Query import gets you there in minutes. For scanned files, Tesseract plus pdfplumber gives you a reliable Python pipeline you can run on hundreds of files. For messy real-world layouts that break all the above, a regex or LLM parsing step bridges the gap between raw OCR text and clean columns. The cleaning step in pandas is non-negotiable regardless of which extraction method you use. Start with the simplest tool that works for your specific file, and layer in complexity only when you hit a real obstacle. For the next step in building a fully automated document processing workflow, head to /category/automation/ to explore tools that handle the full pipeline from ingestion to structured output.