
A Python library and CLI tool to convert PDF files to CSV files.


PDF to CSV Converter


This project converts tables in PDF files to CSV or XLSX format using the Docling library. It extracts each table from the PDF and saves it as a CSV or XLSX file, optionally reversing text for right-to-left languages.

How It Works

  1. PDF Input: Provide the path to the PDF file you want to convert.
  2. Table Extraction: The tool uses Docling's DocumentConverter to extract tables from the PDF.
  3. DataFrame Conversion: Each extracted table is converted into a pandas DataFrame.
  4. Optional Text Reversal: If the rtl option is enabled (--rtl on the command line, rtl=True in Python), the text in each DataFrame is reversed to support right-to-left languages.
  5. CSV/XLSX Output: The DataFrames are saved as CSV or XLSX files in the specified output directory.
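
Steps 4 and 5 can be illustrated without Docling or pandas. The sketch below is not the package's implementation: it assumes that reversal means reversing the characters of each string cell, and it uses plain lists of rows in place of DataFrames. The names reverse_cells and to_csv_text are hypothetical.

```python
import csv
import io


def reverse_cells(rows):
    """Return a copy of the table with every string cell reversed character by character."""
    return [
        [cell[::-1] if isinstance(cell, str) else cell for cell in row]
        for row in rows
    ]


def to_csv_text(rows):
    """Serialize the rows as CSV text (step 5, minus writing the file to disk)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical extracted table whose text came out in reversed order.
# ASCII stand-ins are used here instead of real right-to-left text.
table = [["eman", "ega"], ["ecilA", 30]]

fixed = reverse_cells(table)
print(to_csv_text(fixed))
```

Slicing with [::-1] reverses code points one by one, so text that contains combining marks may need more careful handling.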

Dependencies

This project depends heavily on the Docling library for PDF table extraction.

CLI Usage

You can use the CLI tool to convert PDF files to CSV or XLSX:

pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose

Example:

pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose

With uvx

You can also run the CLI tool with uvx, without installing it first:

uvx pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --output-format <csv|xlsx> --rtl --verbose

Example:

uvx pdf2csv convert-cli example.pdf --output-dir ./output --output-format xlsx --rtl --verbose
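
To script conversions from Python, the command lines above can be assembled as argument lists. This is a minimal sketch that uses only the flags shown in this section; build_command is a hypothetical helper, not part of pdf2csv, and its default values are choices made for the example.

```python
import shlex


def build_command(pdf_path, output_dir, output_format="csv",
                  rtl=False, verbose=False, use_uvx=False):
    """Assemble the argument list for `pdf2csv convert-cli` (hypothetical helper)."""
    if output_format not in ("csv", "xlsx"):
        raise ValueError(f"unsupported output format: {output_format!r}")
    # Prefix with uvx when the tool should be run through uvx.
    command = ["uvx"] if use_uvx else []
    command += [
        "pdf2csv", "convert-cli", pdf_path,
        "--output-dir", output_dir,
        "--output-format", output_format,
    ]
    if rtl:
        command.append("--rtl")
    if verbose:
        command.append("--verbose")
    return command


command = build_command("example.pdf", "./output", "xlsx", rtl=True, verbose=True)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where pdf2csv (or uvx) is installed.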

Python Usage

You can also use the converter directly in your Python code:

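from pdf2csv.converter import convert

pdf_path = "example.pdf"
output_dir = "./output"   # directory where the output files are saved
rtl = True                # reverse text for right-to-left languages
output_format = "xlsx"    # "csv" or "xlsx"

# Each item in dfs is a pandas DataFrame holding one extracted table.
dfs = convert(pdf_path, output_dir=output_dir, rtl=rtl, output_format=output_format)
for df in dfs:
    print(df)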
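
To process several PDFs, the call above can be wrapped in a loop. The sketch below assumes that convert accepts the keyword arguments shown in the example and returns an iterable of tables. convert_directory is a hypothetical helper, and the converter is passed in as a parameter so the helper can be tried out with a stand-in function.

```python
from pathlib import Path


def convert_directory(pdf_dir, output_dir, convert_fn, **options):
    """Run convert_fn on every *.pdf file in pdf_dir and collect the tables per file.

    convert_fn is expected to behave like pdf2csv.converter.convert in the
    example above: it takes a PDF path plus keyword options and returns an
    iterable of tables.
    """
    results = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        tables = convert_fn(str(pdf_path), output_dir=str(output_dir), **options)
        results[pdf_path.name] = list(tables)
    return results


# Possible usage with the real converter (requires pdf2csv to be installed):
#
#   from pdf2csv.converter import convert
#   tables = convert_directory("./pdfs", "./output", convert,
#                              rtl=True, output_format="xlsx")
```

The returned dictionary maps each PDF file name to the list of tables extracted from it.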

TODO:

  • Convert datatype to numeric
  • Support for XLSX output
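
The first item could be approached in several ways; for DataFrames, pandas.to_numeric is the usual tool. As a dependency-free illustration, the sketch below converts individual cell values. It is not part of pdf2csv: to_number is a hypothetical helper, and treating commas as thousands separators is an assumption that does not hold in every locale.

```python
import re

_INT = re.compile(r"[+-]?\d+")
_FLOAT = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def to_number(cell):
    """Return cell as an int or float when it looks like one, otherwise unchanged."""
    if not isinstance(cell, str):
        return cell
    # Assumption: commas are thousands separators, as in "1,234".
    text = cell.strip().replace(",", "")
    if _INT.fullmatch(text):
        return int(text)
    if _FLOAT.fullmatch(text):
        return float(text)
    return cell


print([to_number(c) for c in ["1,234", "3.50", "N/A"]])  # [1234, 3.5, 'N/A']
```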

