A Python library and CLI tool to convert tables in PDF files to CSV files.

PDF to CSV Converter

This project provides a tool to convert tables from PDF files into CSV format using the Docling library. It extracts tables from PDFs and saves them as CSV files, optionally reversing text for right-to-left languages.

How It Works

  1. PDF Input: Provide the path to the PDF file you want to convert.
  2. Table Extraction: The tool uses Docling's DocumentConverter to extract tables from the PDF.
  3. DataFrame Conversion: Each extracted table is converted into a pandas DataFrame.
  4. Optional Text Reversal: If the rtl option is enabled (--rtl on the command line, rtl=True in Python), the text in each DataFrame is reversed.
  5. CSV Output: The DataFrames are saved as CSV files in the specified output directory.
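
Steps 4 and 5 can be sketched without Docling or pandas. The example below is a minimal illustration, not the package's implementation. It assumes a table is a plain list of rows, that "reversing" means reversing the characters of each string cell, and that output files are named table_1.csv, table_2.csv, and so on. The helper names reverse_cell, reverse_table, and write_tables are hypothetical.

```python
import csv
from pathlib import Path


def reverse_cell(cell):
    """Reverse the characters of a string cell; leave other values untouched."""
    return cell[::-1] if isinstance(cell, str) else cell


def reverse_table(rows):
    """Return a copy of the table with every string cell reversed."""
    return [[reverse_cell(cell) for cell in row] for row in rows]


def write_tables(tables, output_dir, rtl=False):
    """Write each table to its own CSV file and return the paths written.

    The table_<n>.csv naming scheme is an assumption made for this sketch.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, rows in enumerate(tables, start=1):
        if rtl:
            rows = reverse_table(rows)
        path = out / f"table_{index}.csv"
        # newline="" lets the csv module control line endings itself.
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        paths.append(path)
    return paths
```

Calling write_tables(tables, "./output", rtl=True) would mirror the effect of the rtl option under these assumptions. Plain character reversal is a simplification and may not suit every right-to-left document.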

Dependencies

This project depends heavily on the Docling library for PDF table extraction. Make sure it is installed before running the converter.
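
One way to confirm that Docling is available before running a conversion is to look the module up without importing it. This is a small sketch using only the standard library. The helper name is_installed is hypothetical, and the message text is illustrative.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("docling"):
    print("Docling is not installed; install it before running the converter.")
```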

CLI Usage

You can use the CLI tool to convert PDF files to CSV:

pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --rtl --verbose

Example:

pdf2csv convert-cli example.pdf --output-dir ./output --rtl --verbose

With uvx

You can also run the CLI tool with uvx, which does not require installing pdf2csv first:

uvx pdf2csv convert-cli <pdf_path> --output-dir <output_dir> --rtl --verbose

Example:

uvx pdf2csv convert-cli example.pdf --output-dir ./output --rtl --verbose
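
If you need to call the CLI from a script, the command can be assembled as an argument list. The sketch below uses only the subcommand and flags shown above. It assumes --rtl and --verbose are optional switches, and the helper name build_command is hypothetical.

```python
import shlex


def build_command(pdf_path, output_dir, rtl=False, verbose=False, use_uvx=False):
    """Assemble the pdf2csv command line as a list of arguments."""
    command = ["uvx"] if use_uvx else []
    command += ["pdf2csv", "convert-cli", str(pdf_path), "--output-dir", str(output_dir)]
    if rtl:
        command.append("--rtl")
    if verbose:
        command.append("--verbose")
    return command


command = build_command("example.pdf", "./output", rtl=True, verbose=True)
# Show the command as it would be typed in a shell.
print(shlex.join(command))
# To execute it, pass the list to subprocess.run(command, check=True).
```

Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.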

Python Usage

You can also use the converter directly in your Python code:

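from pdf2csv.converter import convert

pdf_path = "example.pdf"   # PDF file to read
output_dir = "./output"    # directory where the CSV files are written
rtl = True                 # reverse text for right-to-left languages

# Each item in dfs is one extracted table (see "How It Works")
dfs = convert(pdf_path, output_dir=output_dir, rtl=rtl)
for df in dfs:
    print(df)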

TODO:

  • Convert column data types to numeric
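
The numeric conversion item is not implemented yet. As a rough illustration of what it might involve, the sketch below converts string cells to int or float where possible. It works on plain lists rather than DataFrames (with pandas, pandas.to_numeric would be the natural tool), and it assumes that a comma is a thousands separator. The names to_number and convert_rows are hypothetical.

```python
def to_number(cell):
    """Convert a string cell to int or float when possible, else return it unchanged."""
    if not isinstance(cell, str):
        return cell
    # Assumption: "," is a thousands separator, not a decimal mark.
    text = cell.strip().replace(",", "")
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return cell


def convert_rows(rows):
    """Apply to_number to every cell of every row."""
    return [[to_number(cell) for cell in row] for row in rows]
```

Cells that cannot be parsed are returned unchanged, so headers and free text survive the conversion.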

Download files

Source Distribution

pdf2csv-0.1.1.tar.gz (4.9 kB view details)

Uploaded Source

Built Distribution

pdf2csv-0.1.1-py3-none-any.whl (5.3 kB view details)

Uploaded Python 3

File details

Details for the file pdf2csv-0.1.1.tar.gz.

File metadata

  • Download URL: pdf2csv-0.1.1.tar.gz
  • Upload date:
  • Size: 4.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.18

File hashes

Hashes for pdf2csv-0.1.1.tar.gz

Algorithm    Hash digest
SHA256       55cd2a38b40d9e9a23b0a3f849ed2fb46de824edf4d4c3d5dc165c62d2b378ae
MD5          4f49e4570e7cdbc26e3125beebf16b0b
BLAKE2b-256  c44e96ee280fa3a29b5dfbd7e4a511d40d1e35b1ca749a98ce014b66ee3d0f74

File details

Details for the file pdf2csv-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: pdf2csv-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 5.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.4.18

File hashes

Hashes for pdf2csv-0.1.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       3473353ab10f3bfda33ae75b290738402044e8524bbd3fd52f92d59ae884c8b4
MD5          983147f021e678795a454d3581f39e0b
BLAKE2b-256  218e5ba54fe612a27be1dbb2d74f34d036d88c8a1d84ecd19ec4133e363a076d
