Extracts email metadata and text from a PDF file

Project description

pdf2mbox

A command-line utility and Python package for converting PDF emails to MBOX format.

Installation

pip install pdf2mbox

Usage

# from the command line
% python -m pdf2mbox --help
usage: pdf2mbox.py [-h] [--version] [--overwrite] [--csv [CSV]]
                   pdf_file [mbox_file]

Generates an mbox from a PDF containing emails

positional arguments:
  pdf_file         PDF file provided as input
  mbox_file        Mbox file generated as output

optional arguments:
  -h, --help       show this help message and exit
  --version, -v    show program's version number and exit
  --overwrite, -o  overwrite MBOX file if it exists
  --csv [CSV]      generate CSV file output

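# from within Python
from pdf2mbox import pdf2mbox

pdf_file = "emails.pdf"    # placeholder input path
mbox_file = "emails.mbox"  # placeholder output path
pe = pdf2mbox(pdf_file, mbox_file)  # pe contains a dict of emails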
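Once an mbox file has been generated, it can be inspected with Python's standard-library mailbox module. The following is a minimal sketch rather than part of pdf2mbox itself; the helper name summarize_mbox and the path emails.mbox are illustrative assumptions.

```python
import mailbox


def summarize_mbox(mbox_path):
    """Return a list of (From, Subject) pairs, one per message in the mbox.

    summarize_mbox is a hypothetical helper, not a pdf2mbox function.
    """
    # create=False raises mailbox.NoSuchMailboxError instead of silently
    # creating an empty file when the path is wrong.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [(msg.get("From"), msg.get("Subject")) for msg in box]
    finally:
        box.close()


# Example usage (assumes emails.mbox was produced beforehand):
#     for sender, subject in summarize_mbox("emails.mbox"):
#         print(sender, "|", subject)
```

Which headers are present on each message depends on what could be recovered from the PDF, so either value may be None.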

OS Dependencies

If installing pdf2mbox fails, check that the OS-level libraries required by the pdftotext and python-magic packages are installed. pdf2mbox depends on both packages.
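To narrow down which dependency is at fault, one option is to try importing each package on its own. This is a rough diagnostic sketch: check_imports is a hypothetical helper, and the import names pdftotext and magic (the latter for python-magic) are the module names those packages are understood to install.

```python
import importlib


def check_imports(module_names=("pdftotext", "magic")):
    """Map each module name to None on success, or to the error text on failure."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:
            # Usually an ImportError, e.g. when a required shared library
            # is not installed at the OS level.
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


# Example usage:
#     for name, error in check_imports().items():
#         print(name, "OK" if error is None else error)
```

A failure reported here points to the package whose installation notes list the missing system library.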

Notes

  • Assumes an email ends when a new email begins
  • Works best with a standard email header (i.e., From:, To:, Sent:, Subject:)
  • The initial development of this package was funded in part by The Mellon Foundation’s “Email Archives: Building Capacity and Community” program.
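The first two notes can be illustrated with a small sketch of header-based splitting. This is not the package's actual implementation, only an assumed approach: split_emails and HEADER_RE are hypothetical names, and the rule that a line starting with "From:" opens a new email is a simplification.

```python
import re

# Matches the standard header lines mentioned in the notes above.
HEADER_RE = re.compile(r"^(From|To|Sent|Subject):\s*(.*)$")


def split_emails(text):
    """Split extracted text into emails, each with a headers dict and a body."""
    emails = []
    current = None
    in_headers = False
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match and match.group(1) == "From":
            # A new "From:" line ends the previous email and starts the next.
            current = {"headers": {}, "body": []}
            emails.append(current)
            in_headers = True
        if current is None:
            continue  # text before the first header is ignored
        if in_headers and match:
            current["headers"][match.group(1)] = match.group(2)
        else:
            # The first non-header line ends the header block.
            in_headers = False
            current["body"].append(line)
    return [
        {"headers": e["headers"], "body": "\n".join(e["body"]).strip()}
        for e in emails
    ]
```

Under this rule a quoted "From:" line inside a message body would be mistaken for the start of a new email, which shows why input with clean, standard headers tends to convert more reliably.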

Download files


Source Distribution

pdf2mbox-0.3.4.tar.gz (4.0 kB)

Uploaded Source

Built Distribution


pdf2mbox-0.3.4-py3-none-any.whl (4.2 kB)

Uploaded Python 3

File details

Details for the file pdf2mbox-0.3.4.tar.gz.

File metadata

  • Download URL: pdf2mbox-0.3.4.tar.gz
  • Size: 4.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for pdf2mbox-0.3.4.tar.gz
Algorithm    Hash digest
SHA256       3a1696912a8678cea336f72b10c2721d207a2b98d7c4f4ac6649c9ba57749f32
MD5          6f7fc0d42c4e516ad053ced081c49718
BLAKE2b-256  076b49b3ee4e5eee49d56879c6e7687ca9d1f650891aa74e4222ebed9678f9f4


File details

Details for the file pdf2mbox-0.3.4-py3-none-any.whl.

File metadata

  • Download URL: pdf2mbox-0.3.4-py3-none-any.whl
  • Size: 4.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9

File hashes

Hashes for pdf2mbox-0.3.4-py3-none-any.whl
Algorithm    Hash digest
SHA256       d1e998ce2d3e5838531f1853c3718d205abfb33d3120d4fb68ded152ce16fdc8
MD5          037839215dc93a16f0fdaa5cb24b04d0
BLAKE2b-256  b4a5294179e2265fd488c6872a33bdfc3564b26c30fe6012e42516b0d3ec55aa
