parse PDF files to docx

pdf2docx

  • Parse layout (text, image and table) from PDF file with PyMuPDF
  • Generate docx with python-docx
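
The two steps above (parse with PyMuPDF, write with python-docx) are normally driven through the package's `Converter` class. The following is a minimal sketch, assuming `Converter(pdf_file)`, `convert(docx_file, start=..., end=...)` and `close()` behave as described in the project documentation; the helper names `default_docx_path` and `convert_pdf` are hypothetical and exist only for this illustration.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Return the PDF path with its suffix swapped for .docx (hypothetical helper)."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf(pdf_path, docx_path=None, start=0, end=None):
    """Convert a page range of a PDF to docx and return the output path.

    Assumption: `start` and `end` are zero-based page indexes and `end=None`
    means "up to the last page", as in the project documentation.
    """
    # Imported lazily so default_docx_path() works without pdf2docx installed.
    from pdf2docx import Converter

    out = Path(docx_path) if docx_path else default_docx_path(pdf_path)
    cv = Converter(str(pdf_path))
    try:
        cv.convert(str(out), start=start, end=end)
    finally:
        cv.close()  # always release the opened PDF, even if conversion fails
    return out
```

Calling `convert_pdf("report.pdf")` would, under these assumptions, write `report.docx` next to the input file.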

Features

  • Parse and re-create paragraph

    • text in horizontal/vertical direction: from left to right, from bottom to top
    • font style, e.g. font name, size, weight, italic and color
    • text format, e.g. highlight, underline, strike-through
    • text alignment, e.g. left/right/center/justify
    • external hyperlink
    • paragraph layout: horizontal alignment and vertical spacing
    • list style
  • Parse and re-create image

    • in-line image
    • image in Gray/RGB/CMYK mode
    • transparent image
    • floating image, i.e. picture behind text
  • Parse and re-create table

    • border style, e.g. width, color
    • shading style, i.e. background color
    • merged cells
    • vertical direction cell
    • table with partly hidden borders
    • nested tables
  • Parsing pages with multi-processing
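
Multi-process parsing is switched on per conversion. The sketch below assumes that `convert()` accepts the keyword arguments `multi_processing` and `cpu_count`, as shown in the project documentation; check this against the installed version. `pick_cpu_count` and `convert_parallel` are hypothetical names introduced for this example.

```python
import os


def pick_cpu_count(requested=None):
    """Clamp a requested worker count to the range [1, available CPUs]."""
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


def convert_parallel(pdf_path, docx_path, cpu_count=None):
    """Convert a PDF using several worker processes (assumed keyword arguments)."""
    # Imported lazily so pick_cpu_count() works without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        cv.convert(
            docx_path,
            multi_processing=True,                # assumption: documented option
            cpu_count=pick_cpu_count(cpu_count),  # assumption: documented option
        )
    finally:
        cv.close()
```

On platforms that spawn worker processes (for example Windows), call `convert_parallel` from under an `if __name__ == "__main__":` guard, as with any Python multiprocessing code.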

It can also be used as a tool to extract table contents, since both the content and the format/style of each table are parsed.
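
As an illustration of table extraction, the sketch below assumes that `Converter.extract_tables()` returns a list of tables, each a list of rows of cell text, with `None` standing in for cells covered by a merge. That return shape is taken from the project documentation and should be verified for the installed version. `table_to_csv` and `extract_tables_as_csv` are hypothetical helpers.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows of cell text) to a CSV string.

    Cells that are None (assumed to mark merged cells) become empty fields.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else str(cell).strip() for cell in row])
    return buf.getvalue()


def extract_tables_as_csv(pdf_path, start=0, end=None):
    """Return one CSV string per table found in the given page range."""
    # Imported lazily so table_to_csv() works without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        tables = cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
    return [table_to_csv(t) for t in tables]
```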

Limitations

  • Text-based PDF file only
  • Normal reading direction only
    • horizontal/vertical paragraph/line/word
    • no word transformation, e.g. rotation
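
Because only text-based PDFs are supported, it can help to check for an extractable text layer before converting. This is a rough sketch, not part of pdf2docx: it assumes PyMuPDF is importable as `fitz` and that `page.get_text()` returns the page's plain text (older PyMuPDF releases name this method differently). `pages_without_text` and `read_page_texts` are hypothetical helpers.

```python
def pages_without_text(page_texts):
    """Return zero-based indexes of pages whose extracted text is empty or blank."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]


def read_page_texts(pdf_path):
    """Extract plain text per page with PyMuPDF (assumed API: fitz.open, get_text)."""
    # Imported lazily so pages_without_text() works without PyMuPDF installed.
    import fitz

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Pages reported by `pages_without_text(read_page_texts("input.pdf"))` are likely scanned images and would need OCR before conversion.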

Documentation

Sample

sample_compare.png

Download files

Source Distribution

pdf2docx-0.5.1.tar.gz (2.1 MB view details)

Uploaded Source

Built Distribution

pdf2docx-0.5.1-py3-none-any.whl (108.0 kB view details)

Uploaded Python 3

File details

Details for the file pdf2docx-0.5.1.tar.gz.

File metadata

  • Download URL: pdf2docx-0.5.1.tar.gz
  • Size: 2.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.2 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.7

File hashes

Hashes for pdf2docx-0.5.1.tar.gz
Algorithm Hash digest
SHA256 42ab1606e70ca3806166a653a3e3dfb026cdbe74c511c6a5d800dca57024ce18
MD5 18b3f2bf84e45386ed881e22472deb68
BLAKE2b-256 1c250a0e9888495e0b3edca35ed01cfc7cf2b5b603e7bac4a665839f105121e8

File details

Details for the file pdf2docx-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: pdf2docx-0.5.1-py3-none-any.whl
  • Size: 108.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.2 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.7

File hashes

Hashes for pdf2docx-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 dbe1dd2d85f4a526e52abb6b060df8d74778f0569d91e2672115b7e4f79d58ae
MD5 903294a99048313b839c3e5517f527dc
BLAKE2b-256 4333487af588dfcfa3ea5cceee74856d6775ff922b63721e25654c19daa107e8
