Open source Python library for converting PDF to docx.
pdf2docx
- Extract data from PDF with PyMuPDF, e.g. text, images and drawings
- Parse layout with rules, e.g. sections, paragraphs, images and tables
- Generate docx with python-docx
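The three steps above run behind a single call. The following is a minimal sketch, assuming pdf2docx is installed and that its `Converter` class with `convert()` and `close()` behaves as in the project documentation; `default_docx_path` and `convert_pdf` are hypothetical helper names introduced here for illustration.

```python
from pathlib import Path
from typing import Optional


def default_docx_path(pdf_path: str) -> str:
    """Return the input path with its last suffix swapped to .docx."""
    return str(Path(pdf_path).with_suffix(".docx"))


def convert_pdf(pdf_path: str, docx_path: Optional[str] = None,
                start: int = 0, end: Optional[int] = None) -> str:
    """Convert a page range of a PDF to docx and return the output path."""
    # Imported lazily so the path helper above works without pdf2docx.
    from pdf2docx import Converter

    docx_path = docx_path or default_docx_path(pdf_path)
    cv = Converter(pdf_path)
    try:
        # start/end are zero-based page indexes; end=None means up to the last page.
        cv.convert(docx_path, start=start, end=end)
    finally:
        # Always release the underlying PDF handle.
        cv.close()
    return docx_path
```

Calling `convert_pdf("sample.pdf")` would write `sample.docx` next to the input; `sample.pdf` is a placeholder file name.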
Features
- Parse and re-create page layout
  - page margin
  - section and column (1 or 2 columns only)
  - page header and footer [TODO]
- Parse and re-create paragraph
  - OCR text [TODO]
  - text in horizontal/vertical direction: from left to right, from bottom to top
  - font style, e.g. font name, size, weight, italic and color
  - text format, e.g. highlight, underline, strike-through
  - list style [TODO]
  - external hyper link
  - paragraph horizontal alignment (left/right/center/justify) and vertical spacing
- Parse and re-create image
  - in-line image
  - image in Gray/RGB/CMYK mode
  - transparent image
  - floating image, i.e. picture behind text
- Parse and re-create table
  - border style, e.g. width, color
  - shading style, i.e. background color
  - merged cells
  - vertical direction cell
  - table with partly hidden borders
  - nested tables
- Parsing pages with multi-processing
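Multi-processing is switched on through conversion settings. The sketch below assumes the `multi_processing` and `cpu_count` keyword settings described in the project documentation; `pick_cpu_count` and `convert_parallel` are hypothetical helpers, not part of the library.

```python
import os


def pick_cpu_count(requested: int = 0) -> int:
    """Clamp a requested worker count to the available CPUs (0 or less means all)."""
    available = os.cpu_count() or 1
    if requested <= 0:
        return available
    return min(requested, available)


def convert_parallel(pdf_path, docx_path, start=0, end=None, workers=0):
    """Convert a continuous page range using several worker processes."""
    # Imported lazily so pick_cpu_count stays usable without pdf2docx.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        cv.convert(docx_path, start=start, end=end,
                   multi_processing=True,
                   cpu_count=pick_cpu_count(workers))
    finally:
        cv.close()
```

On platforms that spawn worker processes (Windows, and macOS by default), call `convert_parallel` from under an `if __name__ == "__main__":` guard, as with any Python multiprocessing code.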
It can also be used as a tool to extract table contents, since both the content and the format/style of each table are parsed.
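As a rough illustration of table extraction, the sketch below assumes `Converter.extract_tables()` returns a list of tables, each a list of rows of cell values, as shown in the project documentation. `table_to_csv` and `extract_tables` are hypothetical helpers, and treating empty cells as `None` is an assumption.

```python
import csv
import io


def table_to_csv(table) -> str:
    """Serialize one table (a list of rows of cell values) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Assumption: empty cells may come back as None; write them as "".
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return buffer.getvalue()


def extract_tables(pdf_path, start=0, end=None):
    """Return the tables found on the given zero-based page range."""
    # Imported lazily so table_to_csv works without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        return cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
```

For example, `[table_to_csv(t) for t in extract_tables("sample.pdf", end=1)]` would give one CSV string per table on the first page of a placeholder file `sample.pdf`.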
Limitations
- Text-based PDF files only
- Left-to-right languages only
- Normal reading direction only, with no word transformation or rotation
- The rule-based method cannot reproduce every PDF layout exactly
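Because only text-based PDFs are supported, it can help to check for an extractable text layer before converting. This is a heuristic sketch, not something pdf2docx provides: the function names and the `min_chars` threshold are assumptions, and the text is read with PyMuPDF's `page.get_text()`.

```python
def looks_text_based(page_texts, min_chars=20):
    """Return True if any page has at least min_chars non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def pdf_page_texts(pdf_path):
    """Return the plain text of every page in the PDF."""
    # Imported lazily so the heuristic above works without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A scanned PDF typically yields empty strings here, in which case `looks_text_based(pdf_page_texts(path))` returns `False` and the file would need OCR first.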
Documentation
Sample
Source Distribution
pdf2docx-0.5.5.tar.gz (3.1 MB)
Built Distribution
pdf2docx-0.5.5-py3-none-any.whl (148.2 kB)
File details
Details for the file pdf2docx-0.5.5.tar.gz.
File metadata
- Download URL: pdf2docx-0.5.5.tar.gz
- Upload date:
- Size: 3.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 4d99aea565e1b98281a9743c99df120762846f3d6a3c8027d7121dbe12b4bf3e
MD5 | b08cc77c0cc7f410d035e69198ec2df3
BLAKE2b-256 | 39443b84b1ac850ae60e23e646a4d154883e5a105b6e625d24512105e24636dd
File details
Details for the file pdf2docx-0.5.5-py3-none-any.whl.
File metadata
- Download URL: pdf2docx-0.5.5-py3-none-any.whl
- Upload date:
- Size: 148.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6686ec8489f4ed3fd72b3c1545ce407423a21abed9012faf677cd94e68b913ad
MD5 | 0c75577f210c7d3c368d10e62afe0ed7
BLAKE2b-256 | be1c39dd3df2a91c20e6103ce910667439065b8f0116b893cf8aba6f95fed2d9
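A downloaded file can be checked against the SHA256 digests listed above. The sketch below uses only the standard library; the function names are hypothetical.

```python
import hashlib


def sha256_of_chunks(chunks) -> str:
    """Return the hex SHA256 digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def file_chunks(path, chunk_size=1 << 20):
    """Yield the file's contents in chunks so large files are not loaded at once."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            yield chunk


def verify_download(path, expected_sha256: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_chunks(file_chunks(path)) == expected_sha256.strip().lower()
```

For the source distribution, `verify_download("pdf2docx-0.5.5.tar.gz", "4d99aea565e1b98281a9743c99df120762846f3d6a3c8027d7121dbe12b4bf3e")` should return `True` for an intact download.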