parse PDF files to docx
Project description
pdf2docx
- Parse layout (text, image and table) from PDF file with PyMuPDF
- Generate docx with python-docx
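The two steps above are normally driven through a single converter object. The following is a minimal sketch, assuming the `Converter` class with `convert()` and `close()` methods as described in the pdf2docx documentation; the helper names `convert_pdf_to_docx` and `default_docx_path` are hypothetical and exist only for illustration.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Hypothetical helper: put the .docx next to the source PDF."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf_to_docx(pdf_path, docx_path=None, start=0, end=None):
    """Convert pages [start, end) of a text-based PDF and return the output path.

    Assumes pdf2docx exposes Converter(pdf_file).convert(docx_file, start=..., end=...).
    """
    # Imported lazily so this module still loads where pdf2docx is not installed.
    from pdf2docx import Converter

    docx_path = Path(docx_path) if docx_path else default_docx_path(pdf_path)

    cv = Converter(str(pdf_path))
    try:
        # end=None is assumed to mean "up to the last page".
        cv.convert(str(docx_path), start=start, end=end)
    finally:
        # Always release the underlying PDF document.
        cv.close()
    return docx_path
```

A call such as `convert_pdf_to_docx("report.pdf")` would then be expected to write `report.docx` next to the input file.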
Features
- Parse and re-create paragraph
  - text in horizontal/vertical direction: from left to right, from bottom to top
  - font style, e.g. font name, size, weight, italic and color
  - text format, e.g. highlight, underline, strike-through
  - text alignment, e.g. left/right/center/justify
  - external hyperlink
  - paragraph layout: horizontal alignment and vertical spacing
  - list style
- Parse and re-create image
  - in-line image
  - image in Gray/RGB/CMYK mode
  - transparent image
  - floating image, i.e. picture behind text
- Parse and re-create table
  - border style, e.g. width, color
  - shading style, i.e. background color
  - merged cells
  - vertical direction cell
  - table with partly hidden borders
  - nested tables
- Parsing pages with multi-processing
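Multi-processing is switched on per conversion call. The sketch below assumes the `multi_processing` and `cpu_count` keyword arguments of `Converter.convert()` as given in the pdf2docx documentation; check them against the installed version. `pick_cpu_count` and `convert_in_parallel` are hypothetical helpers.

```python
import os


def pick_cpu_count(requested=None):
    """Hypothetical helper: clamp the requested worker count to the machine."""
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


def convert_in_parallel(pdf_path, docx_path, cpu_count=None):
    """Convert all pages with several worker processes.

    Assumes convert() accepts multi_processing=True and cpu_count=<int>.
    """
    # Imported lazily so this module still loads where pdf2docx is not installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        cv.convert(
            docx_path,
            multi_processing=True,
            cpu_count=pick_cpu_count(cpu_count),
        )
    finally:
        cv.close()


# On platforms that spawn worker processes (e.g. Windows), call this from
# under an `if __name__ == "__main__":` guard:
#     convert_in_parallel("input.pdf", "output.docx", cpu_count=4)
```

For short documents the cost of starting worker processes may outweigh the gain, so single-process conversion is a sensible default.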
It can also be used as a tool to extract table contents, since both table content and format/style are parsed.
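Table extraction can be sketched as follows. It assumes `Converter.extract_tables()` as shown in the pdf2docx documentation, returning one list of rows per table with each row a list of cell texts, where an empty or merged cell may be `None`. The CSV helper is a hypothetical addition built on the standard library.

```python
import csv
import io


def table_to_csv(table):
    """Render one extracted table (list of rows of cell texts) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Cells without text are assumed to arrive as None; write them as empty.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def extract_tables_as_csv(pdf_path, start=0, end=None):
    """Return one CSV string per table found in pages [start, end).

    Assumes Converter.extract_tables(start=..., end=...) exists as documented.
    """
    # Imported lazily so this module still loads where pdf2docx is not installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        tables = cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
    return [table_to_csv(table) for table in tables]
```

Only the cell text survives this route; border, shading and merge information is kept only in the docx output.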
Limitations
- Text-based PDF file only
- Normal reading direction only
  - horizontal/vertical paragraph/line/word
  - no word transformation, e.g. rotation
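Because only text-based PDFs are supported, it can help to check for extractable text before converting. The sketch below is a rough heuristic, not part of pdf2docx: it assumes a PyMuPDF version that provides `fitz.open()` and `Page.get_text()`, and the threshold of 20 characters per page is an arbitrary illustrative value.

```python
def looks_text_based(page_texts, min_chars_per_page=20):
    """Heuristic: at least half of the pages carry some extractable text."""
    texts = list(page_texts)
    if not texts:
        return False
    pages_with_text = sum(
        1 for text in texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text * 2 >= len(texts)


def pdf_page_texts(pdf_path):
    """Read the plain text of every page with PyMuPDF.

    Assumes a PyMuPDF release where Page.get_text() is available.
    """
    # Imported lazily so this module still loads where PyMuPDF is not installed.
    import fitz

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A scanned document typically yields empty strings for every page, so `looks_text_based(pdf_page_texts("scan.pdf"))` would be expected to return `False`; such files need OCR before conversion.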
Documentation
Sample
Download files
Source Distribution
pdf2docx-0.5.1.tar.gz (2.1 MB)
Built Distribution
pdf2docx-0.5.1-py3-none-any.whl (108.0 kB)
File details
Details for the file pdf2docx-0.5.1.tar.gz.
File metadata
- Download URL: pdf2docx-0.5.1.tar.gz
- Upload date:
- Size: 2.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.2 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 42ab1606e70ca3806166a653a3e3dfb026cdbe74c511c6a5d800dca57024ce18
MD5 | 18b3f2bf84e45386ed881e22472deb68
BLAKE2b-256 | 1c250a0e9888495e0b3edca35ed01cfc7cf2b5b603e7bac4a665839f105121e8
File details
Details for the file pdf2docx-0.5.1-py3-none-any.whl.
File metadata
- Download URL: pdf2docx-0.5.1-py3-none-any.whl
- Upload date:
- Size: 108.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.2 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | dbe1dd2d85f4a526e52abb6b060df8d74778f0569d91e2672115b7e4f79d58ae
MD5 | 903294a99048313b839c3e5517f527dc
BLAKE2b-256 | 4333487af588dfcfa3ea5cceee74856d6775ff922b63721e25654c19daa107e8
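The published digests can be used to check a downloaded file before installing it. This is a minimal sketch using only the standard library `hashlib`; the helper names are hypothetical, and the expected value is the SHA256 digest listed above for the wheel.

```python
import hashlib

# SHA256 digest listed above for pdf2docx-0.5.1-py3-none-any.whl.
WHEEL_SHA256 = "dbe1dd2d85f4a526e52abb6b060df8d74778f0569d91e2672115b7e4f79d58ae"


def sha256_hex(chunks):
    """Return the hex SHA256 digest of an iterable of bytes chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def file_chunks(path, chunk_size=1 << 20):
    """Yield a file's content in fixed-size chunks to keep memory use flat."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def matches_published_digest(path, expected_hex=WHEEL_SHA256):
    """Compare a local file against a published SHA256 digest."""
    return sha256_hex(file_chunks(path)) == expected_hex.lower()
```

For example, `matches_published_digest("pdf2docx-0.5.1-py3-none-any.whl")` should return `True` for an unmodified download of that wheel.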