parse PDF files to docx
Project description
pdf2docx
- Extract data from PDF with PyMuPDF, e.g. text, images and drawings
- Parse layout with rules, e.g. sections, paragraphs, images and tables
- Generate docx with python-docx
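The three stages above are driven by a single converter object in the upstream pdf2docx project. The following is a minimal sketch, assuming this distribution exposes the same `pdf2docx` module and `Converter` class as upstream; the module name is an assumption for RGpdfconverter, and `convert_pdf` is a hypothetical helper, not part of the library.

```python
from pathlib import Path


def convert_pdf(pdf_path, docx_path, start=0, end=None):
    """Convert pages [start, end) of a PDF file to a docx file."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # Imported lazily; assumes the upstream `pdf2docx` module name.
    from pdf2docx import Converter

    cv = Converter(str(pdf_path))
    try:
        # Extract with PyMuPDF, parse the layout, then write with python-docx.
        cv.convert(str(docx_path), start=start, end=end)
    finally:
        cv.close()
```

Calling `convert_pdf("input.pdf", "output.docx")` would convert every page; `start` and `end` are zero-based page indices.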
Features
- Parse and re-create page layout
  - page margin
  - section and column (1 or 2 columns only)
  - page header and footer [TODO]
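To make the margin and column items concrete, here is a rough sketch of how both could be estimated from text-block bounding boxes. It is not the library's actual rule set; `page_margins` and `count_columns` are hypothetical names, and the centre-line test is a simplifying assumption.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in points


def page_margins(blocks: List[BBox], width: float, height: float) -> Dict[str, float]:
    """Estimate margins as the distance from each page edge to the content."""
    return {
        "left": min(b[0] for b in blocks),
        "top": min(b[1] for b in blocks),
        "right": width - max(b[2] for b in blocks),
        "bottom": height - max(b[3] for b in blocks),
    }


def count_columns(blocks: List[BBox], width: float) -> int:
    """Return 2 if blocks sit on both sides of the centre line and none crosses it."""
    centre = width / 2
    on_left = any(b[2] <= centre for b in blocks)
    on_right = any(b[0] >= centre for b in blocks)
    crossing = any(b[0] < centre < b[2] for b in blocks)
    return 2 if on_left and on_right and not crossing else 1
```

A page with a full-width title above two columns would need to be split into sections first and checked section by section, which this sketch does not attempt.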
- Parse and re-create paragraph
  - OCR text [TODO]
  - text in horizontal/vertical direction: from left to right, from bottom to top
  - font style, e.g. font name, size, weight, italic and color
  - text format, e.g. highlight, underline, strike-through
  - list style [TODO]
  - external hyperlink
  - paragraph horizontal alignment (left/right/center/justify) and vertical spacing
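Horizontal alignment is not stored in a PDF, so it has to be inferred from where each line starts and ends. The sketch below shows one plausible rule, assuming line extents and the enclosing text area are already known; `detect_alignment` and its tolerance are illustrative assumptions, not the library's implementation.

```python
from typing import List, Tuple


def detect_alignment(lines: List[Tuple[float, float]],
                     box: Tuple[float, float],
                     tol: float = 2.0) -> str:
    """Guess alignment from line extents.

    lines: (x0, x1) of each text line; box: (x0, x1) of the enclosing text area.
    """
    left_flush = all(abs(x0 - box[0]) <= tol for x0, _ in lines)
    right_flush = all(abs(x1 - box[1]) <= tol for _, x1 in lines)
    # The last line of a justified paragraph is usually short, so skip it.
    body = lines[:-1] if len(lines) > 1 else lines
    body_right_flush = all(abs(x1 - box[1]) <= tol for _, x1 in body)

    if len(lines) > 1 and left_flush and body_right_flush:
        return "justify"
    if left_flush:
        return "left"
    if right_flush:
        return "right"
    mid = (box[0] + box[1]) / 2
    if all(abs((x0 + x1) / 2 - mid) <= tol for x0, x1 in lines):
        return "center"
    return "left"  # fall back to the most common alignment
```

A single-line paragraph cannot be told apart from a justified one by geometry alone, which is why the sketch only reports "justify" when there are at least two lines.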
- Parse and re-create image
  - in-line image
  - image in Gray/RGB/CMYK mode
  - transparent image
  - floating image, i.e. picture behind text
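Images in Gray or CMYK mode generally need to be brought into RGB before they can be embedded in a docx file. The source does not say how the conversion is done; the sketch below uses the naive per-channel formula and ignores ICC colour profiles, so treat it as an illustration only. Both function names are hypothetical.

```python
from typing import Tuple


def cmyk_to_rgb(c: float, m: float, y: float, k: float) -> Tuple[int, int, int]:
    """Naive CMYK (each 0..1) to 8-bit RGB; ignores ICC colour profiles."""
    return tuple(round(255 * (1 - ch) * (1 - k)) for ch in (c, m, y))


def gray_to_rgb(g: int) -> Tuple[int, int, int]:
    """Expand an 8-bit gray level to an RGB triple."""
    return (g, g, g)
```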
- Parse and re-create table
  - border style, e.g. width, color
  - shading style, i.e. background color
  - merged cells
  - vertical direction cell
  - table with partly hidden borders
  - nested tables
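Merged cells can be recognised from the table borders: where a vertical grid line has no drawn border inside a row, the cells on either side belong together. The sketch below applies that idea to a single row. The function name, the input shapes and the tolerance are assumptions made for illustration, not the library's data model.

```python
from typing import List, Tuple

Segment = Tuple[float, float, float]  # vertical border: (x, top, bottom)


def row_cells(xs: List[float], y0: float, y1: float,
              verticals: List[Segment], tol: float = 1.0) -> List[Tuple[int, int]]:
    """Return (start_column, column_span) for each cell in the row y0..y1.

    xs holds the sorted x positions of all grid lines of the table.
    """
    def has_border(x: float) -> bool:
        # A border counts only if it covers the full height of this row.
        return any(abs(vx - x) <= tol and top <= y0 + tol and bottom >= y1 - tol
                   for vx, top, bottom in verticals)

    cells, start = [], 0
    for i in range(1, len(xs)):
        # Close the current cell at the table edge or at a drawn border.
        if i == len(xs) - 1 or has_border(xs[i]):
            cells.append((start, i - start))
            start = i
    return cells
```

For a grid with lines at x = 0, 50, 100 and 150 where the border at x = 50 stops halfway down the table, the lower row yields a first cell spanning two columns. Row spans could be found the same way using horizontal borders.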
- Parsing pages with multi-processing
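Page-level multi-processing works because pages can be parsed independently of one another. A minimal sketch of the idea with the standard library follows; `split_pages` and `parse_in_parallel` are hypothetical helpers, and `parse_chunk` stands in for whatever per-page parsing routine is used.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, List


def split_pages(n_pages: int, n_workers: int) -> List[range]:
    """Split page indices 0..n_pages-1 into contiguous, near-equal chunks."""
    n_workers = max(1, min(n_workers, n_pages))
    size, extra = divmod(n_pages, n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` chunks take one additional page each.
        stop = start + size + (1 if i < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks


def parse_in_parallel(parse_chunk: Callable[[range], list],
                      n_pages: int, n_workers: int = 4) -> List[list]:
    """Run parse_chunk on each chunk in a worker process, keeping page order."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(parse_chunk, split_pages(n_pages, n_workers)))
```

`parse_chunk` must be a module-level function so that it can be pickled and sent to the worker processes, and on platforms that spawn workers the call to `parse_in_parallel` belongs under an `if __name__ == "__main__":` guard.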
pdf2docx can also be used as a tool to extract table contents, since both table content and format/style are parsed.
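Once table contents have been extracted, they can be passed on to other tools. The sketch below assumes each table arrives as a list of rows, with each row a list of cell strings or `None` for empty cells; that shape matches what upstream pdf2docx's `Converter.extract_tables` is understood to return, but it is an assumption here. `table_to_csv` is a hypothetical helper.

```python
import csv
import io
from typing import List, Optional


def table_to_csv(table: List[List[Optional[str]]]) -> str:
    """Serialise one extracted table (a list of rows) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        # Empty cells become empty strings; surrounding whitespace is dropped.
        writer.writerow(["" if cell is None else str(cell).strip() for cell in row])
    return buf.getvalue()
```

Cells containing commas or quotes are escaped by the `csv` module, so the output can be opened directly in a spreadsheet.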
Limitations
- Text-based PDF files only
- Left-to-right languages only
- Normal reading direction only; no word transformation or rotation
- The rule-based method cannot reproduce every PDF layout exactly
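Because only text-based PDFs are supported, it can help to check a file before converting it. The sketch below is one possible heuristic, assuming PyMuPDF is installed and importable as `fitz`; the function names and thresholds are illustrative guesses, not part of the library.

```python
from typing import List


def looks_text_based(page_texts: List[str], min_chars: int = 20,
                     min_ratio: float = 0.5) -> bool:
    """True if at least min_ratio of the pages carry min_chars of extractable text."""
    if not page_texts:
        return False
    with_text = sum(1 for text in page_texts if len(text.strip()) >= min_chars)
    return with_text / len(page_texts) >= min_ratio


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the plain text of every page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the heuristic works without it

    doc = fitz.open(pdf_path)
    try:
        return [page.get_text() for page in doc]
    finally:
        doc.close()
```

A scanned document typically yields empty strings for every page, so `looks_text_based(extract_page_texts(path))` would return `False` for it.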
Documentation
Sample
Download files
Source Distribution
- RGpdfconverter-0.3.tar.gz (3.1 MB)
Built Distribution
- RGpdfconverter-0.3-py3-none-any.whl (129.1 kB)
Hashes for RGpdfconverter-0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 55500d6cbcc7a42233e85d1145dd51a0faa474b85dee4c7fda03d002ece46edf
MD5 | 9435d1fdc95145bb3077c8a1f6af4665
BLAKE2b-256 | 015431d9bcdb43853c86aeee6fc449e419d9c995f95a4d26e6779f0228889404
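A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("RGpdfconverter-0.3-py3-none-any.whl", "55500d6cbcc7a42233e85d1145dd51a0faa474b85dee4c7fda03d002ece46edf")` should return `True` for an intact download of the wheel.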