parse PDF files to docx
Project description
pdf2docx
- Parse text, table and layout from PDF file with
PyMuPDF - Generate docx with
python-docx
Features
- Parse and re-create text format
- font style, e.g. font name, size, weight, italic and color
- highlight, underline, strike-through converted from docx
- highlight, underline, strike-through applied from PDF annotations
- Parse and re-create list style
- Parse and re-create table
- border style, e.g. width, color
- shading style, i.e. background color
- merged cells
- vertical direction cell
- Rebuild page layout in docx
- text in horizontal direction: from left to right
- text in vertical direction: from bottom to top
- in-line image
- paragraph layout: horizontal and vertical spacing
It can also be used as a tool to extract table contents since both table content and format/style is parsed.
Limitations
- Text-based PDF file only
- Normal reading direction only
- horizontal/vertical paragraph/line/word
- no word transformation, e.g. rotation
- No floating images
- Full borders table only
Installation
From Pypi
$ pip install pdf2docx
From source code
Clone or download this project, and navigate to the root directory:
$ python setup.py install
Or install it in developing mode:
$ python setup.py develop
Uninstall
$ pip uninstall pdf2docx
Usage
By range of pages
$ pdf2docx test.pdf test.docx --start=5 --end=10
By page numbers
$ pdf2docx test.pdf test.docx --pages=5,7,9
$ pdf2docx --help
NAME
pdf2docx - Run the pdf2docx parser.
SYNOPSIS
pdf2docx PDF_FILE DOCX_FILE <flags>
DESCRIPTION
Run the pdf2docx parser.
POSITIONAL ARGUMENTS
PDF_FILE
PDF filename to read from
DOCX_FILE
DOCX filename to write to
FLAGS
--start=START
first page to process, starting from zero
--end=END
last page to process, starting from zero
--pages=PAGES
range of pages
--debug=DEBUG
create illustration pdf showing layouts if True, else do nothing
NOTES
You can also use flags syntax for POSITIONAL ARGUMENTS
As a library
''' With this library installed with
`pip install pdf2docx`, or `python setup.py install`.
'''
from pdf2docx.main import parse
pdf_file = '/path/to/sample.pdf'
docx_file = 'path/to/sample.docx'
# convert pdf to docx
parse(pdf_file, docx_file, start=0, end=1)
Or just to extract tables,
from pdf2docx.main import extract_tables
pdf_file = '/path/to/sample.pdf'
tables = extract_tables(pdf_file, start=0, end=1)
for table in tables:
print(table)
# outputs
...
[['Input ', None, None, None, None, None],
['Description A ', 'mm ', '30.34 ', '35.30 ', '19.30 ', '80.21 '],
['Description B ', '1.00 ', '5.95 ', '6.16 ', '16.48 ', '48.81 '],
['Description C ', '1.00 ', '0.98 ', '0.94 ', '1.03 ', '0.32 '],
['Description D ', 'kg ', '0.84 ', '0.53 ', '0.52 ', '0.33 '],
['Description E ', '1.00 ', '0.15 ', None, None, None],
['Description F ', '1.00 ', '0.86 ', '0.37 ', '0.78 ', '0.01 ']]
Sample
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
pdf2docx-0.3.3.tar.gz
(498.2 kB
view details)
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
pdf2docx-0.3.3-py3-none-any.whl
(70.1 kB
view details)
File details
Details for the file pdf2docx-0.3.3.tar.gz.
File metadata
- Download URL: pdf2docx-0.3.3.tar.gz
- Upload date:
- Size: 498.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.3.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d3642e65e7d8c149c10f3a21058227fe03938b666e200fc6b62ba130948590e2
|
|
| MD5 |
e5795a2a627a37d58335f0922a4b8874
|
|
| BLAKE2b-256 |
4536fe44342cab55e14ab6c2220d3633b3aa28ad17bc50db47ddf285cbabc22e
|
File details
Details for the file pdf2docx-0.3.3-py3-none-any.whl.
File metadata
- Download URL: pdf2docx-0.3.3-py3-none-any.whl
- Upload date:
- Size: 70.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.3.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
54522bdbe1edd5ac9c2e127923659196c283e165f4b3604cab0bb37406f26642
|
|
| MD5 |
b60292ae1a2f706089e05be60c36b329
|
|
| BLAKE2b-256 |
3420797b0fbb8d4fb69a4135e7b0fa5227310828affaf2a486125ef0c1398d1e
|