parse PDF files to docx
Project description
pdf2docx
- Parse text, table and layout from PDF file with PyMuPDF
- Generate docx with python-docx
Features
- Parse and re-create text format
- font style, e.g. font name, size, weight, italic and color
- highlight, underline, strike-through in PDFs converted from docx
- highlight, underline, strike-through applied from PDF annotations
- Parse and re-create list style
- Parse and re-create table
- border style, e.g. width, color
- shading style, i.e. background color
- merged cells
- Rebuild page layout in docx
- paragraph layout: horizontal and vertical spacing
- in-line image
Limitations
- Text-based PDF files only
- Normal reading direction only
- horizontal paragraph/line/word
- no word transformation, e.g. rotation
- No floating images
- Fully bordered tables only
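Since only text-based PDFs are supported, it can help to check for extractable text before converting. The following is a minimal sketch, not part of pdf2docx: the helper names `has_extractable_text` and `pdf_is_text_based` are hypothetical, and it assumes a PyMuPDF release that provides `Page.get_text()` (older releases name it `getText()`). A scanned PDF without an OCR layer typically yields no text at all.

```python
def has_extractable_text(page_texts, min_chars=1):
    """Return True if any page yields at least `min_chars` non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def pdf_is_text_based(pdf_path):
    """Open a PDF with PyMuPDF and check whether any page has a text layer."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    doc = fitz.open(pdf_path)
    try:
        # Assumption: Page.get_text() exists in the installed PyMuPDF version.
        return has_extractable_text(page.get_text() for page in doc)
    finally:
        doc.close()
```

The check is deliberately coarse: it only tells whether any text can be extracted, not whether the layout meets the other limitations listed above.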
Installation
From PyPI
$ pip install pdf2docx
From source code
Clone or download this project, and navigate to the root directory:
$ python setup.py install
Or install it in development mode:
$ python setup.py develop
Uninstall
$ pip uninstall pdf2docx
Usage
By range of pages
$ pdf2docx test.pdf test.docx --start=5 --end=10
By page numbers
$ pdf2docx test.pdf test.docx --pages=5,7,9
$ pdf2docx --help
NAME
pdf2docx - Run the pdf2docx parser
SYNOPSIS
pdf2docx PDF_FILE DOCX_FILE <flags>
DESCRIPTION
Run the pdf2docx parser
POSITIONAL ARGUMENTS
PDF_FILE
PDF filename to read from
DOCX_FILE
DOCX filename to write to
FLAGS
--start=START
first page to process, starting from zero
--end=END
last page to process, starting from zero
--pages=PAGES
range of pages
NOTES
You can also use flags syntax for POSITIONAL ARGUMENTS
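The command line can also be driven from a script. The sketch below only assembles the flags documented in the help text above (`--start`, `--end`, `--pages`); the helper names `build_command` and `convert` are hypothetical. How the tool behaves when `--pages` is combined with `--start`/`--end` is not stated here, so the sketch never sends both.

```python
import subprocess


def build_command(pdf_file, docx_file, start=None, end=None, pages=None):
    """Build the argument list for the pdf2docx command line."""
    cmd = ["pdf2docx", pdf_file, docx_file]
    if pages is not None:
        # Explicit page numbers, e.g. --pages=5,7,9
        cmd.append("--pages=" + ",".join(str(p) for p in pages))
    else:
        # Page range; per the help text, both values count from zero
        if start is not None:
            cmd.append(f"--start={start}")
        if end is not None:
            cmd.append(f"--end={end}")
    return cmd


def convert(pdf_file, docx_file, **selection):
    """Run the converter; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(pdf_file, docx_file, **selection), check=True)
```

For example, `convert('test.pdf', 'test.docx', pages=[5, 7, 9])` would run the same command as the "By page numbers" example.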
As a source package
import os
from pdf2docx.reader import Reader
from pdf2docx.writer import Writer

dir_output = '/path/to/output/dir/'
filename = 'demo-text'
pdf_file = os.path.join(dir_output, f'{filename}.pdf')
docx_file = os.path.join(dir_output, f'{filename}.docx')

pdf = Reader(pdf_file, debug=True) # debug mode to plot layout in new PDF file
docx = Writer()

for page in pdf[0:1]:
    # parse raw layout
    layout = pdf.parse(page)
    # re-create docx page
    docx.make_page(layout)

docx.save(docx_file)
pdf.close()
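The example above builds the PDF and docx paths from one shared file name. When converting several files, the same pairing can be derived from the PDF path itself. This is a small sketch using only the standard library; `docx_path_for` is a hypothetical helper, not part of pdf2docx.

```python
from pathlib import Path


def docx_path_for(pdf_file, output_dir=None):
    """Return the .docx path that mirrors a .pdf path, optionally in another directory."""
    pdf_path = Path(pdf_file)
    # Default to writing next to the source PDF
    target_dir = Path(output_dir) if output_dir is not None else pdf_path.parent
    return target_dir / (pdf_path.stem + ".docx")
```

The result can be converted with `str()` and passed wherever `docx_file` is used in the example above.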
Sample
License