PDF to semantic HTML conversion.

Transcript contains Python programs that transcribe PDFs into semantic HTML.

pdftranscript: Get semantic HTML from PDFs converted by pdf2htmlEX.

pdfttf: Recover lost text from PDFs where TrueType font characters are nothing more than images of themselves.

pdf2html: Batch process a folder full of PDFs ready for pdftranscript.

Read the docstrings for more information.

Example

PDF before and semantic HTML after

Installation

pip install pdftranscript

Install Python along with the latest pdf2htmlEX. On OS X with Homebrew:

brew install python3 pdf2htmlEX

Or on Ubuntu/Debian:

sudo apt update && sudo apt install -y libfontconfig1 libcairo2 libjpeg-turbo8 ttfautohint
wget -O pdf2htmlEX.deb https://github.com/pdf2htmlEX/pdf2htmlEX/releases/download/v0.18.8.rc1/pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb

Check that sha256sum pdf2htmlEX.deb matches 4ef2698cbeb6995189ac..., then install and verify:

sudo apt install ./pdf2htmlEX.deb
pdf2htmlEX -v
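
The checksum step above can also be scripted. The following is a minimal sketch using only the Python standard library; sha256_of and digest_starts_with are hypothetical helper names, not part of this project. Because the digest quoted above is truncated, the sketch compares against a prefix; pass the full published digest when you have it.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_starts_with(path, expected_prefix):
    """True if the file's digest begins with the (possibly truncated) expected value."""
    return sha256_of(path).startswith(expected_prefix.lower())
```

Calling digest_starts_with("pdf2htmlEX.deb", expected) before running apt install gives the same assurance as comparing the sha256sum output by eye.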

A Docker install of pdf2htmlEX is also supported (the Homebrew one has recently started failing). The following image is tested and used in the default config via DOCKER_IMG_TAG:

docker pull pdf2htmlex/pdf2htmlex:0.18.8.rc2-master-20200820-ubuntu-20.04-x86_64
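
For orientation, a Docker-based run of pdf2htmlEX usually mounts the folder holding the PDFs into the container. The sketch below only builds such a command line; it is not necessarily what pdf2html.py executes. The /pdf mount point, the docker_command helper and the choice of options are assumptions, and the image's entrypoint is assumed to be pdf2htmlEX itself.

```python
IMAGE = "pdf2htmlex/pdf2htmlex:0.18.8.rc2-master-20200820-ubuntu-20.04-x86_64"


def docker_command(data_dir, pdf_name, image=IMAGE, mount_point="/pdf"):
    """Build (not run) a docker invocation converting PDF/<pdf_name> into HTML/."""
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:{mount_point}",  # mount the data dir into the container
        "-w", mount_point,                  # work relative to the mounted folder
        image,
        "--dest-dir", "HTML",               # pdf2htmlEX option: output folder
        f"PDF/{pdf_name}",
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True) once Docker and the image are available.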

Install lxml under Python 3 with pip3 install lxml, or just run the following to get freetype-py too:

pip3 install -r requirements.txt

Configure

Configure your project paths in your .env file and config.py, most importantly DATA_DIR. This can be any folder, for example DATA_DIR=/path/to/pdf-transcript/tests. If you use a Docker install of pdf2htmlEX, you also need to set DOCKER_INSTALL=1, which mounts your data dir to a path inside Docker. DOCKER_IMG_TAG is configurable as well. Go ahead and create your .env file and add DATA_DIR=...
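
As a rough illustration of the settings described above, the sketch below parses a .env file with the standard library only. It is not the project's actual loader (config.py may use a different mechanism); parse_env is a hypothetical helper, and treating a missing DOCKER_INSTALL as off is an assumption.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


example = """
# .env
DATA_DIR=/path/to/pdf-transcript/tests
DOCKER_INSTALL=1
"""
settings = parse_env(example)
data_dir = settings["DATA_DIR"]
use_docker = settings.get("DOCKER_INSTALL", "0") == "1"  # assumed default: off
```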

If you otherwise stick with the default configuration, your DATA_DIR should end up containing three folders: PDF, HTML and HTM. Create a PDF folder inside it and drop your PDFs there.

  • PDF is a folder where your PDFs are.
  • HTML is where pdf2htmlEX output (non-semantic HTML) ends up after running ./pdf2html.py, which just runs pdf2htmlEX with suitable options.
  • HTM is the final destination where semantic HTML gets born after running ./transcript.py.
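
The folder layout above can be sketched as follows. This is an illustration under assumptions, not code from the project: ensure_layout and pending_pdfs are hypothetical helpers, and the sketch assumes pdf2htmlEX names each output after its input file with an .html extension.

```python
from pathlib import Path

SUBFOLDERS = ("PDF", "HTML", "HTM")  # default folder names described above


def ensure_layout(data_dir):
    """Create the three working folders under data_dir if they are missing."""
    root = Path(data_dir)
    paths = {name: root / name for name in SUBFOLDERS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


def pending_pdfs(data_dir):
    """List PDFs that have no same-named .html file in HTML yet."""
    paths = ensure_layout(data_dir)
    done = {p.stem for p in paths["HTML"].glob("*.html")}
    return sorted(p for p in paths["PDF"].glob("*.pdf") if p.stem not in done)
```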

Run

./pdf2html.py

./transcript.py

When you change the configuration within ./transcript.py or tweak some code, you only need to rerun ./transcript.py.
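
The two-step run can be wrapped in a small driver. The sketch below assumes both scripts live in the project directory and can be run with the current interpreter; pipeline_steps and run_pipeline are hypothetical names, not part of the package.

```python
import subprocess
import sys


def pipeline_steps(convert=True):
    """Return the scripts to run, in order; skip conversion when HTML is up to date."""
    steps = []
    if convert:
        steps.append("pdf2html.py")  # PDF -> non-semantic HTML via pdf2htmlEX
    steps.append("transcript.py")    # non-semantic HTML -> semantic HTM
    return steps


def run_pipeline(project_dir=".", convert=True):
    """Run each step with the current interpreter, stopping at the first failure."""
    for script in pipeline_steps(convert):
        subprocess.run([sys.executable, script], cwd=project_dir, check=True)
```

After a configuration or code change, run_pipeline(convert=False) reruns only the transcript step.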

Development process

Set an expected (hand-adjusted) output to aim for, then improve the codebase to bring the transcript output closer to that ideal semantic output. Make sure your changes don't make the output worse for other tests. Use ruff check.
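
One way to track whether a change moves the output closer to the expected file is a line-based similarity score. The sketch below uses difflib from the standard library; how the project's own tests compare outputs is not stated here, so similarity and show_diff are hypothetical helpers.

```python
import difflib


def similarity(expected_html, actual_html):
    """Line-based similarity in [0, 1]; 1.0 means the outputs are identical."""
    matcher = difflib.SequenceMatcher(
        None, expected_html.splitlines(), actual_html.splitlines()
    )
    return matcher.ratio()


def show_diff(expected_html, actual_html):
    """Unified diff of expected vs. actual output, as a single string."""
    return "\n".join(
        difflib.unified_diff(
            expected_html.splitlines(),
            actual_html.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )
```

Recording the score for every test document before and after a change makes regressions in other tests easy to spot.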

