PDF to semantic HTML conversion.
Transcript contains Python programs whose job is to transcribe PDF into semantic HTML.
- Get semantic HTML from PDFs converted by pdf2htmlEX.
- Recover lost text from PDFs where true type font characters are nothing more than images of themselves.
- Batch process a folder full of PDFs ready for pdftranscript.
Read the docstrings for more information.
Example
PDF before and semantic HTML after
Installation
pip install pdftranscript
Install Python along with the latest pdf2htmlEX. On OS X with Homebrew:
brew install python3 pdf2htmlEX
or on Ubuntu/Debian:
sudo apt update && sudo apt install -y libfontconfig1 libcairo2 libjpeg-turbo8 ttfautohint
wget -O pdf2htmlEX.deb https://github.com/pdf2htmlEX/pdf2htmlEX/releases/download/v0.18.8.rc1/pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb
Check that sha256sum pdf2htmlEX.deb matches 4ef2698cbeb6995189ac...
sudo apt install ./pdf2htmlEX.deb
pdf2htmlEX -v
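The checksum comparison can also be scripted. The following is a minimal sketch, assuming the package was saved as pdf2htmlEX.deb in the current folder. The helper names sha256_of and digest_matches are hypothetical, and because the expected digest is only quoted in truncated form, the check compares just the leading characters that are given.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(path, expected_prefix):
    """True if the file's digest starts with the (possibly truncated) expected value."""
    return sha256_of(path).startswith(expected_prefix.lower())


if __name__ == "__main__":
    deb = Path("pdf2htmlEX.deb")  # assumed download location
    if deb.exists():
        # Only the leading part of the digest is quoted in the instructions.
        print(digest_matches(deb, "4ef2698cbeb6995189ac"))
```

A prefix match is weaker than comparing the full digest, so prefer the complete value from the release page when you have it.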
A Docker install of pdf2htmlEX is also supported (the Homebrew one has started failing lately). The following image is tested and used in the default config via DOCKER_IMG_TAG:
docker pull pdf2htmlex/pdf2htmlex:0.18.8.rc2-master-20200820-ubuntu-20.04-x86_64
Install lxml under Python 3:
pip3 install lxml
or just run the following to get freetype-py too:
pip3 install -r requirements.txt
Configure
Configure your project path in your .env file and config.py, most importantly the DATA_DIR. This can be any folder, for example DATA_DIR=/path/to/pdf-transcript/tests. If you use a Docker install of pdf2htmlEX, you'll need to set DOCKER_INSTALL=1. This will mount your data dir to the Docker path. DOCKER_IMG_TAG is also configurable. Go ahead, create your .env file and add
DATA_DIR=...
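To illustrate how these settings could fit together, here is a rough sketch. It is not the project's config.py, whose internals are not shown here. The functions read_env and docker_command are hypothetical, the /pdf mount point inside the container is an assumption, and the sketch also assumes DOCKER_IMG_TAG holds a complete image reference.

```python
from pathlib import Path


def read_env(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def docker_command(settings, pdf_name):
    """Build a `docker run` argument list that mounts DATA_DIR into the container."""
    data_dir = Path(settings["DATA_DIR"]).expanduser().resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:/pdf",    # /pdf is an assumed mount point
        "-w", "/pdf",                # work inside the mounted folder
        settings["DOCKER_IMG_TAG"],  # assumed to be a full image reference
        pdf_name,
    ]


if __name__ == "__main__":
    settings = read_env()
    if settings.get("DOCKER_INSTALL") == "1" and "DOCKER_IMG_TAG" in settings:
        print(" ".join(docker_command(settings, "example.pdf")))
```

Building the command as a list rather than a single string keeps paths with spaces intact if it is later passed to subprocess.run.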
Your DATA_DIR should end up containing 3 folders: PDF, HTML and HTM if you otherwise stick with default configuration. Create a 'PDF' folder inside and drop your PDFs there.
- PDF is a folder where your PDFs are.
- HTML is where pdf2htmlEX output (non-semantic HTML) ends up after running ./pdf2html.py, which just runs pdf2htmlEX with suitable options.
- HTM is the final destination where semantic HTML gets born after running ./transcript.py.
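The folder layout described above can be prepared and inspected with a few lines of Python. This is only a sketch under assumptions: ensure_layout and pending_pdfs are hypothetical helpers, not part of pdftranscript, and the sketch assumes each intermediate file in HTML keeps the PDF's base name with an .html extension.

```python
from pathlib import Path

FOLDERS = ("PDF", "HTML", "HTM")  # default folder names described above


def ensure_layout(data_dir):
    """Create the PDF, HTML and HTM folders under data_dir if they are missing."""
    root = Path(data_dir)
    paths = {name: root / name for name in FOLDERS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


def pending_pdfs(data_dir):
    """List PDFs that have no same-named .html file in the HTML folder yet."""
    paths = ensure_layout(data_dir)
    return sorted(
        pdf.name
        for pdf in paths["PDF"].glob("*.pdf")
        # Assumed naming: report.pdf -> report.html
        if not (paths["HTML"] / (pdf.stem + ".html")).exists()
    )
```

Calling pending_pdfs on your DATA_DIR gives a quick view of which PDFs still need the first conversion step.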
Run
./pdf2html.py
./transcript.py
When you change configuration within ./transcript.py or tweak some code, you only need to run ./transcript.py.
Development process
Set an expected (hand-adjusted) output to aim for, and improve the codebase to get the transcript output closer to that ideal semantic output. Make sure your changes don't make the output worse for other tests. Use ruff check to lint your changes.
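One way to make "closer to the ideal output" measurable is to score each test's output against its expected file and compare scores before and after a change. The sketch below uses difflib from the standard library. The function names and the idea of keeping scores in a dict keyed by test name are assumptions for illustration, not how the project's tests are organised.

```python
import difflib


def similarity(actual, expected):
    """Return a 0..1 ratio of how closely actual matches expected, line by line."""
    return difflib.SequenceMatcher(
        None, actual.splitlines(), expected.splitlines()
    ).ratio()


def regressions(before, after, tolerance=1e-9):
    """Names of tests whose score dropped between two runs."""
    return sorted(
        name
        for name, score in after.items()
        if score + tolerance < before.get(name, 0.0)
    )


if __name__ == "__main__":
    expected = "<h1>Title</h1>\n<p>Body</p>"
    actual = "<h1>Title</h1>\n<div>Body</div>"
    print(similarity(actual, expected))
```

A change is safe to keep when regressions returns an empty list, even if the score only improves for the test you were targeting.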