A deep-learning-powered application which turns PDFs into audio files. Featuring OCR improvement and TTS with inflection!
Project description
Reading for Listeners (r4l)
I have issues reading PDFs and listening to them helps me out massively! So I'm working on a user-friendly application that can be given a PDF (or txt file) and spit out an MP3 file. In the future, this'll be a fun server that'll do the hard work, but for now, it'll just be a Python/Bash project. This is a small personal project, so there won't be regular updates per se, but when I have time I'll push what I've got.
Features
Holistic OCR Improvement
The biggest problem with PDFs is that they either don't have text within the document (they're essentially images) or the existing text (usually the result of OCR) is of poor quality. OCR is often pretty bad on the PDFs I am given, so I use BERT (a masked language model) to improve spell-check results. In the future this'll be replaced by Microsoft's TrOCR.
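To make the idea concrete, here's a minimal sketch (not the project's actual code): ask a spell checker for candidate corrections, then let a masked language model pick the candidate that best fits the surrounding sentence. The specific model name, pyenchant, and the helper function below are my own assumptions for illustration.

```python
# Sketch only: score spell-check suggestions with a masked language model.
import enchant                      # pyenchant, backed by libenchant
from transformers import pipeline   # Hugging Face transformers

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
checker = enchant.Dict("en_US")

def correct_word(sentence: str, bad_word: str) -> str:
    """Replace one misspelled word with the suggestion BERT scores highest."""
    candidates = [c for c in checker.suggest(bad_word) if " " not in c]
    if not candidates:
        return bad_word
    masked = sentence.replace(bad_word, fill_mask.tokenizer.mask_token, 1)
    scored = fill_mask(masked, targets=candidates)
    return scored[0]["token_str"]

# OCR often confuses visually similar glyphs, e.g. capital "I" for lowercase "l":
print(correct_word("The resuIts of the experiment were clear.", "resuIts"))
```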
TTS with Inflection
If OCR were the only problem, I'd just pipe ocrmypdf's output into espeak and we'd be done. Unfortunately, espeak sounds terrible. There's no inflection and it's really hard to pay attention to it for long periods of time. That's where Coqui.ai's TTS comes to the rescue, making hours-long readings bearable.
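For a sense of how little glue that needs, here's a minimal Coqui TTS sketch. It assumes a recent Coqui TTS release and its public LJSpeech English model; the project may load a different voice.

```python
from TTS.api import TTS  # Coqui TTS (`pip install TTS`)

# Download (on first run) and load a pretrained English model, then synthesize.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Hours-long readings become a lot more bearable.",
                file_path="sample.wav")
```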
Always FOSS
The other solutions to this problem are closed source and cost a lot of money. This is free.
Simple UI
Eventually this project will have a neat web UI which'll require very little input from the end user. This is accessibility software after all -- it would be weird if it was hard to use. Unfortunately, for now I only have a CLI, which has only been tested on Linux. Not the best, but I gotta start somewhere.
Install
Windows
The "easiest" way of doing this is by installing WSL with Ubuntu and follow the Ubuntu/debian instructions.
If you're fancy and know how to python on windows, tell me how it goes and how you did it!
Note: unfortunately, it's hard to set up gpu stuff for WSL, and even then only really works for CUDA (NVIDIA) cards, which I have no way of testing as of now (not that I could test any gpu stuff now, but that's beyond the point).
Mac
Gotta say, I have no idea how to get all the dependencies (see Ubuntu/Debian) on Mac. A cursory glance says that brew or port should be able to get most of them, but I have no idea about their availability. If you have a Mac and figured this out, let me know how you did it!
Ubuntu/Debian
sudo apt install -y python3 python3-venv espeak ffmpeg tesseract-ocr-all python3-dev libenchant-dev libpoppler-cpp-dev pkg-config libavcodec libavtools ghostscript poppler-utils
Make and activate a virtual environment, install PyTorch (see the note below), then run
pip install reading4listeners
And you're all set to run r4l (see below for usage info).
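For the PyTorch step, a CPU-only install is usually enough for this (my assumption; check pytorch.org for the exact command matching your platform and, if you have one, your CUDA version):
pip install torch --index-url https://download.pytorch.org/whl/cpu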
Install from source
On Debian, run
sudo apt install -y python3 python3-venv espeak ffmpeg tesseract-ocr-all python3-dev libenchant-dev libpoppler-cpp-dev pkg-config libavcodec libavtools ghostscript poppler-utils
git clone https://github.com/CypherousSkies/pdf-to-speech
cd pdf-to-speech
python3 -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel cython
install PyTorch (see the PyTorch note in the Install section above), then
python setup.py develop
Takes ~2-3GB of disk space for install
Usage
r4l [--in_path in/] [--out_path out/] [--lang en]
This runs the suite of scanning and correction on all compatible files in the directory in/ and outputs mp3 files to out/ using the language en (square brackets denote optional parameters with their default values).
Run r4l --list_langs to list supported languages.
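For example, to convert everything in a (hypothetical) ~/papers/ directory to MP3s in ~/audio/:
r4l --in_path ~/papers/ --out_path ~/audio/ --lang en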
~~This program uses a lot of memory, so I'd advise expanding your swap size by ~10GB (for Debian, use fixswap.sh).~~ (This should be fixed now, but if it runs out of memory or crashes randomly, increase your swap size.)
Benchmarks
On my current setup (4 Intel i7 8th-gen cores, no GPU, Debian 10, 5GB RAM + 7GB swap), a run takes 0.124*(word count) - 3.8 seconds (r^2 = 0.942, n = 6), which is actually pretty good, clocking in at around 10 words per second with some overhead.
Unfortunately, almost all of the pdfs I'm experimenting with are in the 10s of thousands of words, which clocks in at around half an hour, which is less good for getting through my backlog. Ah well.
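As a quick sanity check on that estimate (my own arithmetic, not part of the project):

```python
def estimate_seconds(word_count: int) -> float:
    """Runtime estimate from the linear fit above (CPU-only setup)."""
    return 0.124 * word_count - 3.8

# A 15,000-word PDF works out to roughly half an hour:
print(estimate_seconds(15_000) / 60)  # ~30.9 minutes
```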
Under the Hood
At a high level, here's how this works:
input.pdf -> ocrmypdf (ghostscript -> unpaper -> tesseract-ocr) -> preprocessing (regex) -> ocr correction (BERT) -> postprocessing (regex) -> text to speech (Coqui.ai TTS) -> wav to mp3 (pydub) -> out.mp3
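The sketch below strings those stages together in Python purely as an illustration of the flow; it is not the project's actual implementation. It assumes ocrmypdf, poppler-utils (for pdftotext), a recent Coqui TTS, and pydub/ffmpeg are installed; the regexes stand in for the real pre-/post-processing, and the BERT correction pass is omitted.

```python
import re
import subprocess
from TTS.api import TTS            # Coqui TTS
from pydub import AudioSegment     # wav -> mp3 (needs ffmpeg)

def pdf_to_mp3(in_pdf: str, out_mp3: str) -> None:
    # 1. OCR the PDF (ocrmypdf drives ghostscript/unpaper/tesseract), then dump the text.
    subprocess.run(["ocrmypdf", "--force-ocr", in_pdf, "ocr.pdf"], check=True)
    subprocess.run(["pdftotext", "ocr.pdf", "raw.txt"], check=True)

    # 2. Light regex cleanup (the real project also runs a BERT correction pass here).
    text = open("raw.txt", encoding="utf-8").read()
    text = re.sub(r"-\n", "", text)    # re-join words hyphenated across lines
    text = re.sub(r"\s+", " ", text)   # collapse whitespace

    # 3. Text to speech, then wav to mp3.
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text=text, file_path="speech.wav")
    AudioSegment.from_wav("speech.wav").export(out_mp3, format="mp3")

pdf_to_mp3("input.pdf", "out.mp3")
```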
Future work
I'll almost certainly need to fine-tune TrOCR/BERT and TTS to better deal with the texts I'm interested in when I get access to a ML rig, but until then, I'll keep using the off-the-shelf models. Hopefully this can all be controlled by a nice, simple web ui and left running on a server for public use. Also I'd like to package this into an executable which requires minimal technical knowledge to use and maintain, but that's a far-off goal.
Project details
Download files
File details
Details for the file reading4listeners-0.0.4.post2.linux-x86_64.tar.gz.
File metadata
- Download URL: reading4listeners-0.0.4.post2.linux-x86_64.tar.gz
- Upload date:
- Size: 10.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 136305f0c1721958e753ce1700ea003be844467e8b5ddf89651b9e26ff56d497
MD5 | f363a0efce8bacc249ace381ddcb05c0
BLAKE2b-256 | 2d06f075ed0fe1f40d9af07d8643c31872802fa8eec9f99fdd2bb84e6346ed4f
File details
Details for the file reading4listeners-0.0.4.post2-py3-none-any.whl.
File metadata
- Download URL: reading4listeners-0.0.4.post2-py3-none-any.whl
- Upload date:
- Size: 21.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2ff1548fe41f3c6a256f26a65f8aa9ca020a74c54ce08de1af836879300c602a
MD5 | 2d15ee8058e87d63f8219084b99061f9
BLAKE2b-256 | 8ff544e7a71c39ba962ce5c7c3a324f9874eb77d02e759f14b40ef8bb16ba3d7