A deep-learning-powered application that turns PDFs into audio files, featuring OCR improvement and TTS with inflection!

Reading for Listeners (r4l)

I have trouble reading PDFs, and listening to them helps me out massively! So I'm working on a user-friendly application that takes a PDF (or txt file) and spits out an MP3 file. In the future, this'll be a fun server that'll do the hard work, but for now, it'll just be a Python/Bash project. This is a small personal project, so there won't be regular updates per se, but when I have time I'll push what I've got.

Features

Holistic OCR Improvement

The biggest problem with PDFs is they either don't have text within the document (are essentially images) or the existing text (usually the result of OCR) is of poor quality. OCR is often pretty bad on pdfs that I am given, so I use BERT (a masked language model) to improve spell-check results. In future this'll be replaced by Microsoft's TrOCR.
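
To make the idea concrete, here is a minimal sketch of context-aware spell correction. It is not the project's actual code: the candidate list, the `toy_score` table, and every function name are hypothetical. In a real setup the scorer would be a masked language model such as BERT rating how well each spell-check suggestion fits the surrounding words; here a tiny hand-written table stands in for it so the example runs with the standard library alone.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical stand-in for a masked language model: maps
# (previous word, candidate) to a plausibility score.
TOY_SCORES: Dict[Tuple[str, str], float] = {
    ("the", "cat"): 0.9,
    ("the", "cot"): 0.3,
    ("the", "cut"): 0.2,
}


def toy_score(previous_word: str, candidate: str) -> float:
    """Score a candidate given the word before it (assumed toy model)."""
    return TOY_SCORES.get((previous_word, candidate), 0.0)


def pick_correction(
    previous_word: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float] = toy_score,
) -> str:
    """Return the candidate the scorer rates most plausible in context."""
    return max(candidates, key=lambda c: score(previous_word, c))


# Suppose OCR produced "the c@t sat" and a spell checker suggested these fixes.
suggestions = ["cot", "cat", "cut"]
best = pick_correction("the", suggestions)
print(best)
```

The point of the design is that the spell checker only proposes candidates, while the language model decides between them using context, which a plain dictionary lookup cannot do.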

TTS with Inflection

If OCR was the only problem, I'd just pipe the output of ocrmypdf to espeak and we'd be done. Unfortunately, espeak sounds terrible: there's no inflection, and it's really hard to pay attention to for long periods of time. That's where Coqui.ai's TTS comes to the rescue, making hours-long readings bearable.

Always FOSS

The other solutions to this problem are closed source and cost a lot of money. This is free.

Simple UI

Eventually this project will have a neat web UI which'll require very little input from the end user. This is accessibility software after all -- it would be weird if it was hard to use.

Install

If you're just going to run it (on debian/ubuntu):

sudo apt install -y python3 python3-venv espeak ffmpeg tesseract-ocr-all python3-dev libenchant-dev libpoppler-cpp-dev pkg-config libavcodec libavtools ghostscript poppler-utils

Make a virtual environment, install PyTorch, then run

pip install reading4listeners

And you're all set!

Install from file

On debian, run

sudo apt install -y python3 python3-venv espeak ffmpeg tesseract-ocr-all python3-dev libenchant-dev libpoppler-cpp-dev pkg-config libavcodec libavtools ghostscript poppler-utils

git clone https://github.com/CypherousSkies/pdf-to-speech

cd pdf-to-speech

python3 -m venv venv

source venv/bin/activate

pip install -U pip setuptools wheel cython

Install PyTorch for your platform, then run

python setup.py develop

Takes ~2-3GB of disk space for install

Usage

r4l [--in_path in/] [--out_path out/] [--lang en] runs the suite of scanning and correction on all compatible files in the directory in/ and outputs mp3 files to out/ using the language en.

Run r4l --list_langs to list supported languages

Benchmarks

On my current setup (4 Intel i7 8th-gen cores, no GPU, Debian 10, 5 GB RAM + 7 GB swap), a run takes about 0.124*(word count)-3.8 seconds (r^2=0.942, n=6). That's actually pretty good: around 8 words per second (1/0.124). Unfortunately, almost all of the PDFs I'm experimenting with are in the tens of thousands of words, which clocks in at around half an hour or more, which is less good for getting through my backlog. Ah well.
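
The linear fit above can be turned into a quick estimator for how long a document will take. This is only a sketch: the function name is made up, the coefficients come from six runs on one machine, and clamping at zero is an added assumption because the fit goes negative for very short texts.

```python
def estimated_seconds(word_count: int) -> float:
    """Apply the fitted line 0.124 * words - 3.8 (seconds)."""
    # Clamp at zero: the fit is negative below roughly 31 words (assumption).
    return max(0.0, 0.124 * word_count - 3.8)


for words in (1_000, 10_000, 30_000):
    minutes = estimated_seconds(words) / 60
    print(f"{words:>6} words: about {minutes:.1f} minutes")
```

Expect different numbers on other hardware, especially with a GPU.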

Under the Hood

At a high level, here's how this works:

input.pdf -> ocrmypdf (ghostscript -> unpaper -> tesseract-ocr) -> preprocessing (regex) -> ocr correction (BERT) -> postprocessing (regex) -> text to speech (Coqui.ai TTS) -> wav to mp3 (pydub) -> out.mp3
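
As an illustration of what the regex preprocessing stage in this pipeline might do, here is a small sketch. The specific rules (rejoining words hyphenated across line breaks, collapsing whitespace) and the `preprocess` name are assumptions chosen because they are common OCR cleanup steps; they are not taken from the project's source.

```python
import re


def preprocess(text: str) -> str:
    """Clean raw OCR text before it is handed to the correction stage."""
    # Rejoin words split by a hyphen at a line break: "char-\nacter" -> "character".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Turn remaining line breaks and runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "Optical char-\nacter recognition   is\nimperfect."
print(preprocess(sample))
```

Doing this before correction means the language model sees whole words and sentences rather than fragments broken up by the page layout.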

Future work

I'll almost certainly need to fine-tune TrOCR/BERT and TTS to better deal with the texts I'm interested in when I get access to an ML rig, but until then, I'll keep using the off-the-shelf models. Hopefully this can all be controlled by a nice, simple web UI and left running on a server for public use. I'd also like to package this into an executable that requires minimal technical knowledge to use and maintain, but that's a far-off goal.
