Host your own local PDF server applying OCR and duplex scanning on your documents
pyPDFserver
pyPDFserver provides a bridge FTP server that accepts PDFs (for example from your network printer) and applies OCR, image optimization and/or merging of the two halves of a duplex scan. The final PDF is uploaded to your target machine (e.g. your NAS) via FTP.
Installation
pyPDFserver is designed to run in a Docker container, but you can also host it manually. First, install Python (>= 3.10), then install pyPDFserver via pip:
pip install pyPDFserver
Then you need to install the external dependencies for ocrmypdf (e.g. tesseract, ghostscript) by following this manual: https://ocrmypdf.readthedocs.io/en/latest/installation.html. You can then run pyPDFserver with
python -m pyPDFserver
After the first run, two configuration files named pyPDFserver.ini and profiles.ini are created in your system's configuration folder (the console output shows the exact paths). Edit them with your settings and restart pyPDFserver.
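Before restarting, it can help to check that an edited profiles.ini still parses and that each profile resolves to the values you expect. The following is a minimal sketch using Python's standard configparser module. It assumes the file is a standard INI file in which missing keys fall back to the DEFAULT section; the excerpt and the helper name load_profiles are illustrative, and pyPDFserver's own loader may differ.

```python
import configparser

# Shortened, hypothetical excerpt of a profiles.ini. The real file is located
# at the path printed in the console output.
EXAMPLE_PROFILES = """
[DEFAULT]
username = pyPDFserver
ocr_enabled = False
ocr_language =
input_pdf_name = SCAN_(*).pdf

[DE]
username = pyPDFserver_de
ocr_enabled = True
ocr_language = deu
"""


def load_profiles(text: str) -> configparser.ConfigParser:
    """Parse INI text; keys missing in a section fall back to [DEFAULT]."""
    # interpolation=None keeps characters such as '%' in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


profiles = load_profiles(EXAMPLE_PROFILES)
for name in profiles.sections():  # sections() does not list DEFAULT itself
    profile = profiles[name]
    print(
        name,
        profile["username"],
        profile.getboolean("ocr_enabled"),
        profile["ocr_language"],
        profile["input_pdf_name"],  # not set in [DE], taken from [DEFAULT]
    )
```

To check a real file, replace read_string with parser.read(path), using the path from the console output.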
Usage
Now connect to the pyPDFserver FTP server and upload your files. After processing (OCR may take several minutes), the results are uploaded to the external FTP server configured in pyPDFserver.ini.
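If you want to test the server without a scanner, any FTP client will do. Below is a minimal sketch using Python's standard ftplib. The helper names remote_name and upload_pdf are illustrative, and the host, port, credentials and file name in the usage comment are placeholders to replace with the values from your own configuration. The sketch assumes the default input_pdf_name template SCAN_(*).pdf.

```python
from ftplib import FTP
from pathlib import Path


def remote_name(stem: str, prefix: str = "SCAN_") -> str:
    """Build a file name matching the default template SCAN_(*).pdf."""
    return f"{prefix}{stem}.pdf"


def upload_pdf(path: Path, host: str, port: int, user: str, password: str) -> str:
    """Upload one PDF in binary mode and return the server's reply line."""
    with FTP() as ftp:  # the context manager closes the connection on exit
        ftp.connect(host, port)
        ftp.login(user, password)
        with path.open("rb") as handle:
            return ftp.storbinary(f"STOR {remote_name(path.stem)}", handle)


# Example call (placeholder values, requires a running server):
# upload_pdf(Path("invoice.pdf"), "127.0.0.1", 21, "pyPDFserver", "your-password")
```

Note that the user name selects the profile, so uploading as a different user applies that profile's settings.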
OCR
pyPDFserver uses OCRmyPDF to apply OCR to your PDFs. Set ocr_enabled to True in your profile to enable it. For the best OCR results, also define a language (ocr_language) in profiles.ini.
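To make the OCR options in profiles.ini more concrete, the sketch below maps them to the equivalent OCRmyPDF command line. This is an illustration only: the function build_ocrmypdf_command is hypothetical, and pyPDFserver may well call OCRmyPDF through its Python API rather than the CLI. The flags themselves (--language, --deskew, --rotate-pages, --optimize, --tesseract-timeout) are regular OCRmyPDF options.

```python
def build_ocrmypdf_command(profile: dict, input_pdf: str, output_pdf: str) -> list[str]:
    """Translate profile keys into an ocrmypdf argument list (illustrative)."""
    cmd = ["ocrmypdf"]
    if profile.get("ocr_language"):
        cmd += ["--language", profile["ocr_language"]]
    if profile.get("ocr_deskew"):
        cmd.append("--deskew")
    if profile.get("ocr_rotate_pages"):
        cmd.append("--rotate-pages")
    # Defaults below mirror the values shown in the DEFAULT profile.
    cmd += ["--optimize", str(profile.get("ocr_optimize", 1))]
    cmd += ["--tesseract-timeout", str(profile.get("ocr_tesseract_timeout", 60))]
    cmd += [input_pdf, output_pdf]
    return cmd


example_profile = {
    "ocr_language": "deu",
    "ocr_deskew": True,
    "ocr_rotate_pages": True,
    "ocr_optimize": 1,
    "ocr_tesseract_timeout": 60,
}
print(" ".join(build_ocrmypdf_command(example_profile, "in.pdf", "out.pdf")))
```

Running the printed command by hand on a sample scan is a quick way to confirm that tesseract and the required language pack are installed correctly.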
Duplex scan
pyPDFserver allows you to automatically merge two scans of the front and back pages (i.e. duplex 1 and duplex 2) into a single file. This is intended to be used with an Automatic Document Feeder (ADF). Keep the following in mind:
- The uploaded files must match the input_duplex1_name and input_duplex2_name templates in your profiles.ini
- The back pages must be in reversed order in the PDF file (as you simply turn the stack around for scanning)
- The page count of both files must match or the task is rejected
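The rules above amount to interleaving the front pages with the back pages read in reverse. A minimal sketch of that ordering logic on plain lists follows; the function merge_duplex is illustrative and not pyPDFserver's actual implementation, which operates on PDF pages.

```python
from typing import Sequence, TypeVar

Page = TypeVar("Page")


def merge_duplex(front: Sequence[Page], back: Sequence[Page]) -> list[Page]:
    """Interleave front pages with the back pages taken in reverse order."""
    if len(front) != len(back):
        # Mirrors the rule that mismatching page counts are rejected.
        raise ValueError(f"page count mismatch: {len(front)} front, {len(back)} back")
    merged: list[Page] = []
    for front_page, back_page in zip(front, reversed(back)):
        merged.append(front_page)
        merged.append(back_page)
    return merged


# A 6-page document: the fronts are scanned as 1, 3, 5; after turning the
# stack around, the backs arrive as 6, 4, 2.
print(merge_duplex(["p1", "p3", "p5"], ["p6", "p4", "p2"]))
# ['p1', 'p2', 'p3', 'p4', 'p5', 'p6']
```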
Commands
At any time you can see your progress in the console by using
- tasks list: List all running and recently finished or failed tasks
Other useful commands are
- exit: Terminate the server and clear temporary files
- version: List the installed version
Some internal commands you don't usually need to use:
- tasks force_clear: Clear all scheduled and finished tasks (does not abort the current task)
- artifacts list: Internal command to list all artifacts
- artifacts clean: Remove some untracked artifacts to release some storage (usually not needed)
Configuration
pyPDFserver.ini
[SETTINGS]
# Set here the desired log level (CRITICAL, ERROR, WARNING, INFO, DEBUG)
log_level = INFO
# If set to False, disable interactive console input
use_prompt_session = False
# If set to true, use colors for the console output
log_colors = True
# If set to true, create log files
log_to_file = True
# Time for the backpages of a duplex scan to arrive after the front page upload before
# timing out. Set to zero to disable the timeout
duplex_timeout = 600
# If set to True, pyPDFserver will search after start for old temporary files and delete them
clean_old_temporary_files = True
[FTP]
host =
local_ip = 127.0.0.1
port = 21
# Define passive ports as a comma-separated list, e.g. 6000,6001,6010-6020,6030
# If running behind a NAT (e.g. in a Docker container), you should define some ports here
# and allow them in the network settings of your firewall
passive_ports = 23001-23010
[EXPORT_FTP_SERVER]
# Set here the address and credentials for the external FTP server
host =
port =
username =
password =
# If your pyPDFserver is running behind a NAT (e.g. in a Docker container), you may want
# to set control ports (the port used to open a connection to the external FTP server)
# and allow them in the network settings of your firewall
control_port = 23000
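The passive_ports value mixes single ports and ranges. When running behind a NAT it is useful to expand it into the full list of ports that have to be opened or published. Below is a minimal sketch of such an expansion, assuming ranges are inclusive on both ends; the function parse_passive_ports is illustrative and not part of pyPDFserver.

```python
def parse_passive_ports(value: str) -> list[int]:
    """Expand a list such as '6000,6001,6010-6020' into individual ports."""
    ports: list[int] = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty entries such as a trailing comma
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid port range: {part}")
            ports.extend(range(start, end + 1))  # assumed inclusive range
        else:
            ports.append(int(part))
    return ports


print(parse_passive_ports("6000,6001,6010-6012,6030"))
# [6000, 6001, 6010, 6011, 6012, 6030]
```

With the default value 23001-23010 this yields ten ports, all of which need to be reachable from the scanner in addition to the control port.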
profiles.ini
# You can define different profiles to use different settings (e.g. different languages for OCR,
# different optimization levels or file names). Every profile must have a unique username.
# All other fields fall back to the DEFAULT profile if not provided.
[DEFAULT]
# The username for the FTP server
username = pyPDFserver
# The password for the FTP server. Note that after the first run it will be replaced with
# a hash value. To change it, remove the hash and enter your new password. After the next run,
# it will again be replaced with its hash value
password =
# OCR settings
# Refer to https://ocrmypdf.readthedocs.io/en/latest/optimizer.html for a more thorough explanation
ocr_enabled = False
# Set the three-letter language code for tesseract OCR (e.g. deu, eng). You must first
# install the matching language pack for tesseract
ocr_language =
# Correct pages that were scanned at a skewed angle by rotating them back into place
# (--deskew option for OCRmyPDF)
ocr_deskew = True
# Optimization level passed to OCRmyPDF
# (0: no optimization, 1: lossless optimizations, 2: some lossy optimizations, 3: aggressive optimization)
ocr_optimize = 1
# Attempts to determine the correct orientation for each page and rotates the page if necessary
# (--rotate-pages parameter for OCRmyPDF)
ocr_rotate_pages = True
# (--tesseract-timeout parameter for OCRmyPDF)
ocr_tesseract_timeout = 60
# File name settings
# When uploading a file to pyPDFserver, it is matched against the given template strings
# and rejected if not matching any. You can use tags (which are replaced by pyPDFserver with
# regex commands) to catch groups
# Available tags:
# (lang): Catch a 3-letter language code
# (*): Catch everything
# In export_duplex_name you can also use
# (*1): Fill in (*) from duplex1
# (*2): Fill in (*) from duplex2
# If set to True, all file names are matched case-sensitively against the templates
input_case_sensitive = True
# Template string for pdf files
input_pdf_name = SCAN_(*).pdf
# Template string to export pdf files
export_pdf_name = Scan_(*).pdf
# Template string for duplex pdf files (1 for front pages, 2 for back pages)
input_duplex1_name = DUPLEX1_(*).pdf
input_duplex2_name = DUPLEX2_(*).pdf
# Template string to export duplex pdf files
export_duplex_name = Scan_(*1)_(lang).pdf
# Path on the external FTP server to upload to
export_path =
# Two example profiles. You can define as many as you like
[DE]
username = pyPDFserver_de
ocr_enabled = True
ocr_language = deu
[EN]
username = pyPDFserver_en
ocr_enabled = True
ocr_language = eng
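The file name templates above are described as being turned into regular expressions. The sketch below shows one way such a conversion could work, so you can try out a template before uploading. It is an assumption-laden illustration: the functions template_to_regex and match_template and the exact patterns chosen for (*) and (lang) are hypothetical and may differ from what pyPDFserver does internally.

```python
import re
from typing import Optional

# Assumed regex for each tag: (lang) is three letters, (*) is any non-empty text.
TAG_PATTERNS = {"(lang)": r"([A-Za-z]{3})", "(*)": r"(.+)"}


def template_to_regex(template: str, case_sensitive: bool = True) -> re.Pattern:
    """Escape the literal parts of a template and substitute the tags."""
    # The capturing group makes re.split keep the tags in the result list.
    parts = re.split(r"(\(lang\)|\(\*\))", template)
    pattern = "".join(TAG_PATTERNS.get(part, re.escape(part)) for part in parts)
    return re.compile(pattern, 0 if case_sensitive else re.IGNORECASE)


def match_template(template: str, file_name: str,
                   case_sensitive: bool = True) -> Optional[re.Match]:
    """Return a match only if the whole file name fits the template."""
    return template_to_regex(template, case_sensitive).fullmatch(file_name)


match = match_template("SCAN_(*).pdf", "SCAN_2024-05-01.pdf")
# Reuse the captured text in the export template.
export_name = "Scan_(*).pdf".replace("(*)", match.group(1))
print(export_name)  # Scan_2024-05-01.pdf
```

With input_case_sensitive = True a file named scan_2024-05-01.pdf would not match SCAN_(*).pdf in this sketch, which is a common reason for an upload to be rejected.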