Docbarcodes extracts 1D and 2D barcodes from scanned PDF documents or images.
Intro
Many documents and forms carry barcodes to enable machine reading and reduce manual processing effort. Simple 1D barcodes can, for example, mark a specific page of a form or encode a document identification number. More complex 2D barcodes can encode the full data of a document in structured form.
Docbarcodes extracts 1D and 2D barcodes from scanned PDF documents or images. It can be used to automate the extraction and processing of all kinds of documents.
Document types known to work include:
- Swiss tax statements (Zurich, other cantons can be added as well)
- Swiss salary statement
- US driver's licenses with MRZ (machine-readable zone)
- Swiss QR code invoices, introduced by SIX Group
- Swiss Covid Certificates
The approach works as follows:
- Detect barcode regions in the document using OpenCV and image transformation heuristics.
- Extract the raw barcode data using ZXing.
- Combine multiple barcodes and decode the data.
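The last step can be illustrated with a minimal sketch. It is not the docbarcodes implementation: the helper names `combine_fragments` and `decode_payload` are hypothetical, and the sketch assumes that each fragment carries a segment index and that the joined payload may embed a ZIP archive (the raw example output further down contains a `PK` signature, which hints at this).

```python
import io
import zipfile

ZIP_SIGNATURE = b"PK\x03\x04"  # magic bytes of a ZIP local file header


def combine_fragments(fragments):
    """Join (segment_index, payload) pairs in index order (hypothetical helper)."""
    ordered = sorted(fragments, key=lambda item: item[0])
    return b"".join(payload for _, payload in ordered)


def decode_payload(payload):
    """Decode a combined payload to text.

    Assumption: if the payload embeds a ZIP archive, the document data is
    the first archive member; otherwise the payload is treated as UTF-8 text.
    """
    start = payload.find(ZIP_SIGNATURE)
    if start == -1:
        return payload.decode("utf-8", errors="replace")
    # Skip any proprietary header that precedes the archive.
    with zipfile.ZipFile(io.BytesIO(payload[start:])) as archive:
        first_member = archive.namelist()[0]
        return archive.read(first_member).decode("utf-8")


def demo():
    """Build a zipped payload, split it into two out-of-order fragments, decode it."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("txab", "<T><DocID>1</DocID></T>")
    blob = b"\x00\x02HDR" + buffer.getvalue()  # made-up header bytes
    half = len(blob) // 2
    fragments = [(1, blob[half:]), (0, blob[:half])]
    return decode_payload(combine_fragments(fragments))


if __name__ == "__main__":
    print(demo())
```

Sorting by index before joining keeps the result independent of the order in which barcodes were detected on the page.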
Quick start
Required:
- Java 8+
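To check the Java requirement before installing, a small script along the following lines may help. This is only a sketch and not part of docbarcodes: the function names are hypothetical, and it assumes that `java -version` prints a quoted version string such as `"1.8.0_292"` or `"17.0.2"`.

```python
import re
import shutil
import subprocess


def java_major_version(version_string):
    """Map a Java version string to its major version.

    Versions before Java 9 are written as 1.x (for example "1.8.0_292" is Java 8).
    """
    parts = version_string.split(".")
    major = int(re.match(r"\d+", parts[0]).group())
    if major == 1 and len(parts) > 1:
        return int(re.match(r"\d+", parts[1]).group())
    return major


def installed_java_major_version():
    """Return the major version of the `java` found on PATH, or None."""
    if shutil.which("java") is None:
        return None
    completed = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # `java -version` usually writes to stderr; check both streams to be safe.
    match = re.search(r'version "([^"]+)"', completed.stderr + completed.stdout)
    return java_major_version(match.group(1)) if match else None


if __name__ == "__main__":
    version = installed_java_major_version()
    if version is None or version < 8:
        print("Java 8+ not found")
    else:
        print(f"Found Java {version}")
```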
Install package
uv tool install docbarcodes
Download example pdf document
wget https://github.com/ArlindNocaj/document-barcodes/raw/master/data/salary_swissdec/SalarySinglePage.pdf
CLI usage
Extract the barcodes from the salary statement and pretty-print the resulting JSON with jq:
docbarcodes ./SalarySinglePage.pdf | jq .
{
"BarcodesRaw": [
{
"page": 0,
"num_candidate": 2,
"raw": "´cý¸z\u0000\u0002V\u0001\u0003\u0000\u0001\u0000\u0001PK\u0003\u0004\u0014\u0000\b\b\b\u0000év\u0003Q\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000txabMÝnâ0\u0010
_Åò}b;¡!¬\u001cWá¯EbYÔÐjÕ;\u0013Ü\u00105±+Û,ð>û&ûb;\tÐö\"Îñ\u0019Ïçùý©mÐ\u001fe]mtYH1Rº4»ZW\u0019~ÞÌ\u0014ß\u000b¾A¦]÷Þ\u007fü äx<îX;·SeXî+÷ªÄíHD#J£B6Ò§ªôpúæ1*\u0016Ó\fS\nm³{é$ÃOLû!õ\u0019=/¦Áx^dxò8\u000bX\u0014×\u000f£5 vkyh$Z)ç\u001be1z|\n&ÁJ¶\nj~ý\\/g¿Ñ¬QïÞ\u001a]¿£ü\u0001£×Å:ÃqÒµ,3o½ª®PáÕAYPJ\u0003·Jù\fÏêJÙ£ªP\u0002=ó[s±D\u0003(®ý9Ãý\u0001²×{£¡#c\u0014Å)\u001aÐa\"ø
r1EKé¼î©rµ1×öÓÐZÞ¨èê\u000b`ùﯮ*éó\u0016V
¢ï$É\b\n¦MOs£\u001a+«Ñ\bvæ ½í\u0002\u0017/A^\u0004++º\u0019\u000eîa(\u000b£Ï\u0010'7bÁsÁ§¦\u0004Å8¹î:µÙ\tþfM+\"ÊÑB¼78ôª«ÏrrqúSûÂ\u0005<¥VAÊX\u001a\u000e!ájð\u0007k»n\u0014Æ\u0010R~wùJù«¼KâAÜ'|y\u0000/ÙÿPK\u0007\bK¦nJÎ\u0001\u0000\u0000Á\u0002\u0000\u0000PK\u0001\u0002\u0014\u0000\u0014\u0000\b\b\b\u0000év\u0003QK¦nJÎ\u0001\u0000\u0000Á\u0002\u0000\u0000\u0004\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000txabPK\u0005\u0006\u0000\u0000\u0000\u0000\u0001\u0000\u0001\u00002\u0000\u0000\u0000\u0000\u0002\u0000\u0000\u0000\u0000",
"format": "PDF_417",
"points": [
[
0.08509771986970684,
0.2564102564102564
],
[
0.32166123778501626,
0.2564102564102564
],
[
0.32166123778501626,
0.35522904062229904
],
[
0.08509771986970684,
0.35522904062229904
]
],
"resultMetadata": {
"ERROR_CORRECTION_LEVEL": "2",
"PDF417_EXTRA_METADATA": {
"addressee": "None",
"checksum": -1,
"fileId": "None",
"fileName": "None",
"fileSize": -1,
"optionalData": "None",
"segmentCount": -1,
"segmentIndex": 0,
"sender": "None",
"timestamp": -1
}
}
}
],
"BarcodesCombined": [
{
"content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?><T xmlns=\"http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB\" SID=\"000\" SysV=\"001\"><Company UID-BFS=\"CHE-123.123.123\" Person=\"Paula Nestler\" HR-RC-Name=\"COMPLEX Elektronik AG\" ZIP=\"3600\" CL=\"Abteilung Steuerungen\" Street=\"Eigerweg 6\" Postbox=\"124\" City=\"Thun\" Phone=\"033 238 49 71\"/><PersonID Lastname=\"Aebi\" Firstname=\"Anna\" ZIP=\"3000\" CL=\"\" Street=\"Länggassstrasse 26\" Postbox=\"690\" Locality=\"\" City=\"Bern 9\" Country=\"\"><SV-AS-Nr>123.4567.8901.28</SV-AS-Nr></PersonID><A><DocID>1</DocID><Period><from>2016-10-01</from><until>2016-11-30</until></Period><Income>48118.70</Income><GrossIncome>68000.00</GrossIncome><NetIncome>56343.00</NetIncome></A></T>",
"format": "PDF_417",
"sources": [
0
]
}
]
}
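The `content` field of a combined barcode is plain XML in this example, so it can be processed with standard tools. The following sketch uses only the Python standard library; `summarize_salary` is a hypothetical helper, and the sample input is an abbreviated version of the output above rather than a complete document.

```python
import json
import xml.etree.ElementTree as ET

# Namespace taken from the example output above.
NS = {"sd": "http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB"}


def summarize_salary(cli_output):
    """Pull a few fields out of the JSON printed by the CLI (hypothetical helper)."""
    result = json.loads(cli_output)
    summaries = []
    for combined in result["BarcodesCombined"]:
        # Encode to bytes so the parser honours the declared UTF-8 encoding.
        root = ET.fromstring(combined["content"].encode("utf-8"))
        person = root.find("sd:PersonID", NS)
        summaries.append({
            "lastname": person.get("Lastname"),
            "firstname": person.get("Firstname"),
            "gross_income": root.findtext("sd:A/sd:GrossIncome", namespaces=NS),
            "net_income": root.findtext("sd:A/sd:NetIncome", namespaces=NS),
        })
    return summaries


# Abbreviated sample shaped like the CLI output above.
SAMPLE = json.dumps({
    "BarcodesRaw": [],
    "BarcodesCombined": [{
        "content": (
            '<?xml version="1.0" encoding="UTF-8"?>'
            '<T xmlns="http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB">'
            '<PersonID Lastname="Aebi" Firstname="Anna"/>'
            "<A><GrossIncome>68000.00</GrossIncome>"
            "<NetIncome>56343.00</NetIncome></A></T>"
        ),
        "format": "PDF_417",
        "sources": [0],
    }],
})

if __name__ == "__main__":
    print(summarize_salary(SAMPLE))
```

The income values are kept as strings here; converting them to `decimal.Decimal` would be a reasonable next step when amounts are processed further.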
Code Usage
from docbarcodes.extract import process_document
barcodes_raw, barcodes_combined = process_document("./SalarySinglePage.pdf")
print(barcodes_raw)
print(barcodes_combined)
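The `points` values in the CLI example all lie between 0 and 1, which suggests they are corner coordinates relative to the page size. Under that assumption (it is not confirmed by the documentation), and assuming each point is an `(x, y)` pair, a bounding box in pixels could be computed as sketched below. `bounding_box_pixels` is a hypothetical helper, not part of the docbarcodes API.

```python
def bounding_box_pixels(points, page_width, page_height):
    """Convert relative corner points to a (left, top, right, bottom) pixel box.

    Assumption: each point is (x, y) with x a fraction of the page width
    and y a fraction of the page height.
    """
    xs = [x * page_width for x, _ in points]
    ys = [y * page_height for _, y in points]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))


# Corner points copied from the example output above.
EXAMPLE_POINTS = [
    [0.08509771986970684, 0.2564102564102564],
    [0.32166123778501626, 0.2564102564102564],
    [0.32166123778501626, 0.35522904062229904],
    [0.08509771986970684, 0.35522904062229904],
]

if __name__ == "__main__":
    # Page size of 1000 x 1000 pixels chosen only for illustration.
    print(bounding_box_pixels(EXAMPLE_POINTS, 1000, 1000))
```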
Development
Use uv for local development and testing:
uv sync --extra dev
uv run pytest
If jpype1 needs to build from source on your platform, install Apache Ant alongside Java before running uv sync.
FAQ
PDF rendering is handled by pypdfium2, which bundles its own PDFium binary, so no external poppler installation is required.
OpenCV is pulled in via opencv-python-headless, so no system OpenGL libraries (libGL.so.1) are required on headless servers or containers.
Show package licenses
pip-licenses --with-urls --with-system --format=markdown
Planned improvements:
- Implement a multithreaded ZXing wrapper class in Java that returns proper objects for consumption from Python.
- Add extension mechanisms for other 2D barcode aggregations.