Docbarcodes extracts 1D and 2D barcodes from scanned PDF documents or images.

Intro

Barcodes appear on many documents and forms to enable machine reading and reduce manual processing effort. A simple 1D barcode can, for example, mark a specific page of a form or carry a document identification number. More complex 2D barcodes can even encode the full data of the document in structured form.

Docbarcodes extracts 1D and 2D barcodes from scanned PDF documents or images. It can be used to automate the extraction and processing of many kinds of documents.

Document types known to work include:

  • Swiss tax statements (Zurich; other cantons can be added as well)
  • Swiss salary statements
  • US driver's licenses with MRZ (machine-readable zone)
  • Swiss QR Code invoices, introduced by SIX Group
  • Swiss Covid certificates

The approach works as follows:

  1. Detect barcode regions in the document using OpenCV and image transformation heuristics.
  2. Extract the raw barcode data using ZXing.
  3. Combine multiple barcodes and decode the data.
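
To make step 3 more concrete, here is a minimal sketch of what "combine and decode" could look like. It is not the library's actual implementation. It assumes, hypothetically, that each barcode carries one slice of a byte stream, that slices come with an index, and that the joined stream holds an optional header followed by a ZIP archive. The ZIP assumption is motivated by the sample output further down, whose raw payload contains the `PK` signature and a member named `txab`. The function name `combine_and_decode` and the `(index, payload)` layout are illustrative only.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # local file header signature that starts a ZIP archive


def combine_and_decode(segments):
    """Join (index, payload) pairs in index order and unzip the first member.

    Hypothetical layout: the joined payloads form an optional header
    followed by a ZIP archive with a single UTF-8 text member.
    """
    ordered = sorted(segments, key=lambda segment: segment[0])
    stream = b"".join(payload for _, payload in ordered)
    start = stream.find(ZIP_MAGIC)
    if start < 0:
        raise ValueError("no ZIP archive found in combined barcode data")
    with zipfile.ZipFile(io.BytesIO(stream[start:])) as archive:
        first_member = archive.namelist()[0]
        return archive.read(first_member).decode("utf-8")


# Demo with synthetic data: zip a small XML string, prepend a made-up header,
# and split the bytes across two "barcodes" that arrive out of order.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("txab", "<T><DocID>1</DocID></T>")
data = b"\x00\x01HEADER" + buffer.getvalue()
half = len(data) // 2
print(combine_and_decode([(1, data[half:]), (0, data[:half])]))
```

Sorting by index before joining keeps the sketch independent of the order in which barcodes are detected on the page.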

Quick start

Required:

  • Java 8+

Install package

uv tool install docbarcodes

Download an example PDF document

wget https://github.com/ArlindNocaj/document-barcodes/raw/master/data/salary_swissdec/SalarySinglePage.pdf 

CLI usage

Extract the barcodes from the salary statement and pretty-print the resulting JSON with jq.

docbarcodes ./SalarySinglePage.pdf | jq .
{
  "BarcodesRaw": [
    {
      "page": 0,
      "num_candidate": 2,
      "raw": "´cý¸z\u0000\u0002V\u0001\u0003\u0000\u0001\u0000\u0001PK\u0003\u0004\u0014\u0000\b\b\b\u0000év\u0003Q\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000txabM’Ýnâ0\u0010…_Åò}b;¡!¬\u001cWá¯EbYÔÐjÕ;\u0013Ü\u00105±+Û,ð>û&ûb;\tÐö\"Îñ™\u0019Ïç‘ùý©mÐ\u001fe]mt†YH1Rº4»ZW\u0019~Þ̃\u0014ß\u000b¾A¦]†÷Þ\u007fü äx<†îX;·SeXî‰+÷ª•ÄíHD#J£ˆ’B6Ҟ§ª„Ÿôpú攏1*\u0016Ó\fS\nmг{é$ÂOLû!õ\u0019=/¦Áx^dxò8\u000bX\u0014‡×\u000f£5 v„kyh$Z)ç\u001be1z|\nž&ÁJ¶\nj~ý\\/g¿Ñ¬QïÞ\u001a]¿£ü\u0001£×Å:ÃqÒµœ,3œo½ª›ƒ®PáÕAYPJ\u0003·Jù\fÏêJÙ£ªP\u0002=ó[s‚±D\u0003(®ý9Ûý\u0001²×{£¡#c\u0014Å)\u001aŒÐa\"ø…r1EKé¼î©rµ1š×öÓÐZÞ¨èê\u000b`ùﯮ*éœó\u0016V…¢ï$É\b\n–¦”MOs£\u001a+«Ñ\bvæ ½í\u0002‚\u0017/A^\u0004++º\u0019\u000eî’a˜Ž(\u000b£”“Ï\u0010'7bÁsÁ§¦\u0004Å8¹ˆî:µÙ\tþfM+\"ʒ€Ñ€B¼78ôª›«Ï‚˜rrqúSûÂ\u0005<¥V‰AÊX\u001a\u000e!ájð\u0007kœ»n’\u0014Æ\u0010Rˆ~wùJù«¼KâAÜ'|yœ\u0000/وÿPK\u0007\bK¦nJÎ\u0001\u0000\u0000Á\u0002\u0000\u0000PK\u0001\u0002\u0014\u0000\u0014\u0000\b\b\b\u0000év\u0003QK¦nJÎ\u0001\u0000\u0000Á\u0002\u0000\u0000\u0004\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000txabPK\u0005\u0006\u0000\u0000\u0000\u0000\u0001\u0000\u0001\u00002\u0000\u0000\u0000\u0000\u0002\u0000\u0000\u0000\u0000",
      "format": "PDF_417",
      "points": [
        [
          0.08509771986970684,
          0.2564102564102564
        ],
        [
          0.32166123778501626,
          0.2564102564102564
        ],
        [
          0.32166123778501626,
          0.35522904062229904
        ],
        [
          0.08509771986970684,
          0.35522904062229904
        ]
      ],
      "resultMetadata": {
        "ERROR_CORRECTION_LEVEL": "2",
        "PDF417_EXTRA_METADATA": {
          "addressee": "None",
          "checksum": -1,
          "fileId": "None",
          "fileName": "None",
          "fileSize": -1,
          "optionalData": "None",
          "segmentCount": -1,
          "segmentIndex": 0,
          "sender": "None",
          "timestamp": -1
        }
      }
    }
  ],
  "BarcodesCombined": [
    {
      "content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?><T xmlns=\"http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB\" SID=\"000\" SysV=\"001\"><Company UID-BFS=\"CHE-123.123.123\" Person=\"Paula Nestler\" HR-RC-Name=\"COMPLEX Elektronik AG\" ZIP=\"3600\" CL=\"Abteilung Steuerungen\" Street=\"Eigerweg 6\" Postbox=\"124\" City=\"Thun\" Phone=\"033 238 49 71\"/><PersonID Lastname=\"Aebi\" Firstname=\"Anna\" ZIP=\"3000\" CL=\"\" Street=\"Länggassstrasse 26\" Postbox=\"690\" Locality=\"\" City=\"Bern 9\" Country=\"\"><SV-AS-Nr>123.4567.8901.28</SV-AS-Nr></PersonID><A><DocID>1</DocID><Period><from>2016-10-01</from><until>2016-11-30</until></Period><Income>48118.70</Income><GrossIncome>68000.00</GrossIncome><NetIncome>56343.00</NetIncome></A></T>",
      "format": "PDF_417",
      "sources": [
        0
      ]
    }
  ]
}
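
The JSON above can be post-processed with a few lines of Python. The sketch below summarizes each entry of `BarcodesRaw` and converts its `points` into a pixel bounding box. It assumes that the points are `(x, y)` pairs normalized to the page width and height, which the values between 0 and 1 suggest but which is not confirmed here. The helper names `to_pixel_box` and `summarize` and the 1000 x 1400 page size are made up for illustration.

```python
import json


def to_pixel_box(points, width, height):
    """Scale normalized (x, y) corner points to a (left, top, right, bottom) pixel box."""
    xs = [x * width for x, _ in points]
    ys = [y * height for _, y in points]
    return round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys))


def summarize(result_json, width, height):
    """Return page, format, and pixel box for every raw barcode in the CLI output."""
    result = json.loads(result_json)
    return [
        {
            "page": barcode["page"],
            "format": barcode["format"],
            "box": to_pixel_box(barcode["points"], width, height),
        }
        for barcode in result["BarcodesRaw"]
    ]


# Abridged copy of the output above (only the fields used here).
SAMPLE_OUTPUT = json.dumps({
    "BarcodesRaw": [{
        "page": 0,
        "format": "PDF_417",
        "points": [
            [0.08509771986970684, 0.2564102564102564],
            [0.32166123778501626, 0.2564102564102564],
            [0.32166123778501626, 0.35522904062229904],
            [0.08509771986970684, 0.35522904062229904],
        ],
    }]
})

# Assumed page size in pixels; use the size of your rendered page instead.
for entry in summarize(SAMPLE_OUTPUT, width=1000, height=1400):
    print(entry)
```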

Code Usage

from docbarcodes.extract import process_document

barcodes_raw, barcodes_combined = process_document("./SalarySinglePage.pdf")

print(barcodes_raw)
print(barcodes_combined)
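
For the salary statement example, the combined barcode content is an XML document, so it can be read with the standard library. The following sketch parses an abridged copy of the `content` string from the CLI output above. It assumes that you have obtained that string yourself, for instance from the `content` field of a combined barcode; whether the Python return value has exactly the same shape as the CLI JSON is not confirmed here. The function name `read_salary_fields` and the selection of fields are illustrative.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Default namespace declared on the root element of the sample document.
NS = {"sd": "http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB"}

# Abridged from the "content" value in the sample output.
SAMPLE_CONTENT = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<T xmlns="http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB"'
    ' SID="000" SysV="001">'
    '<PersonID Lastname="Aebi" Firstname="Anna">'
    "<SV-AS-Nr>123.4567.8901.28</SV-AS-Nr></PersonID>"
    "<A><DocID>1</DocID>"
    "<Period><from>2016-10-01</from><until>2016-11-30</until></Period>"
    "<Income>48118.70</Income><GrossIncome>68000.00</GrossIncome>"
    "<NetIncome>56343.00</NetIncome></A></T>"
)


def read_salary_fields(content):
    """Extract a few fields from a salary declaration XML string."""
    # Encode to bytes so the parser honors the declared UTF-8 encoding.
    root = ET.fromstring(content.encode("utf-8"))
    person = root.find("sd:PersonID", NS)
    statement = root.find("sd:A", NS)
    return {
        "name": f'{person.get("Firstname")} {person.get("Lastname")}',
        "period": (
            statement.findtext("sd:Period/sd:from", namespaces=NS),
            statement.findtext("sd:Period/sd:until", namespaces=NS),
        ),
        # Decimal keeps the amounts exact instead of using binary floats.
        "gross_income": Decimal(statement.findtext("sd:GrossIncome", namespaces=NS)),
        "net_income": Decimal(statement.findtext("sd:NetIncome", namespaces=NS)),
    }


print(read_salary_fields(SAMPLE_CONTENT))
```

Child elements inherit the default namespace and need the `sd:` prefix in lookups, while attributes such as `Lastname` are not namespaced and are read by their plain names.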

Development

Use uv for local development and testing:

uv sync --extra dev
uv run pytest

If jpype1 needs to build from source on your platform, install Apache Ant alongside Java before running uv sync.

FAQ

PDF rendering is handled by pypdfium2, which bundles its own PDFium binary — no external poppler installation is required.

OpenCV is pulled in via opencv-python-headless, so no system OpenGL libraries (libGL.so.1) are required on headless servers or containers.

Show package licenses

pip-licenses --with-urls --with-system --format=markdown

Improvements to be made:

  • Implement a multithreaded ZXing wrapper class in Java that returns objects suitable for consumption from Python
  • Add extension mechanisms for other ways of aggregating 2D barcodes
