Docbarcodes extracts 1D and 2D barcodes from scanned PDF documents or images.

Intro

Barcodes are used in many documents and forms to enable machine reading and reduce manual processing effort. Simple 1D barcodes can, for example, mark a specific page of a form or carry a document identification number. More complex 2D barcodes can encode the full data of the document in a structured form.

Docbarcodes extracts 1D and 2D barcodes from scanned PDF documents or images. It can be used to automate the extraction and processing of all kinds of documents.

Document types with barcodes that are known to work include:

  • Swiss tax statements (Zurich, other cantons can be added as well)
  • Swiss salary statement
  • US driver's licenses with MRZ (machine readable zone)
  • Swiss QR Code invoices, introduced by SIX Group
  • Swiss Covid Certificates

The approach works as follows:

  1. Detect barcode regions on the document using OpenCV and image transformation heuristics, see https://github.com/pyxploiter/Barcode-Detection-and-Decoding
  2. Extract the raw barcode data using ZXing.
  3. Combine multiple barcodes and decode the data.
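
Step 3 can be illustrated with a minimal sketch. It is not the package's own implementation. It assumes that the raw segments have already been read as bytes in the right order, and that the joined payload is a short header followed by a ZIP archive, which is what the `PK` signature and the `txab` file name in the raw salary statement output suggest. The function name `decode_zip_payload` is hypothetical.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = b"PK\x03\x04"  # signature that starts a ZIP local file entry


def decode_zip_payload(segments):
    """Join raw barcode segments and return {member name: bytes} of the embedded ZIP.

    Assumption: `segments` is a list of bytes objects already sorted in
    reading order; anything before the first ZIP signature is treated as
    a format-specific header and skipped.
    """
    payload = b"".join(segments)
    start = payload.find(ZIP_LOCAL_HEADER)
    if start == -1:
        raise ValueError("no ZIP signature found in barcode payload")
    with zipfile.ZipFile(io.BytesIO(payload[start:])) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```

Other document types may use a different container or no compression at all, so a real decoder would need one such routine per format.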

Quick start

Required:

  • Java 8
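
Because the package depends on Java, a quick preflight check can save a confusing error later. The sketch below only tests whether a `java` executable is on the PATH; it does not verify the version, and the helper name `java_on_path` is made up for this example.

```python
import shutil


def java_on_path() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_on_path():
        print("java found")
    else:
        print("java not found; install Java 8 first")
```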

Install package

pip install docbarcodes

Download the example PDF document

wget https://github.com/ArlindNocaj/document-barcodes/raw/master/data/salary_swissdec/SalarySinglePage.pdf 

CLI usage

Extract the barcodes from the salary statement document and pretty-print the resulting JSON with jq.

docbarcodes ./SalarySinglePage.pdf | jq .
{
  "BarcodesRaw": [
    {
      "page": 0,
      "num_candidate": 2,
      "raw": "´cý¸z\u0000\u0002V\u0001\u0003\u0000\u0001\u0000\u0001PK\u0003\u0004\u0014\u0000\b\b\b\u0000év\u0003Q\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000txabM’Ýnâ0\u0010…_Åò}b;¡!¬\u001cWá¯EbYÔÐjÕ;\u0013Ü\u00105±+Û,ð>û&ûb;\tÐö\"Îñ™\u0019Ïç‘ùý©mÐ\u001fe]mt†YH1Rº4»ZW\u0019~Þ̃\u0014ß\u000b¾A¦]†÷Þ\u007fü äx<†îX;·SeXî‰+÷ª•ÄíHD#J£ˆ’B6Ҟ§ª„Ÿôpú攏1*\u0016Ó\fS\nmг{é$ÂOLû!õ\u0019=/¦Áx^dxò8\u000bX\u0014‡×\u000f£5 v„kyh$Z)ç\u001be1z|\nž&ÁJ¶\nj~ý\\/g¿Ñ¬QïÞ\u001a]¿£ü\u0001£×Å:ÃqÒµœ,3œo½ª›ƒ®PáÕAYPJ\u0003·Jù\fÏêJÙ£ªP\u0002=ó[s‚±D\u0003(®ý9Ûý\u0001²×{£¡#c\u0014Å)\u001aŒÐa\"ø…r1EKé¼î©rµ1š×öÓÐZÞ¨èê\u000b`ùﯮ*éœó\u0016V…¢ï$É\b\n–¦”MOs£\u001a+«Ñ\bvæ ½í\u0002‚\u0017/A^\u0004++º\u0019\u000eî’a˜Ž(\u000b£”“Ï\u0010'7bÁsÁ§¦\u0004Å8¹ˆî:µÙ\tþfM+\"ʒ€Ñ€B¼78ôª›«Ï‚˜rrqúSûÂ\u0005<¥V‰AÊX\u001a\u000e!ájð\u0007kœ»n’\u0014Æ\u0010Rˆ~wùJù«¼KâAÜ'|yœ\u0000/وÿPK\u0007\bK¦nJÎ\u0001\u0000\u0000Á\u0002\u0000\u0000PK\u0001\u0002\u0014\u0000\u0014\u0000\b\b\b\u0000év\u0003QK¦nJÎ\u0001\u0000\u0000Á\u0002\u0000\u0000\u0004\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000txabPK\u0005\u0006\u0000\u0000\u0000\u0000\u0001\u0000\u0001\u00002\u0000\u0000\u0000\u0000\u0002\u0000\u0000\u0000\u0000",
      "format": "PDF_417",
      "points": [
        [
          0.08509771986970684,
          0.2564102564102564
        ],
        [
          0.32166123778501626,
          0.2564102564102564
        ],
        [
          0.32166123778501626,
          0.35522904062229904
        ],
        [
          0.08509771986970684,
          0.35522904062229904
        ]
      ],
      "resultMetadata": {
        "ERROR_CORRECTION_LEVEL": "2",
        "PDF417_EXTRA_METADATA": {
          "addressee": "None",
          "checksum": -1,
          "fileId": "None",
          "fileName": "None",
          "fileSize": -1,
          "optionalData": "None",
          "segmentCount": -1,
          "segmentIndex": 0,
          "sender": "None",
          "timestamp": -1
        }
      }
    }
  ],
  "BarcodesCombined": [
    {
      "content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?><T xmlns=\"http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB\" SID=\"000\" SysV=\"001\"><Company UID-BFS=\"CHE-123.123.123\" Person=\"Paula Nestler\" HR-RC-Name=\"COMPLEX Elektronik AG\" ZIP=\"3600\" CL=\"Abteilung Steuerungen\" Street=\"Eigerweg 6\" Postbox=\"124\" City=\"Thun\" Phone=\"033 238 49 71\"/><PersonID Lastname=\"Aebi\" Firstname=\"Anna\" ZIP=\"3000\" CL=\"\" Street=\"Länggassstrasse 26\" Postbox=\"690\" Locality=\"\" City=\"Bern 9\" Country=\"\"><SV-AS-Nr>123.4567.8901.28</SV-AS-Nr></PersonID><A><DocID>1</DocID><Period><from>2016-10-01</from><until>2016-11-30</until></Period><Income>48118.70</Income><GrossIncome>68000.00</GrossIncome><NetIncome>56343.00</NetIncome></A></T>",
      "format": "PDF_417",
      "sources": [
        0
      ]
    }
  ]
}
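
The combined `content` of the salary statement is an XML document, so it can be post-processed with the standard library. The following sketch reads a few fields from JSON shaped like the output above. The helper name `summarize_salary_statement` is hypothetical, and the namespace and element names are taken from this one sample, so other document types will need different paths.

```python
import json
import xml.etree.ElementTree as ET
from decimal import Decimal

# Namespace copied from the sample output; other document types will differ.
NS = {"sd": "http://www.swissdec.ch/schema/sd/20200220/SalaryDeclarationTxAB"}


def summarize_salary_statement(cli_json: str) -> dict:
    """Pull a few fields out of the first combined barcode of a salary statement."""
    result = json.loads(cli_json)
    xml_text = result["BarcodesCombined"][0]["content"]
    # Encode to bytes so the parser handles the XML encoding declaration itself.
    root = ET.fromstring(xml_text.encode("utf-8"))
    person = root.find("sd:PersonID", NS)
    return {
        "employee": f'{person.get("Firstname")} {person.get("Lastname")}',
        "period": (
            root.findtext("sd:A/sd:Period/sd:from", namespaces=NS),
            root.findtext("sd:A/sd:Period/sd:until", namespaces=NS),
        ),
        # Decimal avoids float rounding on monetary amounts.
        "net_income": Decimal(root.findtext("sd:A/sd:NetIncome", namespaces=NS)),
    }
```

One way to use it is to redirect the CLI output to a file (`docbarcodes ./SalarySinglePage.pdf > result.json`) and pass the file's text to the function.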

Code Usage

from docbarcodes.extract import process_document

barcodes_raw, barcodes_combined = process_document("./SalarySinglePage.pdf")

print(barcodes_raw)
print(barcodes_combined)
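
The `points` of each raw barcode in the sample output all lie between 0 and 1, which suggests corner coordinates normalized by page width and height. Under that assumption, and assuming the Python results mirror the CLI JSON structure, a bounding box in pixels for a rendered page could be computed as sketched below. The helper `points_to_pixel_box` is not part of the package.

```python
def points_to_pixel_box(points, page_width, page_height):
    """Convert normalized corner points to a (left, top, right, bottom) pixel box.

    Assumption: each point is an [x, y] pair expressed as a fraction of the
    page width and height, with the origin in the top-left corner.
    """
    xs = [x * page_width for x, _ in points]
    ys = [y * page_height for _, y in points]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))
```

Such a box could then be used to crop the barcode region from a page image for visual inspection.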

FAQ

On Windows only: if installing the package dependencies fails, I recommend using conda to install the Java bridge (jpype1) and poppler:

conda install -y -c conda-forge jpype1=1.3.0
conda install -c conda-forge poppler=21

Show package licenses

pip-licenses --with-urls --with-system --format=markdown

Improvements to be made:

  • Implement a multithreaded ZXing wrapper class in Java that returns objects suitable for consumption from Python
  • Add extension mechanisms for aggregating other 2D barcode types
