Colibrie

Colibrie is a blazing fast tool to extract tables from PDFs

Why Colibrie?

  • :rocket: Efficient: Colibrie is faster than existing solutions by multiple orders of magnitude
  • :sparkles: Visual fidelity: Colibrie can provide a 1:1 HTML representation of every table it finds
  • :books: Reliable: Colibrie finds every valid table, without exception, as long as the PDF is compatible with Colibrie's core principle
  • :memo: Output: each table can be exported to multiple formats, including:
    • Pandas DataFrame.
    • HTML.

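Benchmark

Some numbers comparing Camelot (a popular library for extracting tables from PDFs) with Colibrie:

| PDF file   | Pages | Time in seconds (Camelot) | Time in seconds (Colibrie) | Camelot: valid tables | Camelot: false positives | Colibrie: valid tables | Colibrie: false positives |
|------------|-------|---------------------------|----------------------------|-----------------------|--------------------------|------------------------|---------------------------|
| small pdf  | 1     | 0.53                      | 0.00545                    | 1                     | 0                        | 1                      | 0                         |
| medium pdf | 11    | 5.95                      | 0.02100                    | 4                     | 0                        | 4                      | 0                         |
| big pdf    | 167   | 105.00                    | 0.21900                    | 62                    | 1                        | 62                     | 0                         |
| giant pdf  | 269   | 182.00                    | 0.69000                    | 175                   | 1                        | 177                    | 0                         |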
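To put these timings in perspective, the speedup for each file is simply Camelot's time divided by Colibrie's. A minimal sketch using only the figures from the benchmark above (the variable names are illustrative and not part of Colibrie):

```python
# Timings in seconds, copied from the benchmark: (camelot, colibrie).
timings = {
    "small pdf": (0.53, 0.00545),
    "medium pdf": (5.95, 0.02100),
    "big pdf": (105.00, 0.21900),
    "giant pdf": (182.00, 0.69000),
}

# Speedup = Camelot time / Colibrie time.
speedups = {
    name: camelot / colibrie
    for name, (camelot, colibrie) in timings.items()
}

for name, ratio in speedups.items():
    print(f"{name}: about {ratio:.0f}x faster")
```

On these four files the ratio ranges from roughly 97x to 479x, which is about two orders of magnitude.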

Installation

using source

pip install poetry

git clone https://github.com/abitoun-42/colibrie.git

cd colibrie

poetry install

using pip

pip install colibrie

Usage

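PDF used in the example: example.pdf

from colibrie.extract_tables import extract_table

# Extract every table found in the PDF
tables = extract_table('example.pdf')

for table in tables:
    # 1:1 HTML representation of the table
    print(table.to_html())
    # The same table as a pandas DataFrame
    df = table.to_df()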
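Building on the usage example, extracted tables can be written to disk in both output formats. The sketch below is a hypothetical helper, not part of Colibrie: `export_tables` and the file naming scheme are assumptions, and it also assumes that `to_html()` returns a string and that `to_df()` returns a pandas DataFrame.

```python
from pathlib import Path


def export_tables(tables, out_dir):
    """Write each table to <out_dir>/table_<n>.html and table_<n>.csv.

    `tables` is any iterable of objects exposing `to_html()` and `to_df()`,
    as in the usage example. This helper is illustrative only.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(tables, start=1):
        html_path = out / f"table_{index}.html"
        csv_path = out / f"table_{index}.csv"
        # Assumes to_html() returns the HTML markup as a string.
        html_path.write_text(table.to_html(), encoding="utf-8")
        # Assumes to_df() returns a pandas DataFrame, so to_csv is available.
        table.to_df().to_csv(csv_path, index=False)
        written.append((html_path, csv_path))
    return written


# Usage with Colibrie (requires the package and a PDF on disk):
# from colibrie.extract_tables import extract_table
# export_tables(extract_table("example.pdf"), "output")
```

Keeping the Colibrie import out of the helper means it works with any object that offers the same two methods, which makes it easy to try out without a PDF at hand.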

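Output:

Classification des associations agréées de surveillance de la qualité de l’air

| Catégorie | Échelon | Coefficient | Salaire minimal hiérarchique |
|-----------|---------|-------------|------------------------------|
| 7         | 1       | 255         | 1 307,13 €                   |
| 7         | 2       | 268         | 1 373,77 €                   |
| 7         | 3       | 282         | 1 445,53 €                   |
| 7         | 4       | 296         | 1 517,30 €                   |
| 7         | 5       | 311         | 1 594,19 €                   |
| 7         | 6       | 327         | 1 676,20 €                   |
| 7         | 7       | 344         | 1 763,34 €                   |
| 7         | 8       | 362         | 1 855,61 €                   |
| 7         | 9       | 381         | 1 953,01 €                   |
| 7         | 10      | 401         | 2 055,53 €                   |
| 7         | 11      | 422         | 2 163,17 €                   |
| 7         | 12      | 444         | 2 275,94 €                   |
| 6         | 1       | 310         | 1 589,06 €                   |
| 6         | 2       | 326         | 1 671,08 €                   |
| 6         | 3       | 344         | 1 763,34 €                   |
| 6         | 4       | 363         | 1 860,74 €                   |
| 6         | 5       | 384         | 1 968,38 €                   |
| 6         | 6       | 406         | 2 081,16 €                   |
| 6         | 7       | 430         | 2 204,18 €                   |
| 6         | 8       | 457         | 2 342,58 €                   |
| 6         | 9       | 485         | 2 486,11 €                   |
| 6         | 10      | 515         | 2 639,89 €                   |
| 6         | 11      | 549         | 2 814,17 €                   |
| 6         | 12      | 585         | 2 998,71 €                   |

Classification des bureaux d’études techniques, des cabinets d’ingénieurs-conseils et des sociétés de conseils

| Position | Coefficient | Salaire minimal hiérarchique |
|----------|-------------|------------------------------|
| ETAM     |             |                              |
| 1.1.     | 230         | 1 558,80 €                   |
| 1.2.     | 240         | 1 587,50 €                   |
| 1.3.     | 250         | 1 618,50 €                   |
| 2.1.     | 275         | 1 683,75 €                   |
| 2.2.     | 310         | 1 786,70 €                   |
| 2.3.     | 355         | 1 922,60 €                   |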
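The amounts in this output use French notation (space as thousands separator, comma as decimal separator, trailing € sign), so they stay plain text until converted. A minimal sketch of one way to turn them into numbers; `parse_euro` is a hypothetical helper and not part of Colibrie:

```python
from decimal import Decimal


def parse_euro(text):
    """Convert a French-formatted amount such as '1 307,13 €' to a Decimal."""
    cleaned = text.replace("€", "")
    # Drop every kind of whitespace, including the non-breaking spaces often
    # used as thousands separators; str.split() with no argument handles them.
    cleaned = "".join(cleaned.split())
    # French notation uses a comma as the decimal separator.
    return Decimal(cleaned.replace(",", "."))


print(parse_euro("1 307,13 €"))  # 1307.13
```

With a DataFrame obtained from `to_df()`, the same function could be applied to a salary column through `Series.map`; the column label to use depends on how the header was extracted, so it is left out here.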
