Colibrie is a blazing fast tool to extract tables from PDFs
Why Colibrie?
- Efficient: Colibrie is multiple orders of magnitude faster than existing solutions such as Camelot (roughly 100 to 500 times faster in the benchmark below).
- Visually faithful: Colibrie can provide a 1:1 HTML representation of every table it finds.
- Reliable: Colibrie finds every valid table, provided the PDF is compatible with Colibrie's core principle.
- Output: each table can be exported to multiple formats, including:
  - pandas DataFrame.
  - HTML.
Benchmark
Some numbers comparing Camelot (a popular library for extracting tables from PDFs) and Colibrie. Times are in seconds; "valid" and "false positive" count the tables extracted by each tool.
Camelot time (s) | Colibrie time (s) | Camelot valid | Camelot false positive | Colibrie valid | Colibrie false positive | Page count | PDF file |
---|---|---|---|---|---|---|---|
0.53 | 0.00545 | 1 | 0 | 1 | 0 | 1 | small pdf |
5.95 | 0.02100 | 4 | 0 | 4 | 0 | 11 | medium pdf |
105.00 | 0.21900 | 62 | 1 | 62 | 0 | 167 | big pdf |
182.00 | 0.69000 | 175 | 1 | 177 | 0 | 269 | giant pdf |
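To make the timing columns easier to compare, the short sketch below divides the Camelot time by the Colibrie time for each benchmark row. The numbers are copied from the table above; the names `BENCHMARK` and `speedups` are illustrative and not part of Colibrie.

```python
# Benchmark rows copied from the table above:
# (pdf file, page count, camelot seconds, colibrie seconds)
BENCHMARK = [
    ("small pdf", 1, 0.53, 0.00545),
    ("medium pdf", 11, 5.95, 0.02100),
    ("big pdf", 167, 105.00, 0.21900),
    ("giant pdf", 269, 182.00, 0.69000),
]


def speedups(rows):
    """Return {pdf file: camelot time / colibrie time}."""
    return {name: camelot / colibrie for name, _pages, camelot, colibrie in rows}


SPEEDUPS = speedups(BENCHMARK)

for name, ratio in SPEEDUPS.items():
    print(f"{name}: about {ratio:.0f}x faster")
```

On these four files the ratio ranges from about 97x (small pdf) to about 479x (big pdf).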
Current limitations
- Colibrie only works with text-based PDFs, not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
- For the moment, Colibrie doesn't work on PDFs whose tables have no structural lines (like this one or this one), but it can handle a few missing lines (like this one or this one).
Installation
From source
pip install poetry
git clone https://github.com/abitoun-42/colibrie.git
cd colibrie
poetry install
With pip
pip install colibrie
Usage
PDF used in the example: example.pdf
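from colibrie.extract_tables import extract_table

# Extract every table found in the PDF
tables = extract_table('example.pdf')
for table in tables:
    print(table.to_html())  # HTML representation of the table
    df = table.to_df()      # the same table as a pandas DataFrame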
Output (merged cells appear once, in the first row they span; the other rows leave them empty):
| Classifi cation des associations agréées de surveillance de la qualité de l’air | | | | Classifi cation des bureaux d’études techniques, des cabinets d’ingénieurs-conseils et des sociétés de conseils | | | |
|---|---|---|---|---|---|---|---|
| Catégorie | Échelon | Coeffi cient | Salaire minimal hiérarchique | Position | | Coeffi cient | Salaire minimal hiérarchique |
| 7 | 1 2 3 4 5 6 7 8 9 10 11 12 | 255 268 282 296 311 327 344 362 381 401 422 444 | 1 307,13 € 1 373,77 € 1 445,53 € 1 517,30 € 1 594,19 € 1 676,20 € 1 763,34 € 1 855,61 € 1 953,01 € 2 055,53 € 2 163,17 € 2 275,94 € | ETAM | 1.1. | 230 | 1 558,80 € |
| | | | | | 1.2. | 240 | 1 587,50 € |
| | | | | | 1.3. | 250 | 1 618,50 € |
| 6 | 1 2 3 4 5 6 7 8 9 10 11 12 | 310 326 344 363 384 406 430 457 485 515 549 585 | 1 589,06 € 1 671,08 € 1 763,34 € 1 860,74 € 1 968,38 € 2 081,16 € 2 204,18 € 2 342,58 € 2 486,11 € 2 639,89 € 2 814,17 € 2 998,71 € | | 2.1. | 275 | 1 683,75 € |
| | | | | | 2.2. | 310 | 1 786,70 € |
| | | | | | 2.3. | 355 | 1 922,60 € |
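Since each table can be rendered with `to_html()`, one way to keep the results is to write every table to its own HTML file. The sketch below is a minimal, hypothetical helper (`save_tables_as_html` is not part of Colibrie); it assumes only that each table object exposes `to_html()` returning a string, as the usage example suggests.

```python
from pathlib import Path


def save_tables_as_html(tables, out_dir="tables"):
    """Write each table's HTML to out_dir/table_<n>.html and return the paths.

    `tables` is any iterable of objects exposing `to_html()`; the return
    type of `to_html()` is assumed to be `str`.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, table in enumerate(tables, start=1):
        path = out / f"table_{index}.html"
        path.write_text(table.to_html(), encoding="utf-8")
        paths.append(path)
    return paths


# Typical use (requires colibrie and a text-based PDF):
#   from colibrie.extract_tables import extract_table
#   save_tables_as_html(extract_table("example.pdf"))
```

Numbering the files from 1 keeps them in the order the tables were returned.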
Download files
Source Distribution
colibrie-1.1.3.1.tar.gz (15.1 kB)
Built Distribution
colibrie-1.1.3.1-py3-none-any.whl (15.3 kB)
File details
Details for the file colibrie-1.1.3.1.tar.gz.
File metadata
- Download URL: colibrie-1.1.3.1.tar.gz
- Upload date:
- Size: 15.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.1 CPython/3.10.0 Darwin/21.6.0
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 8440a6d7a0a4792050e8214f4e8fd1935e1f0b325b5a931a3a66d00a86950c56 |
MD5 | 34a8f5418ac7264e737cdb7883e69a20 |
BLAKE2b-256 | 1c80b6171e81446cc32adc2f06875d11306ec0599a0f6af90b780c31dcd42013 |
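A downloaded archive can be compared against the SHA256 digest listed above. The following is a minimal sketch using only Python's standard `hashlib`; the helper names `sha256_of` and `matches_expected` are illustrative, and the local file path is whatever the archive was saved as.

```python
import hashlib

# SHA256 digest listed above for colibrie-1.1.3.1.tar.gz
EXPECTED_SHA256 = "8440a6d7a0a4792050e8214f4e8fd1935e1f0b325b5a931a3a66d00a86950c56"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


# Typical use, assuming the archive was saved in the current directory:
#   matches_expected("colibrie-1.1.3.1.tar.gz")
```

Reading in chunks keeps memory use constant regardless of file size.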
File details
Details for the file colibrie-1.1.3.1-py3-none-any.whl.
File metadata
- Download URL: colibrie-1.1.3.1-py3-none-any.whl
- Upload date:
- Size: 15.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.1 CPython/3.10.0 Darwin/21.6.0
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 8d6024b0bd5799913614d3895215e524ad8218c9f364c407074648bd520febf2 |
MD5 | 463ace9dffee93480acc2927227a986c |
BLAKE2b-256 | c470b2e0575ff4181dbac393d3826dfa384c2c7e72c237cc18619cea8cf97a92 |