Colibrie is a blazing fast tool to extract tables from PDFs
Colibrie
Why Colibrie?
- Efficient: Colibrie is orders of magnitude faster than existing solutions (roughly 100 to 500 times faster than Camelot in the benchmark below)
- Visually faithful: Colibrie can produce a 1:1 HTML representation of every table it finds
- Reliable: Colibrie will find every valid table, provided the PDF is compatible with Colibrie's core principle
- Output: each table can be exported to multiple formats, including:
  - Pandas DataFrame
  - HTML
Benchmark
Some numbers comparing Camelot (a popular library for extracting tables from PDFs) with Colibrie:
Camelot time (s) | Colibrie time (s) | Camelot valid tables | Camelot false positives | Colibrie valid tables | Colibrie false positives | Page count | PDF file |
---|---|---|---|---|---|---|---|
0.53 | 0.00545 | 1 | 0 | 1 | 0 | 1 | small pdf |
5.95 | 0.02100 | 4 | 0 | 4 | 0 | 11 | medium pdf |
105.00 | 0.21900 | 62 | 1 | 61 | 0 | 167 | big pdf |
182.00 | 0.69000 | 175 | 1 | 177 | 0 | 269 | giant pdf |
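To make the timing columns easier to read, the short sketch below computes the speed-up factor for each benchmark file. The timings are copied from the table; the `speedup` helper and the dictionary layout are illustrative names, not part of Colibrie.

```python
# Timings copied from the benchmark table, in seconds: (camelot, colibrie).
timings = {
    "small pdf": (0.53, 0.00545),
    "medium pdf": (5.95, 0.02100),
    "big pdf": (105.00, 0.21900),
    "giant pdf": (182.00, 0.69000),
}


def speedup(camelot_s: float, colibrie_s: float) -> float:
    """Return how many times faster Colibrie was than Camelot."""
    return camelot_s / colibrie_s


# Hypothetical helper structure: one speed-up factor per benchmark file.
speedups = {name: speedup(*pair) for name, pair in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster")
```

On these four files the factor ranges from a little under 100x (small pdf) to a little under 500x (big pdf).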
Current limitations
- Colibrie only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
- For the moment, Colibrie doesn't work on PDFs whose tables have no structural lines (like this one or this one), but it can handle a few missing lines (like this one or this one)
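The first limitation can be checked up front. The sketch below is one possible heuristic, not something Colibrie itself provides: it treats a PDF as text-based when at least one page has a non-trivial text layer. The function names and the 20-character threshold are assumptions made for illustration, and the text layer is read with pypdf, a separate library that is assumed to be installed.

```python
from typing import Iterable, List


def is_text_based(page_texts: Iterable[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: True if any page yields at least `min_chars_per_page`
    non-whitespace characters. The threshold is an arbitrary assumption."""
    for text in page_texts:
        visible = "".join((text or "").split())
        if len(visible) >= min_chars_per_page:
            return True
    return False


def read_page_texts(pdf_path: str) -> List[str]:
    """Return the text layer of each page, read with pypdf (assumed installed)."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Example (hypothetical file name):
# if not is_text_based(read_page_texts("example.pdf")):
#     print("Looks like a scanned PDF; Colibrie is unlikely to find tables.")
```

A scanned document that has been through OCR also carries a text layer, so this check can only rule out PDFs with no selectable text at all.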
Installation
using source
pip install poetry
git clone https://github.com/abitoun-42/colibrie.git
cd colibrie
poetry install
using pip
pip install colibrie
Usage
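PDF used in this example: example.pdf
from colibrie.extract_tables import extract_table
# Extract every table found in the PDF
tables = extract_table('example.pdf')
for table in tables:
    # HTML representation of the table
    print(table.to_html())
    # The same table as a pandas DataFrame
    df = table.to_df()
Output: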
Classifi cation des associations agréées de surveillance de la qualité de l’air | Classifi cation des bureaux d’études techniques, des cabinets d’ingénieurs-conseils et des sociétés de conseils | ||||||
Catégorie | Échelon | Coeffi cient | Salaire minimal hiérarchique | Position | Coeffi cient | Salaire minimal hiérarchique | |
7 | 1 2 3 4 5 6 7 8 9 10 11 12 | 255 268 282 296 311 327 344 362 381 401 422 444 | 1 307,13 € 1 373,77 € 1 445,53 € 1 517,30 € 1 594,19 € 1 676,20 € 1 763,34 € 1 855,61 € 1 953,01 € 2 055,53 € 2 163,17 € 2 275,94 € | ETAM | 1.1. | 230 | 1 558,80 € |
1.2. | 240 | 1 587,50 € | |||||
1.3. | 250 | 1 618,50 € | |||||
6 | 1 2 3 4 5 6 7 8 9 10 11 12 | 310 326 344 363 384 406 430 457 485 515 549 585 | 1 589,06 € 1 671,08 € 1 763,34 € 1 860,74 € 1 968,38 € 2 081,16 € 2 204,18 € 2 342,58 € 2 486,11 € 2 639,89 € 2 814,17 € 2 998,71 € | 2.1. | 275 | 1 683,75 € | |
2.2. | 310 | 1 786,70 € | |||||
2.3. | 355 | 1 922,60 € |
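Building on the usage example, the sketch below saves every extracted table to disk in both output formats. It relies only on the `to_html()` and `to_df()` calls shown above and on pandas' `DataFrame.to_csv`; the `export_tables` helper and the file naming scheme are hypothetical, and it assumes `to_html()` returns a string.

```python
from pathlib import Path
from typing import Iterable, List


def export_tables(tables: Iterable, out_dir: str) -> List[Path]:
    """Write each table to <out_dir>/table_<n>.html and table_<n>.csv.

    Assumes each table exposes to_html() -> str and to_df() -> pandas DataFrame.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, table in enumerate(tables, start=1):
        html_path = target / f"table_{index}.html"
        html_path.write_text(table.to_html(), encoding="utf-8")

        csv_path = target / f"table_{index}.csv"
        table.to_df().to_csv(csv_path, index=False)

        written.extend([html_path, csv_path])
    return written


# Example with Colibrie, using the call from the usage section:
# from colibrie.extract_tables import extract_table
# export_tables(extract_table("example.pdf"), "out")
```

Merged cells in the HTML output have no direct CSV equivalent, so the HTML file is the one to keep when the visual layout matters.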