Colibrie is a blazing-fast tool for extracting tables from PDFs
Colibrie
Why Colibrie?
- :rocket: Efficient: Colibrie is orders of magnitude faster than existing solutions such as Camelot (roughly 100 to 500 times faster in the benchmark below)
- :sparkles: Visually faithful: Colibrie can produce a 1:1 HTML representation of every table it finds
- :books: Reliable: Colibrie finds every valid table, provided the PDF is compatible with Colibrie's core principle
- :memo: Output: each table can be exported to multiple formats, including:
  - Pandas DataFrame.
  - HTML.
Benchmark
Some numbers comparing Camelot (a popular library for extracting tables from PDFs) with Colibrie:
Time in seconds (Camelot) | Time in seconds (Colibrie) | Valid tables (Camelot) | False positives (Camelot) | Valid tables (Colibrie) | False positives (Colibrie) | Page count | PDF file |
---|---|---|---|---|---|---|---|
0.53 | 0.00545 | 1 | 0 | 1 | 0 | 1 | small pdf |
5.95 | 0.02100 | 4 | 0 | 4 | 0 | 11 | medium pdf |
105.00 | 0.21900 | 62 | 1 | 62 | 0 | 167 | big pdf |
182.00 | 0.69000 | 175 | 1 | 177 | 0 | 269 | giant pdf |
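To make the comparison easier to read, the timings can be turned into speedup ratios. The snippet below is only a small arithmetic sketch: the numbers are copied from the benchmark table, while the `timings` dictionary and the `speedup` helper are names introduced here for illustration and are not part of Colibrie.

```python
# (Camelot seconds, Colibrie seconds) per PDF, copied from the benchmark table.
timings = {
    "small pdf": (0.53, 0.00545),
    "medium pdf": (5.95, 0.02100),
    "big pdf": (105.00, 0.21900),
    "giant pdf": (182.00, 0.69000),
}


def speedup(camelot_s, colibrie_s):
    """Return how many times faster Colibrie ran than Camelot."""
    return camelot_s / colibrie_s


# Hypothetical helper output: one ratio per benchmarked PDF.
speedups = {name: speedup(*pair) for name, pair in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: about {ratio:.0f}x faster")
# small pdf: about 97x faster
# medium pdf: about 283x faster
# big pdf: about 479x faster
# giant pdf: about 264x faster
```

On these four files the ratio stays between roughly 100x and 500x, so the gap does not shrink as the page count grows.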
Installation
From source
pip install poetry
git clone https://github.com/abitoun-42/colibrie.git
cd colibrie
poetry install
With pip
pip install colibrie
Usage
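PDF used in the example: example.pdf
from colibrie.extract_tables import extract_table

# Extract every table found in the PDF
tables = extract_table('example.pdf')

for table in tables:
    print(table.to_html())  # HTML representation of the table
    df = table.to_df()      # the same table as a pandas DataFrame
Output: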
Classifi cation des associations agréées de surveillance de la qualité de l’air | Classifi cation des bureaux d’études techniques, des cabinets d’ingénieurs-conseils et des sociétés de conseils | ||||||
Catégorie | Échelon | Coeffi cient | Salaire minimal hiérarchique | Position | Coeffi cient | Salaire minimal hiérarchique | |
7 | 1 2 3 4 5 6 7 8 9 10 11 12 | 255 268 282 296 311 327 344 362 381 401 422 444 | 1 307,13 € 1 373,77 € 1 445,53 € 1 517,30 € 1 594,19 € 1 676,20 € 1 763,34 € 1 855,61 € 1 953,01 € 2 055,53 € 2 163,17 € 2 275,94 € | ETAM | 1.1. | 230 | 1 558,80 € |
1.2. | 240 | 1 587,50 € | |||||
1.3. | 250 | 1 618,50 € | |||||
6 | 1 2 3 4 5 6 7 8 9 10 11 12 | 310 326 344 363 384 406 430 457 485 515 549 585 | 1 589,06 € 1 671,08 € 1 763,34 € 1 860,74 € 1 968,38 € 2 081,16 € 2 204,18 € 2 342,58 € 2 486,11 € 2 639,89 € 2 814,17 € 2 998,71 € | 2.1. | 275 | 1 683,75 € | |
2.2. | 310 | 1 786,70 € | |||||
2.3. | 355 | 1 922,60 € |
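Since each table exposes `to_html()` and `to_df()`, saving the results to disk takes only a few more lines. The helper below is a minimal sketch rather than part of Colibrie: `export_tables` and the `table_<n>` file names are invented for illustration, and it assumes `to_html()` returns a string and `to_df()` returns a pandas DataFrame (so that `DataFrame.to_csv` is available).

```python
from pathlib import Path


def export_tables(tables, out_dir):
    """Write each table to out_dir as table_<n>.html and table_<n>.csv.

    Hypothetical helper: assumes every item has to_html() returning a
    string and to_df() returning a pandas DataFrame.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(tables, start=1):
        html_path = out / f"table_{index}.html"
        csv_path = out / f"table_{index}.csv"
        # Save the HTML representation as-is.
        html_path.write_text(table.to_html(), encoding="utf-8")
        # DataFrame.to_csv writes the tabular data; index=False drops the row index.
        table.to_df().to_csv(csv_path, index=False)
        written.append((html_path, csv_path))
    return written
```

With Colibrie installed, it could be called as `export_tables(extract_table('example.pdf'), 'output')`, where `output` is an arbitrary directory name.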