Colibrie
Colibrie is a blazing fast tool to extract tables from PDFs
Why Colibrie?
- Efficient: Colibrie is multiple orders of magnitude faster than existing solutions (roughly 100 to 480 times faster than Camelot in the benchmark below).
- Visual fidelity: Colibrie can provide a 1:1 HTML representation of every table it finds.
- Reliable: Colibrie finds every valid table, without exception, as long as the PDF is compatible with Colibrie's core principle.
- Output: each table can be exported to multiple formats, including:
  - pandas DataFrame
  - HTML
Benchmark
Some numbers comparing Camelot (a popular library for extracting tables from PDFs) and Colibrie:
| Camelot time (s) | Colibrie time (s) | Camelot valid tables | Camelot false positives | Colibrie valid tables | Colibrie false positives | Page count | PDF file |
|---|---|---|---|---|---|---|---|
| 0.53 | 0.00545 | 1 | 0 | 1 | 0 | 1 | small pdf |
| 5.95 | 0.02100 | 4 | 0 | 4 | 0 | 11 | medium pdf |
| 105.00 | 0.21900 | 62 | 1 | 61 | 0 | 167 | big pdf |
| 182.00 | 0.69000 | 175 | 1 | 177 | 0 | 269 | giant pdf |
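The speed difference in the benchmark can be read off as a simple ratio of the two timing columns. The sketch below only redoes that arithmetic on the figures copied from the table; the names `BENCHMARK` and `speedups` are illustrative and not part of Colibrie.

```python
# (PDF label, Camelot seconds, Colibrie seconds), copied from the benchmark table.
BENCHMARK = [
    ("small pdf", 0.53, 0.00545),
    ("medium pdf", 5.95, 0.02100),
    ("big pdf", 105.00, 0.21900),
    ("giant pdf", 182.00, 0.69000),
]


def speedups(rows):
    """Return {label: camelot_time / colibrie_time} for each benchmark row."""
    return {name: camelot / colibrie for name, camelot, colibrie in rows}


for name, ratio in speedups(BENCHMARK).items():
    print(f"{name}: about {ratio:.0f}x faster")
```

On these four files the ratio ranges from roughly 97x (small pdf) to roughly 479x (big pdf).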
Current limitations
- Colibrie only works with text-based PDFs, not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
- For the moment, Colibrie doesn't work on PDFs whose tables have no structural lines (like this one or this one), but it can handle a few missing lines (like this one or this one).
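The first limitation can be checked programmatically before running an extraction. The following is a minimal sketch, not part of Colibrie: `has_text_layer` and its `min_chars` threshold are hypothetical, and reading the page text with PyMuPDF (`fitz`) is an assumption about which PDF reader is available; any library that returns per-page text would do.

```python
def has_text_layer(page_texts, min_chars=20):
    """Heuristic: treat the PDF as text-based if at least one page yields
    min_chars or more non-whitespace-trimmed characters of extractable text.
    The threshold is an arbitrary illustrative value."""
    return any(len(text.strip()) >= min_chars for text in page_texts)


def page_texts_with_pymupdf(path):
    """Collect the extractable text of each page.
    Assumes PyMuPDF is installed; it is imported lazily so the heuristic
    above stays usable without it."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]
```

A scanned document typically yields empty strings for every page, so `has_text_layer(page_texts_with_pymupdf("example.pdf"))` would return `False` for it under this heuristic.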
Installation
From source
pip install poetry
git clone https://github.com/abitoun-42/colibrie.git
cd colibrie
poetry install
With pip
pip install colibrie
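After either install route, one way to confirm that the package is visible to the current interpreter is to ask the standard library for its installed version. This is a generic sketch using `importlib.metadata` (Python 3.8+); the helper name `installed_version` is illustrative.

```python
from importlib import metadata


def installed_version(package="colibrie"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version())
```

A result of `None` means the package is not installed in the environment that is running the script, which is a common symptom of installing into a different virtual environment.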
Usage
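PDF used in this example: example.pdf
from colibrie.extract_tables import extract_table
# Extract every table found in the PDF
tables = extract_table('example.pdf')
for table in tables:
    # 1:1 HTML representation of the table
    print(table.to_html())
    # The same table as a pandas DataFrame
    df = table.to_df()
Output: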
| Classification des associations agréées de surveillance de la qualité de l’air | | | | Classification des bureaux d’études techniques, des cabinets d’ingénieurs-conseils et des sociétés de conseils | | | |
|---|---|---|---|---|---|---|---|
| Catégorie | Échelon | Coefficient | Salaire minimal hiérarchique | Position | | Coefficient | Salaire minimal hiérarchique |
| 7 | 1 2 3 4 5 6 7 8 9 10 11 12 | 255 268 282 296 311 327 344 362 381 401 422 444 | 1 307,13 € 1 373,77 € 1 445,53 € 1 517,30 € 1 594,19 € 1 676,20 € 1 763,34 € 1 855,61 € 1 953,01 € 2 055,53 € 2 163,17 € 2 275,94 € | ETAM | 1.1. | 230 | 1 558,80 € |
| | | | | | 1.2. | 240 | 1 587,50 € |
| | | | | | 1.3. | 250 | 1 618,50 € |
| 6 | 1 2 3 4 5 6 7 8 9 10 11 12 | 310 326 344 363 384 406 430 457 485 515 549 585 | 1 589,06 € 1 671,08 € 1 763,34 € 1 860,74 € 1 968,38 € 2 081,16 € 2 204,18 € 2 342,58 € 2 486,11 € 2 639,89 € 2 814,17 € 2 998,71 € | | 2.1. | 275 | 1 683,75 € |
| | | | | | 2.2. | 310 | 1 786,70 € |
| | | | | | 2.3. | 355 | 1 922,60 € |
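To keep the results instead of printing them, each table can be written to disk in both export formats. The sketch below relies only on the `to_html()` and `to_df()` methods shown in the usage example; it assumes `to_html()` returns a string and `to_df()` returns a pandas DataFrame. The helper `export_tables` and the `table_N` file naming are illustrative, not part of Colibrie.

```python
from pathlib import Path


def export_tables(tables, out_dir):
    """Write every table as table_N.html and table_N.csv inside out_dir.

    Returns a list of (html_path, csv_path) pairs, one per table.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(tables, start=1):
        html_path = out / f"table_{index}.html"
        csv_path = out / f"table_{index}.csv"
        # Assumption: to_html() returns the HTML markup as a str.
        html_path.write_text(table.to_html(), encoding="utf-8")
        # Assumption: to_df() returns a pandas DataFrame.
        table.to_df().to_csv(csv_path, index=False)
        written.append((html_path, csv_path))
    return written
```

With the usage example above, this would be called as `export_tables(extract_table('example.pdf'), 'out')`.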