A tool to extract tables from documents using fuzzy matching
Fuzzy table extractor
Introduction
This project aims to help with data extraction from unstructured sources such as Word and PDF files and web documents.
The library has two main components: the file handler, which identifies the tables in a document and returns them in a standardized form, and the extractor, which searches the document's tables and returns the one closest to the search terms, as measured by a fuzzy string comparison algorithm.
Currently, only a handler for Docx files is available; support for other sources is planned.
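To make the idea of "closest table" concrete, here is a minimal sketch of how a table could be selected by fuzzy header similarity. It is not the library's implementation: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever comparison algorithm the library uses, and the names `header_score` and `closest_table`, as well as the sample tables, are hypothetical.

```python
from difflib import SequenceMatcher


def header_score(search_headers, table_headers):
    """Average, over the search headers, of the best similarity to any table header."""
    if not search_headers or not table_headers:
        return 0.0
    total = 0.0
    for wanted in search_headers:
        # Best similarity (0.0 to 1.0) between this search header and any header in the table
        total += max(
            SequenceMatcher(None, wanted.lower(), found.lower()).ratio()
            for found in table_headers
        )
    return total / len(search_headers)


def closest_table(search_headers, tables):
    """Return the table (a list of rows, header row first) with the highest score."""
    return max(tables, key=lambda rows: header_score(search_headers, rows[0]))


# Illustrative data only: two tables as a handler might return them
tables = [
    [["city", "area", "population"], ["Example city", "430.9", "1000"]],
    [["id", "name", "age"], ["0", "Paul", "25"], ["1", "John", "32"]],
]
best = closest_table(["id", "name", "age"], tables)
print(best[0])  # ['id', 'name', 'age']
```

Averaging the best per-header match is only one possible scoring rule; the point is that the table is chosen by similarity rather than by exact equality.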
Installation
The library is available on PyPI:
pip install fuzzy-table-extractor
Using the library
Extracting tables
Using the library takes two steps: first create a handler for the file, then use that handler to create an instance of Extractor, which provides the data extraction methods.
Here is an example of table extraction for a very simple document:
from pathlib import Path
from fuzzy_table_extractor.handlers.docx_handler import DocxHandler
from fuzzy_table_extractor.extractor import Extractor
file_path = Path(r"path_to_document.docx")
handler = DocxHandler(file_path)
extractor = Extractor(handler)
df = extractor.extract_closest_table(search_headers=["id", "name", "age"])
print(df)
For a document that looks like this:
The output is:
   id  name  age
0   0  Paul   25
1   1  John   32
Because the closest table is selected by fuzzy matching, the search headers do not need to match a table header in the document exactly: the extraction returns the right table as long as it is the closest to the search. This also makes the extraction resilient to typos. For example, using the same code as above, but now for a document like this:
The output is:
   id  name  age
0   0  Paul   25
1   1  John   32
2   2   Bob   56
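The typo resilience described above can be illustrated with a small sketch that maps each search header to the most similar header found in a document. This is an illustration under stated assumptions, not the library's code: it uses `difflib.get_close_matches` from the standard library, and the misspelled headers `nme` and `agge` are hypothetical examples.

```python
from difflib import get_close_matches

# Hypothetical headers as they might appear in a document containing typos
document_headers = ["id", "nme", "agge"]
search_headers = ["id", "name", "age"]

mapping = {}
for wanted in search_headers:
    # Keep only the single best candidate with a similarity of at least 0.6
    matches = get_close_matches(wanted, document_headers, n=1, cutoff=0.6)
    mapping[wanted] = matches[0] if matches else None

print(mapping)  # {'id': 'id', 'name': 'nme', 'age': 'agge'}
```

Each search header still finds its misspelled counterpart, which is why a table with imperfect headers can remain the closest match.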
Extracting single field
It is also possible to extract a single field (cell) from a document. Here is an example of how to do this with the library:
from pathlib import Path
from fuzzy_table_extractor.handlers.docx_handler import DocxHandler
from fuzzy_table_extractor.extractor import Extractor, FieldOrientation
file_path = Path(r"path_to_document.docx")
handler = DocxHandler(file_path)
extractor = Extractor(handler)
area = extractor.extract_single_field(field="area", orientation=FieldOrientation.ROW)
print(area)
For a document like this:
The output is:
430.9 km2
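As a rough picture of what a single-field lookup might involve, the sketch below finds the cell whose text is most similar to the field name and returns a neighbouring cell. Everything here is an assumption made for illustration: the `Orientation` enum is a hypothetical stand-in for the library's `FieldOrientation`, the reading of ROW as "value to the right of the label" is a guess, and `find_field` and the sample grid are not part of the library.

```python
from difflib import SequenceMatcher
from enum import Enum


class Orientation(Enum):  # hypothetical stand-in, not the library's FieldOrientation
    ROW = "row"        # assumption: the value sits to the right of the label
    COLUMN = "column"  # assumption: the value sits below the label


def find_field(grid, field, orientation):
    """Return the cell next to the label most similar to `field`, or None."""
    best_score, best_pos = -1.0, None
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            score = SequenceMatcher(None, field.lower(), cell.lower()).ratio()
            if score > best_score:
                best_score, best_pos = score, (r, c)
    if best_pos is None:
        return None  # empty grid
    r, c = best_pos
    # Step to the neighbouring cell according to the assumed orientation
    r2, c2 = (r, c + 1) if orientation is Orientation.ROW else (r + 1, c)
    if r2 < len(grid) and c2 < len(grid[r2]):
        return grid[r2][c2]
    return None  # the label sits on the edge of the table


# Illustrative table: labels in the first column, values in the second
grid = [
    ["Name", "Example city"],
    ["Area", "430.9 km2"],
]
print(find_field(grid, "area", Orientation.ROW))  # 430.9 km2
```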
The file examples.py contains other examples of how to use the library.
TODO
- Add a guide to the README on how to contribute to the project
- Expand test coverage
- Create a handler for PDF files