Extract the information represented in any HTML table as database-like records

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 5 - Production/Stable
Intended Audience
- Developers
Natural Language
- English
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Tablextract

This Python 3 library extracts the information represented in any HTML table. It was developed in the context of the paper "TOMATE: On extracting information from HTML tables".

Some of the main features of this library are:

Context location: Context information is detected, both inside and outside the table.
Cell role detection: Classification of cells into headers and data based on their style, syntax, structure and semantics.
Layout detection: Automatic identification of horizontal listings, vertical listings, matrices and enumerations.
Record extraction: Identified tables are extracted as a list of dictionaries, each one being a database record.
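To make the idea of record extraction concrete, here is a minimal sketch of how a grid of cell texts can become a list of dictionaries once the position of the headers is known. It is not the library's algorithm: the helper names records_from_header_row and records_from_header_column are hypothetical, the header position is hard-coded rather than detected, and the toy grids simply reuse values from the usage example further down.

```python
def records_from_header_row(grid):
    """Treat the first row as headers; every later row becomes one record."""
    headers, *rows = grid
    return [dict(zip(headers, row)) for row in rows]


def records_from_header_column(grid):
    """Treat the first column as headers: transpose, then reuse the row case."""
    transposed = [list(column) for column in zip(*grid)]
    return records_from_header_row(transposed)


# Toy grids (assumed layouts, for illustration only).
header_on_top = [
    ['Confederacy', 'Chief'],
    ['Kubuna', 'Vacant'],
]
header_on_left = [
    ['English', 'Hello/hi', 'Goodbye'],
    ['Fijian', 'bula', 'moce (Pronounced Mothe)'],
]

row_records = records_from_header_row(header_on_top)
column_records = records_from_header_column(header_on_left)
print(row_records)
print(column_records)
```

Both calls yield the same shape of result, a list with one dictionary per record, which is why the orientation of the table has to be identified before the records can be built.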

Some features that will be added soon:

Cell correction: Analysis of the orientation of the table to fix wrongly labelled cells.
Totals detection: Detect totalling cells automatically from the data.

How to install

You can install this library via pip using: pip install tablextract

Usage example

>>> from pprint import pprint
>>> from tablextract import tables
>>>
>>> ts = tables('https://en.wikipedia.org/wiki/Fiji')
>>> ts
[
    Table(url=https://en.wikipedia.org/wiki/Fiji, xpath=.../div[4]/div[1]/table[2]),
    Table(url=https://en.wikipedia.org/wiki/Fiji, xpath=.../div[4]/div[1]/table[3]),
    Table(url=https://en.wikipedia.org/wiki/Fiji, xpath=.../div[4]/div[1]/table[4])
]
>>> ts[0].record
[
    {'Confederacy': 'Burebasaga', 'Chief': 'Ro Teimumu Vuikaba Kepa'},
    {'Confederacy': 'Kubuna', 'Chief': 'Vacant'},
    {'Confederacy': 'Tovata', 'Chief': 'Ratu Naiqama Tawake Lalabalavu'}
]
>>> ts[2].record  # it automatically identifies that it's laid out vertically
[
    {
        'English': 'Hello/hi',
        'Fijian': 'bula',
        'Fiji Hindi': 'नमस्ते (namaste)'
    }, {
        'English': 'Good morning',
        'Fijian': 'yadra (Pronounced Yandra)',
        'Fiji Hindi': 'सुप्रभात (suprabhat)'
    }, {
        'English': 'Goodbye',
        'Fijian': 'moce (Pronounced Mothe)',
        'Fiji Hindi': 'अलविदा (alavidā)'
    }
]
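Because each record is a plain dictionary, the result can be handed to standard tooling. The following sketch writes records to CSV text with the standard library only. The helper name records_to_csv is hypothetical, and it assumes that every record in the list has the same keys, as in the first table above.

```python
import csv
import io

# Records copied from the ts[0].record output shown above.
records = [
    {'Confederacy': 'Burebasaga', 'Chief': 'Ro Teimumu Vuikaba Kepa'},
    {'Confederacy': 'Kubuna', 'Chief': 'Vacant'},
    {'Confederacy': 'Tovata', 'Chief': 'Ratu Naiqama Tawake Lalabalavu'},
]


def records_to_csv(records):
    """Serialise a list of same-keyed dictionaries as CSV text."""
    if not records:
        return ''
    buffer = io.StringIO()
    # The column order follows the keys of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator='\n')
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
```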

This library has only one function, tables, which returns a list of Table objects.

tables(url, css_filter='table', xpath_filter=None, request_cache_time=None)

url: str: URL of the site where tables should be downloaded from.
css_filter: str: When specified, only tables that match the selector will be returned.
xpath_filter: str: When specified, only tables that match the XPATH selector will be returned.
request_cache_time: int: When specified, downloaded documents will be cached for that number of seconds.
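As a sketch of how the optional parameters combine, the wrapper below restricts extraction to tables matching a CSS selector and enables the request cache. The wrapper name fetch_wiki_tables, the selector 'table.wikitable' and the one-hour default are assumptions chosen for illustration; only the tables signature itself comes from the description above.

```python
def fetch_wiki_tables(url, cache_seconds=3600):
    """Download `url` and return only the tables matching a CSS selector."""
    # Imported lazily so that defining this helper does not require
    # tablextract to be loaded until the helper is actually called.
    from tablextract import tables

    return tables(
        url,
        css_filter='table.wikitable',      # assumed selector, not a library default
        request_cache_time=cache_seconds,  # seconds to keep the downloaded page cached
    )
```

Calling fetch_wiki_tables('https://en.wikipedia.org/wiki/Fiji') would then need network access and an installed tablextract. An xpath_filter could be passed in the same way when an XPath selector is more convenient.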

Each Table object has the following properties and methods:

cols(): int: Number of columns of the table.
rows(): int: Number of rows of the table.
cells(): int: Number of cells of the table (same as table.cols() * table.rows()).
error: str or None: If an error occurred during table extraction, this contains its stack trace. Otherwise, it is None.
url: str: URL of the page from where the table was extracted.
xpath: str: XPath of the table within the page.
element: bs4.element.Tag: BeautifulSoup element that represents the table.
elements: list of list of bs4.element.Tag: 2D table of BeautifulSoup elements that represent the table after cell segmentation.
texts: list of list of str: 2D table of strings that represent the text of each cell.
context: dict of {tuple, str}: Texts inside or outside the table that provide contextual information for it. The keys of the dictionary represent the context position.
features: list of list of dict of {str, float/str}: 2D table of feature vectors for each cell in the table.
functions: list of list of int: 2D table of functions of the cells of the table. Functions can be EMPTY (-1), DATA (0), or METADATA (1).
kind: str: Type of table extracted. Types can be 'horizontal listing', 'vertical listing', 'matrix', 'enumeration' or 'unknown'.
record: list of dict of {str, str}: Database-like records extracted from the table.
score: float: Estimation of how properly the table was extracted, between 0 and 1, where 1 is a perfect extraction.
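The error, kind, score and record properties can be combined to keep only the tables that were extracted cleanly. The sketch below is one possible filter, not part of the library: the helper name usable_records and the 0.5 threshold are assumptions, and the tables are SimpleNamespace stand-ins that carry the documented attribute names so that the example runs offline.

```python
from types import SimpleNamespace


def usable_records(table_list, min_score=0.5):
    """Collect the records of tables that were extracted without problems."""
    records = []
    for table in table_list:
        if table.error is not None:   # extraction failed; error holds the stack trace
            continue
        if table.kind == 'unknown':   # the layout could not be identified
            continue
        if table.score < min_score:   # extraction quality is estimated to be low
            continue
        records.extend(table.record)
    return records


# Stand-ins for Table objects (illustrative values only).
fake_tables = [
    SimpleNamespace(error=None, kind='horizontal listing', score=0.9,
                    record=[{'Confederacy': 'Kubuna', 'Chief': 'Vacant'}]),
    SimpleNamespace(error='Traceback ...', kind='unknown', score=0.0, record=[]),
    SimpleNamespace(error=None, kind='matrix', score=0.2,
                    record=[{'Header': 'Value'}]),
]

kept = usable_records(fake_tables)
print(kept)
```

With real data, the list returned by tables(url) would be passed in place of fake_tables. A suitable min_score depends on how much noise the downstream use can tolerate.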

Notes

If you update this library and you get the error sre_constants.error: bad escape \p at position 257, you might be using a corrupted environment. You can either:

Try to fix your current environment by forcing the download of the spaCy models: python3 -m spacy download en
Create a new environment to work with: python3 -m venv my_new_env, then source my_new_env/bin/activate

Changes

v1.2

Released on Mar 25, 2019.

Named entity detection is not performed during feature extraction stage.
Table cleaning bug: tables with last row empty are not extracted.
Removed Wikipedia-specific selector constraint.
The previous and next non-inline tags with text relative to the table are extracted as context.
The hierarchy of header tags h1-h6 is extracted as context.
More tables are extracted on the location stage.
Repeated headers and hierarchic headers are clearer.

v1.1

Released on Feb 05, 2019.

Orientation is automatically detected to fix some table cell functions.
New features are extracted from the cells: POS tagging densities, relative column and row indices, first-char-type and last-char-type.
Hierarchical, factorised, and some periodical headers are segmented properly before the extraction.
Instead of discarding tables with tables inside and then discarding tables smaller than 2x2, it first removes the small tables and then discards tables with tables inside, in order to get more results.
Texts and images are extracted before discarding repeated cells, to avoid discarding rows with changing images.
Cache is disabled by default.
Readme documentation improved.

v1.0

Released on Jan 24, 2019.

Before using Selenium, geckodriver is automatically downloaded for Linux, Windows and Mac OS.
The Firefox process is closed automatically when the process ends.
Geckodriver quit is called instead of close.
Side projects have been moved from this core project to tablextract-server and datamart.
Fixed project imports and setup.
More readable Table objects.

v0.0.

Released on Jan 22, 2019.

Initial package upload.
Removed side projects to tablextractserver and datamart

Release history

1.4.5

Mar 9, 2020

1.4.4

Mar 9, 2020

1.4.3

Mar 9, 2020

1.4.2

Feb 26, 2020

1.4.1

Feb 26, 2020

1.3.1

Jun 21, 2019

1.3.0

Jun 21, 2019

1.2.6

Apr 23, 2019

1.2.5

Apr 2, 2019

1.2.4

Apr 2, 2019

1.2.3

Mar 27, 2019

This version

1.2.2

Mar 26, 2019

1.2.1

Mar 26, 2019

1.2.0

Mar 25, 2019

1.1.10

Feb 28, 2019

1.1.9

Feb 19, 2019

1.1.8

Feb 16, 2019

1.1.7

Feb 15, 2019

1.1.6

Feb 14, 2019

1.1.5

Feb 14, 2019

1.1.4

Feb 14, 2019

1.1.3

Feb 14, 2019

1.1.2

Feb 14, 2019

1.1.1

Feb 14, 2019

1.1.0

Feb 14, 2019

1.0.18

Jan 26, 2019

1.0.17

Jan 26, 2019

1.0.16

Jan 26, 2019

1.0.15

Jan 26, 2019

1.0.13

Jan 25, 2019

1.0.11

Jan 25, 2019

1.0.10

Jan 25, 2019

1.0.9

Jan 25, 2019

1.0.8

Jan 25, 2019

1.0.7

Jan 25, 2019

1.0.6

Jan 25, 2019

1.0.5

Jan 25, 2019

1.0.3

Jan 25, 2019

1.0.2

Jan 25, 2019

1.0.1

Jan 25, 2019

1.0.0

Jan 25, 2019

0.0.1

Jan 24, 2019

Download files

Source Distribution

tablextract-1.2.2.tar.gz (17.0 kB view details)

Uploaded Mar 26, 2019 Source

Built Distribution

tablextract-1.2.2-py3-none-any.whl (24.8 kB view details)

Uploaded Mar 26, 2019 Python 3

File details

Details for the file tablextract-1.2.2.tar.gz.

File metadata

Download URL: tablextract-1.2.2.tar.gz
Upload date: Mar 26, 2019
Size: 17.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.7

File hashes

Hashes for tablextract-1.2.2.tar.gz
Algorithm	Hash digest
SHA256	`08158e2dbbf4cafd37c30c59ae560d21ec4a322957898f76d800ce8646d289d6`
MD5	`bed86e7299a87743715e5250adced150`
BLAKE2b-256	`602e0d20744f6507c64b62697551a72914b1cf495cf4846730fa2ce1b2dee706`

File details

Details for the file tablextract-1.2.2-py3-none-any.whl.

File metadata

Download URL: tablextract-1.2.2-py3-none-any.whl
Upload date: Mar 26, 2019
Size: 24.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.7

File hashes

Hashes for tablextract-1.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`944a6280dad62d7a3f6cbdcf3e26d4ba0b07943c495cad7570d2ea1f5dfca8ba`
MD5	`173c1a20757dde5c13a1c891ef28d30d`
BLAKE2b-256	`96bcca4857796dd024e8aa4df01781805cb2401628f644466edaa05de5cd667f`

tablextract 1.2.2
