
Extract the information represented in any HTML table


Tablextract

This Python 3 library extracts the information represented in any HTML table. This project has been developed in the context of the paper TOMATE: On extracting information from HTML tables.

Some of the main features of this library are:

  • Context location: Context information is detected, both inside and outside the table.
  • Cell role detection: Classification of cells into headers and data based on their style, syntax, structure and semantics.
  • Layout detection: Automatic identification of horizontal listings, vertical listings, matrices and enumerations.
  • Record extraction: Identified tables are extracted as a list of dictionaries, each one being a database record.
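
To make the layout and record extraction features more concrete, here is a toy sketch of how a grid of cells could be turned into records once its layout is known. It is not the library's implementation: the function name `records_from_grid` and the layout labels are hypothetical, and it assumes that a vertical listing keeps its headers in the first column while a horizontal listing keeps them in the first row.

```python
def records_from_grid(grid, layout):
    """Turn a rectangular grid of cell texts into a list of dictionaries.

    Hypothetical helper for illustration only, not part of tablextract.
    Assumes 'horizontal' means headers in the first row and
    'vertical' means headers in the first column.
    """
    if layout == 'vertical':
        # Transpose so that the header column becomes the header row.
        grid = [list(column) for column in zip(*grid)]
    elif layout != 'horizontal':
        raise ValueError('unsupported layout: %r' % (layout,))
    header, *rows = grid
    return [dict(zip(header, row)) for row in rows]


vertical_grid = [
    ['English', 'Hello/hi', 'Goodbye'],
    ['Fijian', 'bula', 'moce'],
]
records = records_from_grid(vertical_grid, 'vertical')
```

Each resulting dictionary maps a header to the value of one record, which mirrors the list-of-dictionaries shape described above.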

Some features that will be added soon:

  • Cell correction: Analysis of the orientation of the table to fix wrongly labelled cells.
  • Totals detection: Detect totalling cells automatically from the data.

How to install

You can install this library via pip using: pip install tablextract

Usage example

>>> from pprint import pprint
>>> from tablextract import tables
>>>
>>> ts = tables('https://en.wikipedia.org/wiki/Fiji')
>>> ts
[
    Table(url=https://en.wikipedia.org/wiki/Fiji, xpath=.../div[4]/div[1]/table[2]),
    Table(url=https://en.wikipedia.org/wiki/Fiji, xpath=.../div[4]/div[1]/table[3]),
    Table(url=https://en.wikipedia.org/wiki/Fiji, xpath=.../div[4]/div[1]/table[4])
]
>>> ts[0].record
[
    {'Confederacy': 'Burebasaga', 'Chief': 'Ro Teimumu Vuikaba Kepa'},
    {'Confederacy': 'Kubuna', 'Chief': 'Vacant'},
    {'Confederacy': 'Tovata', 'Chief': 'Ratu Naiqama Tawake Lalabalavu'}
]
>>> ts[2].record  # it automatically identifies that it's laid out vertically
[
    {
        'English': 'Hello/hi',
        'Fijian': 'bula',
        'Fiji Hindi': 'नमस्ते (namaste)'
    }, {
        'English': 'Good morning',
        'Fijian': 'yadra (Pronounced Yandra)',
        'Fiji Hindi': 'सुप्रभात (suprabhat)'
    }, {
        'English': 'Goodbye',
        'Fijian': 'moce (Pronounced Mothe)',
        'Fiji Hindi': 'अलविदा (alavidā)'
    }
]
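
Because each record is a plain dictionary, the output can be handed to standard tooling. The following is a minimal sketch, assuming the records have the shape shown above; `records_to_csv` is a hypothetical helper, not part of tablextract, and the sample data is copied from the first table of the example.

```python
import csv
import io


def records_to_csv(records):
    """Serialise a list of record dictionaries as CSV text.

    Hypothetical helper for illustration, not part of tablextract.
    """
    # Collect column names in order of first appearance.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


sample = [
    {'Confederacy': 'Burebasaga', 'Chief': 'Ro Teimumu Vuikaba Kepa'},
    {'Confederacy': 'Kubuna', 'Chief': 'Vacant'},
]
csv_text = records_to_csv(sample)
```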

This library has only one function, tables, which returns a list of Table objects.

tables(url, css_filter='table', xpath_filter=None, request_cache_time=None)

  • url: str: URL of the page from which tables are downloaded.
  • css_filter: str: When specified, only tables that match the CSS selector will be returned.
  • xpath_filter: str: When specified, only tables that match the XPath selector will be returned.
  • request_cache_time: int: When specified, downloaded documents will be cached for that number of seconds.
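
As a sketch of how these parameters might be combined, the snippet below assembles the keyword arguments for a call. `table_query` is a hypothetical helper, and the `wikitable` class is an assumption about the markup of the target page. The actual call is left in comments because it needs tablextract installed and network access.

```python
def table_query(url, css_class=None, xpath=None, cache_seconds=None):
    """Build keyword arguments for tablextract.tables.

    Hypothetical helper for illustration; the keyword names follow the
    signature documented above.
    """
    kwargs = {'url': url}
    if css_class is not None:
        # Restrict the default 'table' selector to a single CSS class.
        kwargs['css_filter'] = 'table.' + css_class
    if xpath is not None:
        kwargs['xpath_filter'] = xpath
    if cache_seconds is not None:
        kwargs['request_cache_time'] = int(cache_seconds)
    return kwargs


query = table_query(
    'https://en.wikipedia.org/wiki/Fiji',
    css_class='wikitable',  # assumed class name on the target page
    cache_seconds=3600,     # keep downloaded documents for one hour
)

# With tablextract installed and network access available:
# from tablextract import tables
# ts = tables(**query)
```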

Each Table object has the following properties:

# TODO

Changes

v1.1.*

Released on Feb 05, 2019.

  • Orientation is automatically detected to fix some table cell functions.
  • New features are extracted from the cells: POS tagging densities, relative column and row indices, first-char-type and last-char-type.
  • Hierarchical, factorised, and some periodical headers are segmented properly before the extraction.
  • Tables smaller than 2x2 are now removed first, and tables that contain other tables are discarded afterwards (previously the order was reversed), which yields more results.
  • Texts and images are extracted before discarding repeated cells, to avoid discarding rows with changing images.
  • Cache is disabled by default.
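
The effect of reordering the two table filters can be seen in a toy model. This is only a sketch under stated assumptions, not the library's code: the `Node` class and all function names are hypothetical, and "smaller than 2x2" is taken to mean fewer than two rows or fewer than two columns.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an HTML table and its nested tables."""
    name: str
    rows: int
    cols: int
    children: list = field(default_factory=list)


def is_small(table):
    # Assumption: "smaller than 2x2" means under two rows or two columns.
    return table.rows < 2 or table.cols < 2


def flatten(tables):
    for table in tables:
        yield table
        yield from flatten(table.children)


def old_order(roots):
    # First discard tables with tables inside, then discard small tables.
    kept = [t for t in flatten(roots) if not t.children]
    return [t.name for t in kept if not is_small(t)]


def prune_small(tables):
    # Rebuild the tree without the small tables.
    return [
        Node(t.name, t.rows, t.cols, prune_small(t.children))
        for t in tables if not is_small(t)
    ]


def new_order(roots):
    # First remove small tables, then discard tables with tables inside.
    return [t.name for t in flatten(prune_small(roots)) if not t.children]


page = [Node('outer', 5, 3, [Node('inner', 1, 1)]), Node('plain', 4, 2)]
before = old_order(page)
after = new_order(page)
```

In this model the table named outer is lost under the old order because it contains a tiny nested table, while under the new order the nested table is removed first and outer is kept.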

v1.0.*

Released on Jan 24, 2019.

  • Before using Selenium, geckodriver is automatically downloaded for Linux, Windows and Mac OS.
  • The Firefox process is closed automatically when the process ends.
  • Geckodriver quit is called instead of close.
  • Side projects have been moved from this core project to tablextract-server and datamart.
  • Fixed project imports and setup.
  • More readable Table objects.

v0.0.*

Released on Jan 22, 2019.

  • Initial package upload.
  • Moved side projects to tablextractserver and datamart.

