Skip to main content

A python library for extracting data from html table

Project description

HTML Table Extractor

Build Status

HTML Table Extractor is a python library that uses Beautiful Soup to extract data from complicated and messy html table

Important links

Installation

pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor

Usage

Example 1 - Simple

12
34
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2'], [u'3', u'4']]

Example 2 - Transformer

12
34
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()

It will print out:

[[1, 2], [3, 4]]

Example 3 - Pass BS4 Tag

12
34
from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2'], [u'3', u'4']]

Example 4 - Complex

1 2 3
4
5
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]

Example 5 - Conflicted

1 2 3
4
5
from html_table_extractor.extractor import Extractor
table_doc = """
<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]

Example 6 - Write to file

12
34
from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')

It will write to a given path and create a new csv file called output.csv:

1,2
3,4

Team

Errors/ Bugs

If something is not working correctly, or if you have any suggestion on improvements, report it here

Copyright

Copyright (c) 2017 Justin Li. Released under the MIT License

Third-party copyright in this distribution is noted where applicable.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Built Distribution

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Huawei Huawei PSF Sponsor Microsoft Microsoft PSF Sponsor NVIDIA NVIDIA PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page