Skip to main content

A python library for extracting data from html table

Project description

HTML Table Extractor

Build Status

HTML Table Extractor is a python library that uses Beautiful Soup to extract data from complicated and messy html table

Important links

Installation

pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor

Usage

Example 1 - Simple

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2'], [u'3', u'4']]

Example 2 - Transformer

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()

It will print out:

[[1, 2], [3, 4]]

Example 3 - Pass BS4 Tag

from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2'], [u'3', u'4']]

Example 4 - Complex

from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]

Example 5 - Conflicted

from html_table_extractor.extractor import Extractor
table_doc = """
<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It will print out:

[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]

Example 6 - Write to file

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')

It will write to a given path and create a new csv file called output.csv:

1,2
3,4

Team

Errors/ Bugs

If something is not working correctly, or if you have any suggestion on improvements, report it here

Copyright

Copyright (c) 2017 Justin Li. Released under the MIT License

Third-party copyright in this distribution is noted where applicable.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for html-table-extractor, version 1.4.1
Filename, size File type Python version Upload date Hashes
Filename, size html_table_extractor-1.4.1-py2.py3-none-any.whl (4.8 kB) File type Wheel Python version py2.py3 Upload date Hashes View

Supported by

Pingdom Pingdom Monitoring Google Google Object Storage and Download Analytics Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page