A Python library for extracting data from HTML tables

HTML Table Extractor

Note: This is a re-release of yuanxu-li's html-table-extractor. It exists only because I waited too long for an upstream release that fixes the incorrect dependency (pipenv refuses to install a newer version of BeautifulSoup alongside the original version 1.4.0). I've kept changes to a minimum: adding this notice, fixing setup.py to make it PyPI friendly, and changing the PyPI package name.

HTML Table Extractor is a Python library that uses Beautiful Soup to extract data from complicated and messy HTML tables.

Important links

Repository: https://github.com/yuanxu-li/html-table-extractor
Issues: https://github.com/yuanxu-li/html-table-extractor/issues

Installation

pip install 'beautifulsoup4==4.5.3'
pip install isaacto-html-table-extractor

Usage

Example 1 - Simple

1	2
3	4

from html_table_extractor.extractor import Extractor

# A plain 2x2 table with no spanning cells.
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
print(extractor.return_list())

It will print out (Python 3; on Python 2 each string carries a u prefix, e.g. u'1'):

[['1', '2'], ['3', '4']]

Example 2 - Transformer

1	2
3	4

from html_table_extractor.extractor import Extractor

table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
# The transformer is applied to the text of every cell; here int turns '1' into 1.
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
print(extractor.return_list())

It will print out:

[[1, 2], [3, 4]]
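Real tables often mix numbers with padded or non-numeric cells, where plain int would raise ValueError. The following is a minimal sketch of a more forgiving transformer, assuming (as Example 2 suggests) that the transformer is called with the text of each cell. The name to_number is hypothetical and not part of the library; the sketch runs on its own, without the library installed.

```python
def to_number(text):
    """Convert cell text to int or float; fall back to the stripped string."""
    # Drop surrounding whitespace and thousands separators before converting.
    cleaned = text.strip().replace(",", "")
    for convert in (int, float):
        try:
            return convert(cleaned)
        except ValueError:
            pass
    return text.strip()


# Stand-alone demonstration on a few typical cell values.
cells = [" 42 ", "3.5", "1,200", "n/a"]
converted = [to_number(cell) for cell in cells]
print(converted)

# Hypothetical use with the library (assumes the package is installed):
# extractor = Extractor(table_doc, transformer=to_number)
```

Returning the stripped text for cells that are not numbers keeps the table rectangular instead of aborting on the first unexpected value.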

Example 3 - Pass BS4 Tag

1	2
3	4

from bs4 import BeautifulSoup
from html_table_extractor.extractor import Extractor

table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
# The document holds two tables; id_ selects the one whose id is 'wanted'.
extractor = Extractor(soup, id_='wanted')
extractor.parse()
print(extractor.return_list())

It will print out:

[['1', '2'], ['3', '4']]

Example 4 - Complex

1	2	3
	4
5

from html_table_extractor.extractor import Extractor

# Cells with rowspan/colspan are repeated in every position they cover.
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
print(extractor.return_list())

It will print out:

[['1', '2', '3'], ['1', '4', '4'], ['5', '5', '5']]

Example 5 - Conflicted

1	2	3
	4
5

from html_table_extractor.extractor import Extractor

# In the second row, the colspan=2 of cell 4 overlaps the rowspan=3 of cell 3.
table_doc = """
<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
print(extractor.return_list())

It will print out (the cell declared first, 3, keeps the contested position):

[['1', '2', '3'], ['1', '4', '3'], ['5', '5', '3']]
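The outputs of Examples 4 and 5 are consistent with a simple rule: every cell is copied into each grid position its spans cover, and a position that is already occupied is never overwritten. The sketch below illustrates that rule in plain Python. It is not the library's implementation; the function expand_spans and its (text, rowspan, colspan) input format are assumptions made for illustration.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid."""
    grid = {}  # maps (row, col) to the text occupying that position
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    # First writer wins: occupied positions are left untouched.
                    grid.setdefault((r + dr, c + dc), text)
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# The tables from Example 4 and Example 5, written as (text, rowspan, colspan).
complex_table = [[("1", 2, 1), ("2", 1, 1), ("3", 1, 1)], [("4", 1, 2)], [("5", 1, 3)]]
conflicted_table = [[("1", 2, 1), ("2", 1, 1), ("3", 3, 1)], [("4", 1, 2)], [("5", 1, 2)]]

complex_grid = expand_spans(complex_table)
conflicted_grid = expand_spans(conflicted_table)
print(complex_grid)     # [['1', '2', '3'], ['1', '4', '4'], ['5', '5', '5']]
print(conflicted_grid)  # [['1', '2', '3'], ['1', '4', '3'], ['5', '5', '3']]
```

Positions that no cell covers come back as None, which makes ragged input easy to spot.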

Example 6 - Write to file

1	2
3	4

from html_table_extractor.extractor import Extractor

table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
# parse() is chained here, so extractor is the parsed Extractor.
extractor = Extractor(table_doc).parse()
# path is the directory that will receive the CSV file.
extractor.write_to_csv(path='.')

It will create a new CSV file called output.csv in the given directory:

1,2
3,4
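If you need more control over the output than this example shows (a specific file name, for instance), one option is to write the rows yourself with the standard csv module. This is a minimal sketch that does not use the library's own writer; write_rows and my_table.csv are hypothetical names, and rows stands in for the value returned by return_list().

```python
import csv
import os
import tempfile


def write_rows(rows, path):
    """Write a list of row lists to a CSV file at the given path."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


rows = [["1", "2"], ["3", "4"]]  # stand-in for extractor.return_list()
target = os.path.join(tempfile.mkdtemp(), "my_table.csv")
write_rows(rows, target)

# Read the file back to confirm what was written.
with open(target, newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```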

Team

@yuanxu-li

Errors/Bugs

If something is not working correctly, or if you have any suggestions for improvement, report it at https://github.com/yuanxu-li/html-table-extractor/issues

Copyright

Third-party copyright in this distribution is noted where applicable.

Project details

This version: 1.4.0.1 (Dec 14, 2019)

Download files

Source Distribution: isaacto-html-table-extractor-1.4.0.1.tar.gz (4.0 kB), uploaded Dec 14, 2019
Built Distribution: isaacto_html_table_extractor-1.4.0.1-py3-none-any.whl (5.1 kB), uploaded Dec 14, 2019, Python 3

File details

Details for the file isaacto-html-table-extractor-1.4.0.1.tar.gz.

File metadata

Download URL: isaacto-html-table-extractor-1.4.0.1.tar.gz
Upload date: Dec 14, 2019
Size: 4.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.5.3

File hashes

Hashes for isaacto-html-table-extractor-1.4.0.1.tar.gz
Algorithm	Hash digest
SHA256	`d4bb3711118c2bee561b4e52ea3b0795eee9985894c826587d0e87dfd00dae5e`
MD5	`c99896daead890070e55bd2801a38122`
BLAKE2b-256	`6f7f067d4740f75b8146161c01d96bcc3b0fce888c9143abb6fd8d9eb9b9c17f`

File details

Details for the file isaacto_html_table_extractor-1.4.0.1-py3-none-any.whl.

File metadata

Download URL: isaacto_html_table_extractor-1.4.0.1-py3-none-any.whl
Upload date: Dec 14, 2019
Size: 5.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.5.3

File hashes

Hashes for isaacto_html_table_extractor-1.4.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1388a02354d7b15d27e495864e49c504dc28707ee0a9f1621c636502b9e1f890`
MD5	`d04ec3e6556f2f30a7b687f982d32c88`
BLAKE2b-256	`8725e64c1beb6ad194485adcd3447366816537797dbe331e89ac1868dcb1ddf6`
