

HTML Table Takeout

A fast, lightweight HTML table parser that supports rowspan, colspan, links and nested tables. No external dependencies are needed.

The input may be HTML text, a URL, or a local file Path.

HTML5 logo by W3C.

Quick Start

Install the package:

pip install html-table-takeout

Pass in a URL and print out the parsed Table as CSV:

from html_table_takeout import parse_html

# start with http:// or https:// to source from a URL
tables = parse_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')

print(tables[0].to_csv())

# output:
# Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
# MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,0000066740,1902
# ...
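
The comment in the example above says that a source starting with http:// or https:// is read from a URL, and the introduction lists text and local file Path as the other input kinds. The sketch below shows one way such a dispatch could look. It is an illustration only, not the library's actual code: the classify_source name and its return values are hypothetical.

```python
from pathlib import Path
from typing import Union


def classify_source(source: Union[str, Path]) -> str:
    """Guess how an input should be read (hypothetical helper, not library API)."""
    # A pathlib.Path is taken to mean a local file.
    if isinstance(source, Path):
        return "path"
    # Strings that start with an HTTP scheme are taken to mean a URL.
    if source.startswith(("http://", "https://")):
        return "url"
    # Anything else is assumed to be raw HTML text.
    return "text"


print(classify_source("https://example.com/page"))  # url
print(classify_source(Path("tables.html")))         # path
print(classify_source("<table></table>"))           # text
```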

Pass in HTML text and print out the parsed Table as valid HTML:

from html_table_takeout import parse_html

tables = parse_html("""
<table>
    <tr>
        <td rowspan='2'>1</td> <!-- rowspan will be expanded -->
        <td>2</td>
    </tr>
    <tr>
        <td>3</td>
    </tr>
</table>""")

print(tables[0].to_html(indent=4))

# output:
# <table data-table-id='0'>
# <tbody>
#     <tr>
#         <td>1</td>
#         <td>2</td>
#     </tr>
#     <tr>
#         <td>1</td>
#         <td>3</td>
#     </tr>
# </tbody>
# </table>
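
The output above shows the rowspan cell repeated in every row it covers. As a rough illustration of that kind of expansion, here is a minimal standard-library sketch. It is not the package's implementation: the expand_spans function and the (text, rowspan, colspan) tuple shape are assumptions made for this example, and overlapping spans are not handled.

```python
from typing import List, Optional, Tuple

# Assumed input shape for this sketch: each source cell is (text, rowspan, colspan).
Cell = Tuple[str, int, int]


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Copy each cell's text into every grid position its spans cover."""
    grid: List[List[Optional[str]]] = [[] for _ in rows]
    for r, row in enumerate(rows):
        col = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while col < len(grid[r]) and grid[r][col] is not None:
                col += 1
            for dr in range(rowspan):
                if r + dr >= len(grid):
                    break  # clip spans that run past the last row
                target = grid[r + dr]
                # Pad the target row so the covered columns exist.
                while len(target) < col + colspan:
                    target.append(None)
                for dc in range(colspan):
                    target[col + dc] = text
            col += colspan
    # Positions never covered by any cell become empty strings.
    return [[value if value is not None else "" for value in row] for row in grid]


# Same shape as the HTML example: "1" spans two rows.
print(expand_spans([[("1", 2, 1), ("2", 1, 1)], [("3", 1, 1)]]))
# [['1', '2'], ['1', '3']]
```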

Usage

The core parse_html() function returns a list of zero or more top-level Table objects. Each Table is guaranteed to have this structure:

  • rows: List of one or more TRow
    • cells: List of zero or more TCell resulting from rowspan and colspan expansion
      • elements: List of zero or more TText, TLink, TRef

Type    Description
Table   Each parsed table has an auto-assigned unique id
TRow    Equal to each <tr> in the original table
TCell   Expanded <td> or <th> cells from row/colspan
TText   HTML-decoded text inside <td> or <th>
TLink   Equal to each <a> inside <td> or <th>
TRef    Reference to the child Table

Every table is guaranteed to have at least one TRow containing at least one TCell.
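
To make the nesting concrete, the sketch below walks a Table, TRow, TCell hierarchy and collects link targets, following TRef into child tables. The classes here are stand-ins defined only for this example. The attribute names rows, cells and elements come from the structure above, while id, text, href and table are assumed names that may differ from the real classes.

```python
from dataclasses import dataclass, field
from typing import List, Union


# Stand-in classes for illustration only; field names other than
# rows, cells and elements are assumptions.
@dataclass
class TText:
    text: str


@dataclass
class TLink:
    href: str
    text: str = ""


@dataclass
class TRef:
    table: "Table"  # assumed: the reference holds the child Table itself


@dataclass
class TCell:
    elements: List[Union[TText, TLink, TRef]] = field(default_factory=list)


@dataclass
class TRow:
    cells: List[TCell] = field(default_factory=list)


@dataclass
class Table:
    id: int
    rows: List[TRow] = field(default_factory=list)


def collect_links(table: Table) -> List[str]:
    """Return every link target in a table, descending into nested tables."""
    found: List[str] = []
    for row in table.rows:
        for cell in row.cells:
            for element in cell.elements:
                if isinstance(element, TLink):
                    found.append(element.href)
                elif isinstance(element, TRef):
                    found.extend(collect_links(element.table))
    return found


inner = Table(id=1, rows=[TRow([TCell([TLink("https://example.com/b")])])])
outer = Table(id=0, rows=[TRow([TCell([
    TText("see"), TLink("https://example.com/a"), TRef(inner),
])])])
print(collect_links(outer))
# ['https://example.com/a', 'https://example.com/b']
```

The same three nested loops apply to any per-element processing; only the isinstance branches change.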

The parse_html() function also provides filtering by text or attributes to target the tables you want. Check out its docstring for all options.

Why did you make this?

Most HTML table parsers require extra DOM and data processing libraries that aren't needed for my application. I need a parser that handles nesting and gives me the flexibility to process the parsed result however I want.

Now you too can take out tables to go.

Developing

Install development dependencies:

pip install build mypy pytest

Run tests:

pytest

Build the package:

python -m build


