A lightweight tool for table extraction from HTML pages.

boxfish: lightweight table extraction from HTML

What is it?

Boxfish is a lightweight tool for table extraction from HTML pages.

Main features

  • Easy configuration. No knowledge of CSS selectors or XPath required.
  • Fast table extraction to CSV files.
  • Integration with Requests and Selenium.

Quick start

import boxfish as bf
import pandas as pd

# Define the table layout of a URL with strings taken from two of its rows.
# Fill in the placeholders below before running.
aurl = ""
row1 = ""
row2 = ""

# Build a configuration
aconfig = bf.build(url=aurl, rows=[row1, row2])

# Extract a table
data = bf.extract(aconfig, url=aurl)

# View results
df = pd.DataFrame(data)
df.head()
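
Since the feature list mentions extraction to CSV files, the following is a minimal sketch of writing extracted rows to CSV with the standard library. It assumes the extracted data is a list of dictionaries, one per table row; that shape is an assumption for illustration, and the `data` value and the `rows_to_csv` helper below are hypothetical stand-ins, not part of Boxfish.

```python
import csv
import io

# Hypothetical stand-in for extracted table rows; the real shape may differ.
data = [
    {"name": "Alice", "score": "42"},
    {"name": "Bob", "score": "17"},
]


def rows_to_csv(rows):
    """Serialize a list of dictionaries to CSV text.

    The header is taken from the keys of the first row.
    """
    if not rows:
        return ""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(data)
print(csv_text)
```

If the data is already in a pandas DataFrame, as in the quick start, `df.to_csv("table.csv", index=False)` would achieve the same result in one call.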

Where to get it?

Boxfish is available on PyPI and GitHub.

pip install boxfish
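
After installing, a quick way to confirm that the package can be found by the current Python interpreter is sketched below. The `is_installed` helper is a hypothetical name introduced for illustration; it relies only on the standard library.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be located by the import system."""
    return importlib.util.find_spec(package_name) is not None


# Prints True once `pip install boxfish` has succeeded in this environment.
print(is_installed("boxfish"))
```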

Dependencies

The main dependencies are:

  • BeautifulSoup4, a Python library for pulling data out of HTML and XML files.
  • lxml, a powerful and Pythonic XML processing library.
  • Requests, a simple, yet elegant, HTTP library.
  • Selenium, automated web browser interaction from Python.

License

Boxfish is available under the MIT license.

Limitations

Boxfish extracts text from HTML, so it can only find text that is actually present in the HTML. To check whether the HTML contains the text of interest, open the page in a browser and inspect the HTML in the developer tools (Ctrl+Shift+I).
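
The same check can be done programmatically. The sketch below uses only the standard library and is not how Boxfish itself works: it collects the text nodes of an HTML string, ignoring `<script>` and `<style>` contents, and tests whether a given string is among them. The names `TextCollector`, `html_contains_text`, and `sample_html` are hypothetical; for a live page, the HTML would first have to be downloaded, for example with Requests.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_contains_text(html, needle):
    """Return True if `needle` occurs in the text nodes of `html`."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return needle in " ".join(parser.chunks)


# Made-up sample page: "Alice" is in a table cell, "Bob" only inside a script.
sample_html = """
<html><body>
  <table>
    <tr><td>Alice</td><td>42</td></tr>
  </table>
  <script>var late = "Bob";</script>
</body></html>
"""

print(html_contains_text(sample_html, "Alice"))  # True
print(html_contains_text(sample_html, "Bob"))    # False
```

In this sample, "Bob" appears only inside a script, which mirrors the case where a value is visible in the browser but is not part of the HTML text.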

Documentation

Full documentation is available here.

Download files

Source Distribution

boxfish-0.1.2.tar.gz (27.6 kB)

Built Distribution

boxfish-0.1.2-py3-none-any.whl (30.7 kB, Python 3)
