A lightweight tool for table extraction from HTML pages.

Project description

boxfish: lightweight table extraction from HTML

What is it?

Boxfish is a lightweight tool for table extraction from HTML pages.

Main features

  • Easy configuration. No knowledge of CSS or XPath required.
  • Fast table extraction to CSV files.
  • Integration of requests and selenium.

Quick start

import boxfish as bf
import pandas as pd

# Define the table layout of a URL using text strings from two of its rows.
aurl = ""
row1 = ""
row2 = ""

# Build a configuration
aconfig = bf.build(url=aurl, rows=[row1, row2])

# Extract a table
data = bf.extract(aconfig, url=aurl)

# View results
df = pd.DataFrame(data)
df.head() 
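
The feature list mentions extraction to CSV files. A minimal sketch of saving the extracted data with the standard library is given here. It assumes, without confirmation from the boxfish documentation, that `data` is a list of dictionaries with one dictionary per table row. The helper name `save_rows_to_csv` is hypothetical and not part of boxfish.

```python
import csv


def save_rows_to_csv(rows, path):
    """Write a list of dicts to a CSV file (hypothetical helper, not part of boxfish)."""
    # Collect column names in first-seen order so rows with missing keys still fit.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

Under the same assumption, `save_rows_to_csv(data, "table.csv")` would write one line per extracted row. With the DataFrame from the quick start, `df.to_csv("table.csv", index=False)` is an equivalent pandas route.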

Where to get it?

Boxfish is available on PyPI and GitHub.

pip install boxfish

Dependencies

The main dependencies are:

  • BeautifulSoup4, a Python library for pulling data out of HTML and XML files.
  • lxml, a powerful and Pythonic XML processing library.
  • Requests, a simple, yet elegant, HTTP library.
  • Selenium, automated web browser interaction from Python.

License

Boxfish is available with an MIT license.

Limitations

Boxfish extracts text from HTML. To check whether the HTML contains the text of interest, open the page in a browser and inspect the HTML in the developer tools (Ctrl+Shift+I).
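
The same check can be scripted. The following is a rough sketch using only the Python standard library, not a boxfish feature: the class `_TextCollector` and the function `html_contains_text` are hypothetical names, and the function expects an HTML string that has already been fetched (for example with Requests).

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_contains_text(html, text):
    """Return True if `text` occurs in the text content of `html`."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    # Normalize runs of whitespace so line breaks in the markup do not matter.
    page_text = " ".join(" ".join(collector.parts).split())
    return " ".join(text.split()) in page_text
```

If the function returns False for a string that is visible in the browser, the text is probably added by JavaScript after loading. Because text nodes are joined with spaces, a phrase split across inline tags may not match.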

Documentation

Full documentation is available here.

Download files

Source Distribution

boxfish-0.1.2.tar.gz (27.6 kB view details)

Uploaded Source

Built Distribution

boxfish-0.1.2-py3-none-any.whl (30.7 kB view details)

Uploaded Python 3

File details

Details for the file boxfish-0.1.2.tar.gz.

File metadata

  • Download URL: boxfish-0.1.2.tar.gz
  • Upload date:
  • Size: 27.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.7

File hashes

Hashes for boxfish-0.1.2.tar.gz
Algorithm Hash digest
SHA256 623cd4d507f255e9299b80ae0a3ff8d8b52388245b86cab102c55b08968c1152
MD5 49a2e27fb32bb9060003509c9a620d93
BLAKE2b-256 e61edf51a537cca5bb1facda648e836f029ee923cfa743734cc6da6fa9cdc9d5
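
A downloaded file can be checked against the SHA256 digest listed above. This is a minimal sketch using the standard `hashlib` module. The function name `sha256_of_file` is hypothetical, and the local file path is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, the value of `sha256_of_file("boxfish-0.1.2.tar.gz")` should equal the SHA256 string in the table if the download is intact.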

File details

Details for the file boxfish-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: boxfish-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 30.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.7

File hashes

Hashes for boxfish-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 0a6958437343290d653f3bb07b595d55d2e4b75f4770b562f643806f8270618b
MD5 4079d37d1c28ee525f2bef1cfd1f5607
BLAKE2b-256 6da95ad8e613959e0cfa335dcefb6dbf810606eb2751d9f34bc3486975e2da25
