A lightweight tool for table extraction from HTML pages.

boxfish: lightweight table extraction from HTML

What is it?

Boxfish is a lightweight tool for table extraction from HTML pages.

Main features

  • Easy configuration. No knowledge of CSS selectors or XPath required.
  • Fast table extraction to CSV files.
  • Integrates with Requests and Selenium.

Quick start

import boxfish as bf
import pandas as pd

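# Describe the table layout of a URL with strings taken from two of its rows.
aurl = ''  # URL of the page that contains the table
row1 = ''  # a string that appears in one row of the table
row2 = ''  # a string that appears in another row of the table

# Build a configuration
aconfig = bf.build(url=aurl, astr=[row1, row2])

# Extract a table
data = bf.extract(aconfig, url=aurl)

# View results
df = pd.DataFrame(data)
df.head()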

Where to get it?

Boxfish is available on PyPI and GitHub.

pip install boxfish

Dependencies

The main dependencies are:

  • BeautifulSoup4, a Python library for pulling data out of HTML and XML files.
  • lxml, a powerful and Pythonic XML processing library.
  • Requests, a simple, yet elegant, HTTP library.
  • Selenium, automated web browser interaction from Python.

License

Boxfish is available under the MIT license.

Limitations

Boxfish extracts text from HTML, so it can only find content that is present in the page's HTML. To check whether the HTML contains the text of interest, open the page in a browser and inspect the HTML in the developer tools (Ctrl+Shift+I).
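
The same check can be done programmatically once the HTML is available as a string. The following is a minimal sketch that uses only the Python standard library. The helper names (_TextCollector, missing_strings) and the sample HTML are hypothetical and are not part of boxfish.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def missing_strings(html, strings):
    """Return the strings that do not occur in the text content of html."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)
    return [s for s in strings if s not in text]


# Hypothetical page: "Alice" is in a table cell, "Bob" exists only inside a script.
sample_html = (
    "<table><tr><td>Alice</td><td>30</td></tr></table>"
    "<script>var x = 'Bob';</script>"
)
print(missing_strings(sample_html, ["Alice", "Bob"]))  # ['Bob']
```

If a row string comes back as missing, the text is probably rendered by JavaScript after the page loads, and the HTML may need to be obtained through a browser instead.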

Documentation

Full documentation is available here.

Download files

Source Distribution

boxfish-0.0.2.tar.gz (3.5 kB view details)

Uploaded Source

Built Distribution

boxfish-0.0.2-py3-none-any.whl (2.9 kB view details)

Uploaded Python 3

File details

Details for the file boxfish-0.0.2.tar.gz.

File metadata

  • Download URL: boxfish-0.0.2.tar.gz
  • Upload date:
  • Size: 3.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/0.0.0 CPython/3.9.12

File hashes

Hashes for boxfish-0.0.2.tar.gz
Algorithm Hash digest
SHA256 8b2aeb4bb4bdb29f511984510a166e228e35347139649ac6e250501bedd6d242
MD5 61ad8df910e39b9b05957da80cbaaa3f
BLAKE2b-256 fa40c4bf5d61806c7b14edb14282a8a1582d60ad9ef08248e271112afef175d7
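
A downloaded file can be compared against the SHA256 digest listed above. The following is a minimal sketch using the standard hashlib module. The function names are hypothetical, and the demo hashes a throwaway temporary file rather than the real archive, so it runs without downloading anything.

```python
import hashlib
import os
import tempfile

# SHA256 digest published above for boxfish-0.0.2.tar.gz.
EXPECTED_SHA256 = "8b2aeb4bb4bdb29f511984510a166e228e35347139649ac6e250501bedd6d242"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals expected_hex."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demo on a temporary file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"boxfish")
    demo_path = tmp.name
demo_ok = matches_published_hash(demo_path, hashlib.sha256(b"boxfish").hexdigest())
os.remove(demo_path)
print(demo_ok)
```

For the real archive, the call would be matches_published_hash("boxfish-0.0.2.tar.gz", EXPECTED_SHA256), assuming the file has been downloaded to the working directory.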

File details

Details for the file boxfish-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: boxfish-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 2.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/0.0.0 CPython/3.9.12

File hashes

Hashes for boxfish-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 cb5afe103d3fc48258a2a7ce652db173ada42059ab31049639c47346b6bccc49
MD5 94ad2bfc3411f413e911ced3627eb282
BLAKE2b-256 ee065aaa070e29cb6be2962c37ab3293972fc55a5f171983724c33d6f64533bb
