Scrape HTML automatically

mlscraper lets you extract structured data from HTML automatically, without manually specifying nodes or CSS selectors. You train it with a few examples of the output you want. It then works out the extraction rules for you, and afterwards you can extract data from any new page you provide.

[Image: how mlscraper turns HTML into data objects]

Background Story

Many services for crawling and scraping automation allow you to select data in a browser and get JSON results in return. No need to specify CSS selectors or anything else.

I’ve long wondered why there’s no open-source solution that does something like this. So here’s my attempt at a Python library that enables automatic scraping.

All you have to do is define some examples of scraped data. mlscraper will figure out everything else and return clean data.

How it works

After you’ve defined the data you want to scrape, mlscraper will:

  • find your samples inside the HTML DOM

  • determine which rules/methods to apply for extraction

  • extract the data for you and return it in a dictionary
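
The three steps above can be illustrated with a toy sketch built only on Python's standard library. This is not mlscraper's implementation or API: the names TextCollector, learn_rules and extract are hypothetical, and the "rule" here is simply the tag and class of the element in which a sample value was found.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect (tag, class, text) triples for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, class) pairs
        self.nodes = []  # collected (tag, class, text) triples

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # close the innermost open element with this tag name, if any
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tag, cls = self.stack[-1]
            self.nodes.append((tag, cls, text))


def collect(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_rules(html, sample):
    """Steps 1 and 2: find each sample value and remember where it was found."""
    nodes = collect(html)
    rules = {}
    for key, value in sample.items():
        matches = [(tag, cls) for tag, cls, text in nodes if text == value]
        if not matches:
            raise ValueError(f"sample value {value!r} not found in page")
        rules[key] = matches[0]  # toy rule: (tag, class) of the first match
    return rules


def extract(html, rules):
    """Step 3: apply the learned rules to a new page and return a dictionary."""
    nodes = collect(html)
    return {
        key: next((text for t, c, text in nodes if (t, c) == rule), None)
        for key, rule in rules.items()
    }


# Hypothetical training page and sample, made up for this illustration
train_html = (
    '<div><h3 class="name">Albert Einstein</h3>'
    '<span class="born">March 14, 1879</span></div>'
)
rules = learn_rules(
    train_html, {"name": "Albert Einstein", "born": "March 14, 1879"}
)

# A new page with the same structure but different content
new_html = (
    '<div><h3 class="name">Jane Austen</h3>'
    '<span class="born">December 16, 1775</span></div>'
)
result = extract(new_html, rules)
print(result)
```

A real rule search has to cope with values that appear in several places, attributes instead of text, and pages whose markup differs slightly, which is the part mlscraper automates.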

Getting started

Install the latest version of mlscraper via pip install git+https://github.com/lorey/mlscraper#egg=mlscraper. Please note that until the 1.0 release, pip install mlscraper will install an outdated 0.* version. In both cases, you can then import the package as mlscraper. Check the tests for usage until detailed documentation arrives.
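
Because the two install commands can leave you with different release lines, it may help to check which one ended up in your environment. The following is a minimal sketch using the standard library's importlib.metadata (Python 3.8+); the helper names major_version and check_mlscraper are hypothetical, and the version parsing is deliberately naive.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '1.0.0rc1'."""
    return int(version_string.split(".")[0])


def check_mlscraper():
    """Describe which mlscraper release line, if any, is installed."""
    try:
        installed = version("mlscraper")
    except PackageNotFoundError:
        return "mlscraper is not installed"
    if major_version(installed) < 1:
        return f"mlscraper {installed} is an outdated 0.* release"
    return f"mlscraper {installed} is installed"


print(check_mlscraper())
```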

Development

See CONTRIBUTING.rst
