
Scrape HTML automatically with machine learning.

Project description

Turn HTML into structured data automatically with machine learning using mlscraper


mlscraper allows you to extract structured data from HTML automatically with Machine Learning. You train it by providing a few examples of your desired output. It will then be able to extract this information from any new page you provide.


Background Story

Many services for crawling and scraping automation allow you to select data in a browser and get JSON results in return. No need to specify CSS selectors or anything else.

I’ve been wondering for a long time why there’s no open-source solution that does something like this. So here’s my attempt at creating a Python library to enable automatic scraping.

All you have to do is define some examples of scraped data. mlscraper will figure out everything else and return clean data.

Currently, this is a proof of concept with a simplistic solution.

How it works

After you’ve defined the data you want to scrape, mlscraper will:

  • find your samples inside the HTML DOM

  • determine which rules/methods to apply for extraction

  • extract the data for you and return it in a dictionary

from mlscraper import MultiItemScraper
from mlscraper.training import MultiItemPageSample

# the items found on the training page
items = [
    {"title": "One great result!", "description": "Some description"},
    {"title": "Another great result!", "description": "Another description"},
    {"title": "Result to be found", "description": "Description to crawl"},
]

# html is assumed to hold the training page's HTML as a string
# training the scraper with the items
sample = MultiItemPageSample(html, items)
scraper = MultiItemScraper.build([sample])
scraper.scrape(html)  # will produce the items above
# new_html is assumed to hold another page's HTML as a string
scraper.scrape(new_html)  # will apply the learned rules and extract new items
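The three steps listed under "How it works" can be illustrated with a toy sketch built only on the standard library. This is not mlscraper's implementation: the names TextCollector, text_nodes, learn_rules and extract are hypothetical, and the "rule" learned here is simply the tag and class of the element that holds each sample value. It assumes every field sits in an element with a distinguishing tag/class pair and ignores void elements such as br.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect (tag, class, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, class) pairs
        self.nodes = []  # (tag, class, text) for each text node seen

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # pop back to the matching open tag, tolerating unclosed children
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.nodes.append((*self.stack[-1], text))


def text_nodes(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def learn_rules(html, sample):
    """Steps 1 and 2: find each sample value and keep the (tag, class) holding it."""
    rules = {}
    for field, value in sample.items():
        for tag, cls, text in text_nodes(html):
            if text == value:
                rules[field] = (tag, cls)
                break
        else:
            raise ValueError(f"sample value for {field!r} not found in HTML")
    return rules


def extract(html, rules):
    """Step 3: apply the learned rules and return one dictionary per item."""
    nodes = text_nodes(html)
    columns = {
        field: [text for tag, cls, text in nodes if (tag, cls) == rule]
        for field, rule in rules.items()
    }
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


train_html = """
<ul>
  <li><h2 class="title">One great result!</h2><p class="desc">Some description</p></li>
  <li><h2 class="title">Another great result!</h2><p class="desc">Another description</p></li>
</ul>
"""
new_html = """
<ul>
  <li><h2 class="title">Fresh result</h2><p class="desc">Fresh description</p></li>
</ul>
"""

# a single example item is enough for this toy rule learner
rules = learn_rules(
    train_html,
    {"title": "One great result!", "description": "Some description"},
)
items = extract(new_html, rules)
print(items)
```

A real rule learner has to cope with ambiguous matches, nested markup and items with missing fields, which is where picking among candidate rules becomes the hard part.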

Getting started

Install the library locally by running pip install -e . in the repository root. You can then import it as mlscraper and use it as shown in the examples.

Development

See CONTRIBUTING.rst

History

0.1.2 (2020-09-27)

  • First release on PyPI.

Download files


Source Distribution

mlscraper-0.1.2.tar.gz (20.2 kB view details)

Uploaded Source

Built Distribution

mlscraper-0.1.2-py2.py3-none-any.whl (12.1 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file mlscraper-0.1.2.tar.gz.

File metadata

  • Download URL: mlscraper-0.1.2.tar.gz
  • Upload date:
  • Size: 20.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.0

File hashes

Hashes for mlscraper-0.1.2.tar.gz
Algorithm    Hash digest
SHA256       01cb99ba55eb296431061912c490c94a3179bbad6692b2c0f22c3ac0c22c6766
MD5          a32633dc23788987a055a144af94d9b6
BLAKE2b-256  a7fc8bf05f00be5776bd6981ec83fa7bfb76639daeb94bb5864b5cc994921cb2
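
A published digest like the SHA256 value listed for the source distribution can be checked against a downloaded file with Python's hashlib. The following is a minimal sketch: the helper names sha256_of and matches_published_hash are hypothetical, and it assumes the archive has been saved to the current directory under its original file name.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a local file's digest with a published one, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digest copied from the listing for mlscraper-0.1.2.tar.gz
EXPECTED_SHA256 = "01cb99ba55eb296431061912c490c94a3179bbad6692b2c0f22c3ac0c22c6766"
archive = "mlscraper-0.1.2.tar.gz"  # assumed location: current directory

if os.path.exists(archive):
    print("hash matches:", matches_published_hash(archive, EXPECTED_SHA256))
else:
    print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for a 20 kB archive but makes the helper reusable for larger downloads.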


File details

Details for the file mlscraper-0.1.2-py2.py3-none-any.whl.

File metadata

  • Download URL: mlscraper-0.1.2-py2.py3-none-any.whl
  • Upload date:
  • Size: 12.1 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.0

File hashes

Hashes for mlscraper-0.1.2-py2.py3-none-any.whl
Algorithm    Hash digest
SHA256       32b28fbfe438290d6852f51e28b2e4def071284e9060a3d67df187347749e0ee
MD5          6d78e00600c14ab86240a7a7e626d931
BLAKE2b-256  3af2213569bfbeb50bdfa278456d5f17125dd481f3f1113d8f792a05692e198b
