Scrape HTML automatically with machine learning.
Turn HTML into structured data automatically with machine learning using mlscraper
mlscraper allows you to extract structured data from HTML automatically with Machine Learning. You train it by providing a few examples of your desired output. It will then be able to extract this information from any new page you provide.
Background Story
Many services for crawling and scraping automation allow you to select data in a browser and get JSON results in return. No need to specify CSS selectors or anything else.
I’ve been wondering for a long time why there’s no open source solution that does something like this. So here’s my attempt at creating a Python library to enable automatic scraping.
All you have to do is define some examples of scraped data. mlscraper will figure out everything else and return clean data.
Currently, this is a proof of concept with a simplistic solution.
How it works
After you’ve defined the data you want to scrape, mlscraper will:
1. find your samples inside the HTML DOM
2. determine which rules/methods to apply for extraction
3. extract the data for you and return it in a dictionary
from mlscraper import MultiItemScraper
from mlscraper.training import MultiItemPageSample

# html is assumed to hold the HTML source of the training page,
# new_html the HTML source of a page the scraper has not seen yet

# the items found on the training page
items = [
    {"title": "One great result!", "description": "Some description"},
    {"title": "Another great result!", "description": "Another description"},
    {"title": "Result to be found", "description": "Description to crawl"},
]

# training the scraper with the items
sample = MultiItemPageSample(html, items)
scraper = MultiItemScraper.build([sample])

scraper.scrape(html)  # will produce the items above
scraper.scrape(new_html)  # will apply the learned rules and extract new items
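To make the three steps under "How it works" more concrete, here is a toy sketch that uses only the Python standard library. It is not mlscraper's actual implementation. It assumes that a "rule" is simply the tag name and class attribute of the element enclosing a sample value, and the names TextCollector, collect, learn_rules and extract are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Record each non-empty text node with the (tag, class) of its enclosing element."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open elements as (tag, class attribute or None)
        self.nodes = []   # (rule, text) pairs, where rule = (tag, class)

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # pop back to the most recent matching open tag
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.nodes.append((self._stack[-1], text))


def collect(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_rules(html, items):
    """Steps 1 and 2: find the sample values, then keep a rule that yields exactly them."""
    nodes = collect(html)
    rules = {}
    for field in items[0]:
        wanted = [item[field] for item in items]
        # step 1: every rule whose element contains one of the sample values
        candidates = {rule for rule, text in nodes if text in wanted}
        # step 2: accept the first candidate that reproduces all samples in order
        for rule in sorted(candidates, key=str):
            if [text for r, text in nodes if r == rule] == wanted:
                rules[field] = rule
                break
        else:
            raise ValueError(f"no rule found for field {field!r}")
    return rules


def extract(html, rules):
    """Step 3: apply the learned rules and zip the per-field values into dictionaries."""
    nodes = collect(html)
    columns = {
        field: [text for r, text in nodes if r == rule]
        for field, rule in rules.items()
    }
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


# made-up training page and samples, for illustration only
train_html = """
<div class="result"><h2 class="title">One great result!</h2><p class="desc">Some description</p></div>
<div class="result"><h2 class="title">Another great result!</h2><p class="desc">Another description</p></div>
"""
items = [
    {"title": "One great result!", "description": "Some description"},
    {"title": "Another great result!", "description": "Another description"},
]

rules = learn_rules(train_html, items)
new_html = '<div class="result"><h2 class="title">Fresh result</h2><p class="desc">Fresh description</p></div>'
print(rules)
print(extract(new_html, rules))
```

A rule based only on tag and class breaks as soon as two fields share the same markup, so a real scraper would need to consider richer selectors and rank competing candidates.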
Getting started
To install the library locally, run pip install -e . (note the trailing dot) from the project root. You can then import it as mlscraper and use it as shown in the examples.
Development
See CONTRIBUTING.rst
History
0.1.2 (2020-09-27)
First release on PyPI.