Scrape HTML automatically
mlscraper allows you to extract structured data from HTML automatically instead of manually specifying nodes or CSS selectors. You train it by providing a few examples of your desired output. It then figures out the extraction rules for you, and afterwards you can extract data from any new page you provide.
Background Story
Many services for crawling and scraping automation allow you to select data in a browser and get JSON results in return. No need to specify CSS selectors or anything else.
I’ve been wondering for a long time why there’s no open-source solution that does something like this. So here’s my attempt at creating a Python library to enable automatic scraping.
All you have to do is define some examples of scraped data. mlscraper will figure out everything else and return clean data.
How it works
After you’ve defined the data you want to scrape, mlscraper will:
find your samples inside the HTML DOM
determine which rules/methods to apply for extraction
extract the data for you and return it in a dictionary
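The three steps above can be illustrated with a toy sketch. It is not mlscraper's implementation and uses none of its API: it relies only on Python's standard library, and every name in it (TextPathCollector, text_paths, learn_rules, extract) is hypothetical. It also assumes that a value's position is fully described by the chain of tag names and class attributes leading to it, which is far simpler than what a real scraper has to handle.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are skipped so the path stack stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextPathCollector(HTMLParser):
    """Collect (path, text) pairs; path is the chain of open tags plus their class."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((" > ".join(self.stack), text))


def text_paths(html):
    """Return all (path, text) pairs found in an HTML string."""
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    return collector.found


def learn_rules(html, sample):
    """Steps 1 and 2: find each sample value in the page and keep the path to it."""
    pairs = text_paths(html)
    rules = {}
    for key, value in sample.items():
        paths = [path for path, text in pairs if text == value]
        if not paths:
            raise ValueError(f"sample value {value!r} not found in training page")
        rules[key] = paths[0]
    return rules


def extract(html, rules):
    """Step 3: apply the learned paths to a new page and return a dictionary."""
    pairs = text_paths(html)
    result = {}
    for key, path in rules.items():
        matches = [text for p, text in pairs if p == path]
        result[key] = matches[0] if matches else None
    return result


# Made-up pages that share the same layout.
train_html = (
    '<html><body><div class="author">'
    '<h1 class="name">Albert Einstein</h1>'
    '<span class="born">March 14, 1879</span>'
    "</div></body></html>"
)
new_html = (
    '<html><body><div class="author">'
    '<h1 class="name">Jane Austen</h1>'
    '<span class="born">December 16, 1775</span>'
    "</div></body></html>"
)

rules = learn_rules(train_html, {"name": "Albert Einstein", "born": "March 14, 1879"})
print(extract(new_html, rules))
```

The sketch learns one exact path per field from a single example, so it breaks as soon as the layout of a new page differs even slightly. Providing several examples, as described above, is what lets a scraper pick rules that generalize.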
Getting started
mlscraper is currently close to its 1.0 release.
If you want to try the new release, use pip install --pre mlscraper to test the release candidate.
You can also install the latest (unstable) development version of mlscraper via pip install git+https://github.com/lorey/mlscraper#egg=mlscraper, e.g. to check new features or to see whether a bug has already been fixed.
Please note that until the 1.0 release, pip install mlscraper will install an outdated 0.* version.
Check the examples directory for usage examples until further documentation arrives.
Development
See CONTRIBUTING.rst