An automated, programming-free web scraper for interactive sites
A project of Artificial Informer Labs.
AutoScrape is an automated scraper of structured data from interactive web pages. Point it at a site and it will crawl the pages, search for forms, and extract structured data. No brittle, site-specific programming necessary.
This is an implementation of the web scraping framework described in the paper Robust Web Scraping in the Public Interest with AutoScrape, presented at the Computation + Journalism Symposium 2019.
Currently there are two methods of running AutoScrape:
- as a local CLI python script
- as a containerized system via the API and Web UI
Installation and running instructions are provided for both below.
Two ways to install, easiest first.
pip install autoscrape
autoscrape -h
git clone https://github.com/brandonrobertz/autoscrape-py
cd autoscrape-py/
python setup.py install
autoscrape -h
Either way, you can now use autoscrape from the command line.
Here are some straightforward use cases for AutoScrape and how you’d use the CLI tool to execute them. These, of course, assume you have the dependencies installed.
You can control the backend with the --backend option:
autoscrape \
  --backend requests \
  --output requests_crawled_site \
  'https://some.page/to-crawl'
Crawl an entire website, saving all HTML and stylesheets (no screenshots):
autoscrape \
  --backend requests \
  --maxdepth -1 \
  --output crawled_site \
  'https://some.page/to-crawl'
Archive Page (Screenshot & Code)
Archive a single webpage, both code and full-content screenshot (PNG), for future reference:
autoscrape \
  --backend selenium \
  --full-page-screenshots \
  --load-images --maxdepth 0 \
  --save-screenshots --driver Firefox \
  --output archived_webpage \
  'https://some.page/to-archive'
Search Forms and Crawl Result Pages
Query a web form, identified by the text “I’m a search form”, entering “NAME” into the first (0th) text input field and selecting January 20th, 1992 in the second (1st) date field. Then click all buttons containing the text “Next ->” to fetch all results pages:
autoscrape \
  --backend selenium \
  --output search_query_data \
  --form-match "I'm a search form" \
  --input "i:0:NAME,d:1:1992-01-20" \
  --next-match "Next ->" \
  'https://some.page/search?s=newquery'
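The --input option packs comma-separated type:index:value triples, where i marks a text input field and d a date field, as in the example above. The following is a small illustrative parser showing how that string is structured; it is a sketch for clarity, not AutoScrape's own implementation:

```python
# Illustrative parser for the --input string format shown above:
# comma-separated "type:index:value" triples, where type is
# "i" (text input) or "d" (date input). This is a sketch of the
# documented format, NOT AutoScrape's internal code.
def parse_input_spec(spec):
    fields = []
    for part in spec.split(","):
        # split at most twice so values may themselves contain colons
        kind, index, value = part.split(":", 2)
        fields.append({"type": kind, "index": int(index), "value": value})
    return fields

print(parse_input_spec("i:0:NAME,d:1:1992-01-20"))
```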
Setup for Standalone Local CLI
If you want to use the selenium backend for interactive crawling, you need geckodriver installed. You can download it here:
Your geckodriver needs to be compatible with your current version of Firefox or you will get errors.
If you prefer to use Chrome, you will need the ChromeDriver (we’ve tested using v2.41). It can be found in your distribution’s package manager or here:
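Before running the selenium backend, it can help to confirm the driver binary is actually on your PATH. Here is a quick sanity check (not part of AutoScrape itself) using Python's standard library:

```python
# Quick sanity check: is geckodriver (or chromedriver) on PATH?
# This is a convenience snippet, not part of AutoScrape.
import shutil

for driver in ("geckodriver", "chromedriver"):
    path = shutil.which(driver)
    print(f"{driver}: {path if path else 'not found on PATH'}")
```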
Installing the remaining Python dependencies can be done using pip.
Pip Install Method
Next, set up your Python virtual environment (Python 3.6 required) and install the Python dependencies:
pip install -r requirements.txt
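If you haven't created a virtual environment before, a minimal sketch (assuming python3 is on your PATH; the environment name autoscrape-venv is arbitrary):

```shell
# Create a virtual environment for AutoScrape (python3 assumed on PATH)
python3 -m venv autoscrape-venv
# Activate it (POSIX shells)
. autoscrape-venv/bin/activate
# Confirm the interpreter now resolves inside the venv
command -v python
```

With the environment active, the pip install command above installs the dependencies into the venv rather than system-wide.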
Running Standalone Scraper
Environment Test Crawler
You can run a test to ensure your webdriver is set up correctly by running the test crawler:
./autoscrape --backend selenium --show-browser [SITE_URL]
The test crawler will just do a depth-first click-only crawl of an entire website. It will not interact with forms or POST data. Data will be saved to ./autoscrape-data/ (the default output directory).
Manual Config-Based Scraper
Autoscrape has a manually controlled mode, similar to wget, except that it can interact with pages: inputting data into search forms, following “next page”-type buttons, etc. This functionality can be used either as a standalone crawler/scraper or as a method to build a training set for the automated scrapers.
Autoscrape manual-mode full options:
Setup Containerized API Version
AutoScrape can also be run as a containerized cluster environment, where scrapes can be triggered and stopped via API calls and scraped data can be streamed back to the server.
This requires the autoscrape-www submodule to be pulled:
git submodule init
git submodule update
This will pull the browser-based UI into the www/ folder.
docker-compose build --pull
docker-compose up -t0 --abort-on-container-exit
This will build the containers and launch an API server running on local port 5000. More information about the API calls can be found in autoscrape-server.py.
If you have make installed, you can simply run make start.
NOTE: This is a work in progress prototype that will likely be removed once AutoScrape is integrated into CJ Workbench.