An automated, programming-free web scraper for interactive sites
A project of Artificial Informer Labs.
AutoScrape is an automated scraper of structured data from interactive web pages. You point the scraper at a site, give it a little information, and structured data can then be extracted. No brittle, site-specific programming necessary.
This is an implementation of the web scraping framework described in the paper Robust Web Scraping in the Public Interest with AutoScrape, presented at the Computation + Journalism Symposium 2019.
Currently there are two methods of running AutoScrape:
- as a local CLI Python script
- as a full web interface for scraping (see bottom of page)
Installation and running instructions are provided for both below.
Two ways, easiest first.
```
pip install autoscrape[all]
autoscrape --backend requests --output outdir --maxdepth 2 https://bxroberts.org
```
This will install all dependencies for all backends and various options.
```
git clone https://github.com/brandonrobertz/autoscrape-py
cd autoscrape-py/
pip install .[all]
autoscrape --backend requests --output outdir --maxdepth 2 https://bxroberts.org
```
Either way, you can now use autoscrape from the command line.
Here are some straightforward use cases for AutoScrape and how you’d use the CLI tool to execute them. These, of course, assume you have the dependencies installed.
You can control the backend with the --backend option:
```
autoscrape \
    --backend requests \
    --output requests_crawled_site \
    'https://some.page/to-crawl'
```
In order to use backends other than requests, you need to install the proper dependencies. pip install autoscrape[all] will install everything required for all backends/functionality, but you can also install dependencies in isolation:
Selenium backend: pip install autoscrape[selenium-backend]
Crawl graph builder (for use with --save-graph): pip install autoscrape[graph]
WARC backend: pip install autoscrape[warc-backend]
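If you need more than one optional feature, the extras can be combined in a single pip install. The combination below is just an illustration of the syntax, assuming you want both the Selenium backend and the crawl graph builder:

```
# install the Selenium backend and crawl graph builder extras together
pip install "autoscrape[selenium-backend,graph]"
```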
Note that for the Selenium backend, you need to install geckodriver or chromedriver, depending on whether you're using Firefox or Chrome, respectively. More information is below in the External Dependencies section.
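Once the driver is installed, a minimal Selenium-backed crawl makes a quick sanity check. This sketch simply reuses the --backend, --driver, and --output flags from the examples elsewhere in this README:

```
# basic Selenium-backed crawl using Firefox/geckodriver
autoscrape \
    --backend selenium \
    --driver Firefox \
    --output selenium_crawled_site \
    'https://some.page/to-crawl'
```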
Crawl an entire website, saving all HTML and stylesheets (no screenshots):
```
autoscrape \
    --backend requests \
    --maxdepth -1 \
    --output crawled_site \
    'https://some.page/to-crawl'
```
Archive Page (Screenshot & Code)
Archive a single webpage, both code and full-content screenshot (PNG), for future reference:
```
autoscrape \
    --backend selenium \
    --full-page-screenshots \
    --load-images --maxdepth 0 \
    --save-screenshots --driver Firefox \
    --output archived_webpage \
    'https://some.page/to-archive'
```
Search Forms and Crawl Result Pages
Query a web form, identified by the text "I'm a search form" that it contains, entering "NAME" into the first (0th) text input field and selecting January 20th, 1992 in the second (1st) date field. Then click all buttons with the text "Next ->" to get all result pages:
```
autoscrape \
    --backend selenium \
    --output search_query_data \
    --form-match "I'm a search form" \
    --input "i:0:NAME,d:1:1992-01-20" \
    --next-match "Next ->" \
    'https://some.page/search?s=newquery'
```
Setup for Standalone Local CLI
If you want to use the selenium backend for interactive crawling, you need to have geckodriver installed. You can download it from the geckodriver releases page, or install it through your package manager:
- apt install firefox-geckodriver
Your geckodriver needs to be compatible with your current version of Firefox or you will get errors. If you install Firefox and the driver through your package manager, you should be okay, but it's not guaranteed. We have specific versions of both pinned in the Dockerfile.
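One way to check for a mismatch is to print both versions and compare them; this is only a sanity check, not part of AutoScrape itself:

```
# print the installed driver and browser versions to verify compatibility
geckodriver --version
firefox --version
```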
If you prefer to use Chrome, you will need the ChromeDriver (we've tested using v2.41). It can be found in your distribution's package manager or on the ChromeDriver downloads page.
Installing the remaining Python dependencies can be done using pip.
Pip Install Method
Next, you need to set up your Python virtual environment (Python 3.6 required) and install the Python dependencies:
pip install -r requirements.txt
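For reference, a typical setup using the standard venv module might look like the following; the environment name venv is just an example:

```
# create and activate an isolated Python environment, then install dependencies
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```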
Running Standalone Scraper
Environment Test Crawler
You can run a test to ensure your webdriver is set up correctly by running the test crawler:
./autoscrape --backend selenium --show-browser [SITE_URL]
The test crawler will just do a depth-first click-only crawl of an entire website. It will not interact with forms or POST data. Data will be saved to ./autoscrape-data/ (the default output directory).
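If you'd rather save the test crawl somewhere else, you can point it at a different directory with the --output option used throughout this README; the directory name here is only an example:

```
# same click-only test crawl, writing results to a custom directory
./autoscrape --backend selenium --show-browser --output my-test-crawl [SITE_URL]
```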
Manual Config-Based Scraper
AutoScrape has a manually controlled mode, similar to wget, except that it has interactive capabilities and can input data into search forms, follow "next page"-type buttons, etc. This functionality can be used either as a standalone crawler/scraper or as a method to build a training set for the automated scrapers.
Autoscrape manual-mode full options:
AutoScrape Web UI (Docker)
AutoScrape can be run as a containerized cluster environment, where scrapes can be triggered and stopped via API calls and data can be streamed to this server.
This requires the autoscrape-www submodule to be pulled:
```
git submodule init
git submodule update
```
This will pull the browser-based UI into the www/ folder.
```
docker-compose build --pull
docker-compose up
```
This will build the containers and launch an API server running on local port 5000. More information about the API calls can be found in autoscrape-server.py.
If you have make installed, you can simply run make start.
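Once the containers are up, a plain HTTP request is a quick way to confirm the API server is accepting connections on port 5000; this is only a connectivity check, and the actual routes are documented in autoscrape-server.py:

```
# confirm the API server is listening on local port 5000
curl -i http://localhost:5000/
```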