Yet another Python web scraping application
pip install scrapemeagain
Docker Compose is the preferred (and easier) way to
You have to manually install and setup
Privoxy on your system if not using
Docker. For further information about installation and configuration refer to:
- A step-by-step guide how to use Python with Tor and Privoxy
- Crawling anonymously with Tor in Python (alternative link (Gist))
You have to provide your own database table description and an actual scraper class which must follow the
BaseScraper interface. See
examples/examplescraper for more details.
With Docker it is possible to use multiple Tor IPs at the same time and, unless you abuse it, scrape data faster.
The easiest way to start is to duplicate
examples/examplescraper and then update, rename, expand, etc. your scraper and related classes.
Your scraper must define
main_dockerized.py. These two names are hardcoded throughout the codebase.
scrapemeagain-compose.py dynamically creates a
docker-compose.yml which orchestrates scraper instances. The idea is that the first scraper (
scp1) is a
master scraper and hence is the host for a couple of helper services which communicate over HTTP (see
- Get Docker host Ip
ip addr show docker0
NOTE Your Docker interface name may be different from docker0.
examplesiteon Docker host IP
python3 examples/examplescraper/examplesite/app.py 172.17.0.1
NOTE Your Docker host IP may be different from 172.17.0.1.
scrapemeagain-compose.py -s $(pwd)/examples/examplescraper -c tests.integration.fake_config | docker-compose -f - up
NOTE A special config file path is provided:
-c tests.integration.fake_config. This is required only for test/demo purposes. You don't have to provide specific config for a real/production scraper.
NOTE You may need to update your
To simplify running integration tests with latest changes:
image: scp:latestin the
and make sure to rebuild the image locally before running tests, e.g.
docker build . -t scp:latest; python -m unittest discover -p test_integration.py
The Python 2.7 version of ScrapeMeAgain, which also provides geocoding capabilities, is available under the
legacy branch and is no longer maintained.
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Hashes for scrapemeagain-1.0.5-py3-none-any.whl