# ScrapeMeAgain

Yet another Python web scraping application.

ScrapeMeAgain is a Python 3 powered web scraper. It uses multithreading (`ThreadPoolExecutor`) and multiprocessing to get the work done quicker, and stores collected data in an SQLite database.
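The thread-pool-plus-SQLite pattern mentioned above can be sketched as follows. This is a minimal, self-contained illustration of the technique, not ScrapeMeAgain's actual code; `fetch` is a placeholder for a real HTTP request:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for a real HTTP request (e.g. via `requests`);
    # here we just return the URL and a dummy "size".
    return url, len(url)

urls = ["http://example.com/a", "http://example.com/ab"]

# Fetch pages concurrently with a thread pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))

# Store the collected rows in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (url TEXT, size INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)", results)
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # prints 2
```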
## Installation

```bash
pip install scrapemeagain
```
## System requirements

Tor in combination with Privoxy is used for anonymity (i.e. regular IP address changes).

Using Docker and Docker Compose is the preferred (and easier) way to use ScrapeMeAgain.

You have to manually install and set up Tor and Privoxy on your system if not using Docker. For further information about installation and configuration refer to:

- A step-by-step guide how to use Python with Tor and Privoxy
- Crawling anonymously with Tor in Python (alternative link (Gist))
## Usage

You have to provide your own database table description and an actual scraper class which must follow the `BaseScraper` interface. See `examples/examplescraper` for more details.
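A scraper subclass might look roughly like this. Note this is only a sketch: the stand-in `BaseScraper` below illustrates the idea of an abstract scraping interface, but the real class lives in the `scrapemeagain` package and its method names and signatures may differ, so treat `examples/examplescraper` as the authoritative reference:

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Illustrative stand-in for scrapemeagain's BaseScraper;
    the actual interface may differ."""

    @abstractmethod
    def get_item_urls(self, html):
        """Extract item URLs from a listing page."""

    @abstractmethod
    def get_item_properties(self, html):
        """Extract a data dict from an item page."""

class ExampleScraper(BaseScraper):
    def get_item_urls(self, html):
        # Naive extraction for illustration; a real scraper
        # would use a proper HTML parser.
        return [
            line.split('"')[1]
            for line in html.splitlines()
            if 'href="' in line
        ]

    def get_item_properties(self, html):
        return {"title": html.strip().splitlines()[0]}

scraper = ExampleScraper()
print(scraper.get_item_urls('<a href="/item/1">x</a>'))  # prints ['/item/1']
```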
## Dockerized

With Docker it is possible to use multiple Tor IPs at the same time and, unless you abuse it, scrape data faster.

The easiest way to start is to duplicate `examples/examplescraper` and then update, rename, expand, etc. your scraper and related classes.

Your scraper must define `config.py` and `main_dockerized.py`. These two names are hardcoded throughout the codebase.
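For orientation, a `config.py` might contain module-level settings along these lines. The setting names below are purely hypothetical assumptions for illustration; consult `examples/examplescraper/config.py` for the actual keys the package expects:

```python
# Hypothetical config.py sketch -- these names are illustrative
# assumptions, NOT the package's actual configuration keys.
SCRAPER_PACKAGE = "examplescraper"   # package holding the scraper class
DB_NAME = "examplescraper.db"        # target SQLite database file
TORS_COUNT = 4                       # how many Tor/Privoxy instances to run
```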
`scrapemeagain-compose.py` dynamically creates a `docker-compose.yml` which orchestrates scraper instances. The idea is that the first scraper (`scp1`) is a *master* scraper and hence is the host for a couple of helper services which communicate over HTTP (see `dockerized/apps`).
- Get the Docker host IP

```bash
ip addr show docker0
```

*NOTE*: Your Docker interface name may be different from `docker0`.
- Run `examplesite` on the Docker host IP

```bash
python3 examples/examplescraper/examplesite/app.py 172.17.0.1
```

*NOTE*: Your Docker host IP may be different from `172.17.0.1`.
- Start `docker-compose`

```bash
scrapemeagain-compose.py -s $(pwd)/examples/examplescraper -c tests.integration.fake_config | docker-compose -f - up
```

*NOTE*: A special config file path is provided via `-c tests.integration.fake_config`. This is required only for test/demo purposes; you don't have to provide a specific config for a real/production scraper.
## Local

- Run `examplesite`

```bash
python3 examples/examplescraper/examplesite/app.py
```

- Start `examplescraper`

```bash
python3 examples/examplescraper/main.py
```

*NOTE*: You may need to update your `PYTHONPATH`, e.g. `export PYTHONPATH=$PYTHONPATH:$(pwd)/examples`.
## Development

To simplify running integration tests with the latest changes:

- replace `image: dusanmadar/scrapemeagain:x.y.z` with `image: scp:latest` in the `scrapemeagain/dockerized/docker-compose.yml` template
- and make sure to rebuild the image locally before running tests, e.g.

```bash
docker build . -t scp:latest; python -m unittest discover -p test_integration.py
```
## Legacy

The Python 2.7 version of ScrapeMeAgain, which also provides geocoding capabilities, is available under the `legacy` branch and is no longer maintained.