ScrapeMeAgain

Yet another Python web scraping application.

ScrapeMeAgain is a Python 3 powered web scraper. It uses multithreading (ThreadPoolExecutor) and multiprocessing to get the work done faster, and stores collected data in an SQLite database.
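A minimal sketch of that general pattern (not ScrapeMeAgain's actual API): fetch pages concurrently with a ThreadPoolExecutor, then store the results in SQLite from the main thread. The `fetch` function and table layout here are purely illustrative.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for a real HTTP request.
    return url, f"<html>content of {url}</html>"

def scrape(urls, db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, body TEXT)")
    # Fetch in worker threads; write to SQLite in the main thread only,
    # since sqlite3 connections are not meant to be shared across threads.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for url, body in pool.map(fetch, urls):
            conn.execute("INSERT INTO pages VALUES (?, ?)", (url, body))
    conn.commit()
    return conn

conn = scrape(["http://a.example", "http://b.example"])
print(conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0])  # prints 2
```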

Installation

pip install scrapemeagain

System requirements

Tor, in combination with Privoxy, is used for anonymity (i.e. regular IP address changes).

Using Docker and Docker Compose is the preferred (and easier) way to use ScrapeMeAgain.

You have to install and set up Tor and Privoxy manually on your system if not using Docker. For further information about installation and configuration, refer to the official Tor and Privoxy documentation.

Usage

You have to provide your own database table description and an actual scraper class which must follow the BaseScraper interface. See examples/examplescraper for more details.
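The shape of that contract might look roughly like the sketch below. Note this is hypothetical: the actual BaseScraper interface and its method names live in the scrapemeagain package, and examples/examplescraper is the authoritative reference.

```python
from abc import ABC, abstractmethod

# Hypothetical stand-in for scrapemeagain's BaseScraper; method names
# are illustrative only.
class BaseScraper(ABC):
    @abstractmethod
    def generate_list_urls(self):
        """Yield URLs of listing pages to visit."""

    @abstractmethod
    def get_item_properties(self, response):
        """Extract a dict of item properties from a page response."""

class MyScraper(BaseScraper):
    base_url = "http://example.com"

    def generate_list_urls(self):
        yield from (f"{self.base_url}/list/{page}" for page in range(1, 3))

    def get_item_properties(self, response):
        # A real implementation would parse HTML here.
        return {"url": response}

scraper = MyScraper()
print(list(scraper.generate_list_urls()))
```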

Dockerized

With Docker it is possible to use multiple Tor IPs at the same time and, unless you abuse it, scrape data faster.

The easiest way to start is to duplicate examples/examplescraper and then update, rename, and expand your scraper and related classes as needed.

Your scraper must define config.py and main_dockerized.py. These two names are hardcoded throughout the codebase.
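For orientation, a config.py might contain module-level settings along these lines. The setting names below are assumptions for illustration only; the real set of options expected by ScrapeMeAgain is shown in examples/examplescraper/config.py.

```python
# config.py -- hypothetical minimal shape; setting names are assumed,
# not taken from the actual package.
DB_NAME = "example.db"   # assumed: SQLite database file name
SCRAPERS_COUNT = 2       # assumed: number of scraper (Tor) containers
```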

scrapemeagain-compose.py dynamically creates a docker-compose.yml which orchestrates scraper instances. The idea is that the first scraper (scp1) is a master scraper and hence is the host for a couple of helper services which communicate over HTTP (see dockerized/apps).

  1. Get the Docker host IP
ip addr show docker0

NOTE Your Docker interface name may be different from docker0.

  2. Run examplesite on the Docker host IP
python3 examples/examplescraper/examplesite/app.py 172.17.0.1

NOTE Your Docker host IP may be different from 172.17.0.1.

  3. Start docker-compose
scrapemeagain-compose.py -s $(pwd)/examples/examplescraper -c tests.integration.fake_config | docker-compose -f - up

NOTE A special config file path is provided (-c tests.integration.fake_config). This is required only for test/demo purposes; you don't have to provide a specific config for a real/production scraper.

Local

  1. Run examplesite
python3 examples/examplescraper/examplesite/app.py
  2. Start examplescraper
python3 examples/examplescraper/main.py

NOTE You may need to update your PYTHONPATH, e.g. export PYTHONPATH=$PYTHONPATH:$(pwd)/examples.

Development

To simplify running integration tests with the latest changes:

  • replace image: dusanmadar/scrapemeagain:x.y.z with image: scp:latest in the scrapemeagain/dockerized/docker-compose.yml template

  • and make sure to rebuild the image locally before running tests, e.g.

docker build . -t scp:latest && python -m unittest discover -p test_integration.py

Legacy

The Python 2.7 version of ScrapeMeAgain, which also provides geocoding capabilities, is available under the legacy branch and is no longer maintained.
