Fast and easy-to-use scraper for content-centered web pages, e.g. blog posts and news.

scrab - Fuzzy content scraper

Fast and easy-to-use content scraper for topic-centred web pages such as blog posts, news articles and wikis.

The tool uses heuristics to extract the main content of a page and ignore the surrounding noise. No processing rules. No XPath. No configuration.
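To give a feel for what heuristic extraction can look like, here is a minimal sketch using only the Python standard library. It is not scrab's actual algorithm, which this page does not describe. The tag sets, the names `BlockCollector` and `extract_main_text`, and the thresholds `min_chars` and `max_link_ratio` are all assumptions made for illustration: a block is kept when it is long enough and not dominated by link text.

```python
from html.parser import HTMLParser

# Assumption of this sketch: text inside these tags is never main content.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumption of this sketch: these tags start and end a text block.
BLOCK_TAGS = {"p", "div", "article", "section", "li", "h1", "h2", "h3", "td"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the link text in each block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished (text, link_chars) pairs
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._noise_depth = 0  # > 0 while inside a noise tag
        self._link_depth = 0   # > 0 while inside an <a> tag

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._noise_depth:
            return  # ignore everything inside noise tags
        self._text.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_chars=80, max_link_ratio=0.3):
    """Keep blocks that are long enough and mostly made of non-link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = [
        text
        for text, link_chars in collector.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_ratio
    ]
    return "\n\n".join(kept)
```

Short, link-heavy blocks such as menus and tag lists tend to fall below both thresholds, which is why no per-site rules or XPath expressions are needed in a scheme like this.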

Installing

pip install scrab

Usage

scrab https://blog.post

Store the extracted content in a file:

scrab https://blog.post > content.txt
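The same redirection can be done from a Python script by calling the command line tool and writing its standard output to a file. This is a minimal sketch that relies only on the `scrab <url>` invocation shown above. The helper name `save_content` and its `command` parameter are hypothetical and are not part of the scrab package.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def save_content(url: str, target: Path, command: Sequence[str] = ("scrab",)) -> Path:
    """Run `scrab <url>` and write whatever it prints to `target`.

    `command` is a hypothetical parameter of this sketch; it only exists so
    the executable can be swapped, for example in tests.
    """
    result = subprocess.run(
        [*command, url],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    target.write_text(result.stdout, encoding="utf-8")
    return target
```

Calling `save_content("https://blog.post", Path("content.txt"))` should then be equivalent to the shell redirection above, assuming `scrab` is installed and on the `PATH`.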

ToDo List

  • Support <main> tag
  • Add support for lists
  • Add support for scripts
  • Add support for markdown output format
  • Download and save referenced images
  • Extract and embed links

Development

# Lint with flake8
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

# Check with mypy
mypy ./scrab
mypy ./tests

# Run tests
pytest

Publish to PyPI:

rm -rf dist/*
python setup.py sdist bdist_wheel
twine upload dist/*

License

This project is licensed under the MIT License.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrab-0.0.6.tar.gz (6.9 kB)

Uploaded Source

Built Distribution

scrab-0.0.6-py3-none-any.whl (8.0 kB)

Uploaded Python 3

File details

Details for the file scrab-0.0.6.tar.gz.

File metadata

  • Download URL: scrab-0.0.6.tar.gz
  • Upload date:
  • Size: 6.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.2

File hashes

Hashes for scrab-0.0.6.tar.gz
Algorithm Hash digest
SHA256 196a2ef7f342a38c3998065e86fb60fde9e1d93e54c2f40f3850958cd1728b62
MD5 18129abb3ef44f922df3fae04deb4a19
BLAKE2b-256 8b9e2cb6cdf023fee7d5578ae43c1814bf212da117910ebc0b6ce8cb440685fa
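
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch. The function names `sha256_of` and `matches_published_hash` are made up for this example, and the local file path is whatever the archive was saved as.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, `matches_published_hash(Path("scrab-0.0.6.tar.gz"), "196a2ef7f342a38c3998065e86fb60fde9e1d93e54c2f40f3850958cd1728b62")` should return `True` if the download is intact.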

File details

Details for the file scrab-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: scrab-0.0.6-py3-none-any.whl
  • Upload date:
  • Size: 8.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.2

File hashes

Hashes for scrab-0.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 fcb89f503fc5f813c65c49dc69eeb4d1f3ed31f1098e91fd71cb108792844c75
MD5 eb759281d17251cf950b58b897cb48d2
BLAKE2b-256 00702a71e835571c5022782533c7c2eb628f02fcf1492581cb5f4c31dfe4043f
