A fast, easy-to-use scraper for content-centered web pages, e.g. blog posts and news.

scrab - Fuzzy content scraper

Fast and easy to use content scraper for topic-centred web pages, e.g. blog posts, news and wikis.

The tool uses heuristics to extract the main content and ignore the surrounding noise. No processing rules. No XPath. No configuration.
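
To give a feel for what heuristic extraction can look like, here is a minimal sketch of one common approach: score each block container by how much non-link text it holds and keep the best one. This is an illustration only and is not taken from scrab's source; the names `DensityScorer` and `extract_main_text`, the tag sets, and the scoring weight are all assumptions.

```python
from html.parser import HTMLParser

# Tags whose text is treated as noise in this sketch (assumed list).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Containers that are scored as candidate content blocks (assumed list).
BLOCK_TAGS = {"div", "article", "section", "main", "td"}


class DensityScorer(HTMLParser):
    """Collects text per block container and counts link text separately."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks as (text, link_chars)
        self._stack = []    # open blocks as [text_parts, link_chars]
        self._skip = 0      # nesting depth inside SKIP_TAGS
        self._in_link = 0   # nesting depth inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._stack.append([[], 0])
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS and self._stack:
            parts, link_chars = self._stack.pop()
            self.blocks.append((" ".join(parts), link_chars))
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        text = " ".join(data.split())
        # Ignore whitespace, noise regions, and text outside any block.
        if not text or self._skip or not self._stack:
            return
        block = self._stack[-1]  # text belongs to the innermost open block
        block[0].append(text)
        if self._in_link:
            block[1] += len(text)


def extract_main_text(html):
    """Return the text of the block with the most non-link characters."""
    scorer = DensityScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.blocks:
        return ""
    # Link text is penalised twice so menus and link lists score low.
    best = max(scorer.blocks, key=lambda b: len(b[0]) - 2 * b[1])
    return best[0]
```

Calling `extract_main_text(page_html)` on a page with a navigation menu and a single article body should return the article text, because the menu consists almost entirely of link text. A real extractor would need more signals than this (paragraph counts, punctuation, class names), so treat the sketch as a starting point rather than a description of scrab itself.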

Installing

pip install scrab

Usage

scrab https://blog.post

Store extracted content to a file:

scrab https://blog.post > content.txt
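
The same command can be driven from a script when several pages need to be saved. The sketch below only relies on the `scrab <url>` invocation shown above; the function names `extract` and `save` are hypothetical, and it assumes that `scrab` is on the PATH and returns a non-zero exit code on failure.

```python
import subprocess


def extract(url, runner=subprocess.run):
    """Run `scrab <url>` and return whatever it prints to stdout.

    `runner` is injectable so the function can be exercised without
    the scrab executable being installed (an assumption of this sketch).
    """
    result = runner(["scrab", url], capture_output=True, text=True, check=True)
    return result.stdout


def save(url, path, extract_fn=extract):
    """Write the extracted content of `url` to `path` as UTF-8 text."""
    content = extract_fn(url)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(content)
    return len(content)


# Example (requires scrab to be installed):
#     save("https://blog.post", "content.txt")
```

Passing the command as a list avoids shell quoting problems with URLs that contain `&` or `?`.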

ToDo List

  • Add support for lists
  • Add support for scripts
  • Add support for markdown output format
  • Download and save referenced images
  • Extract and embed links

Development

# Lint with flake8
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

# Check with mypy
mypy ./scrab
mypy ./tests

# Run tests
pytest

Publish to PyPI:

rm -rf dist/*
python setup.py sdist bdist_wheel
twine upload dist/*

License

This project is licensed under the MIT License.

Download files

Source Distribution

scrab-0.0.2.tar.gz (2.6 kB)

Uploaded Source

Built Distribution

scrab-0.0.2-py3-none-any.whl (3.2 kB)

Uploaded Python 3

File details

Details for the file scrab-0.0.2.tar.gz.

File metadata

  • Download URL: scrab-0.0.2.tar.gz
  • Upload date:
  • Size: 2.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.2

File hashes

Hashes for scrab-0.0.2.tar.gz
Algorithm Hash digest
SHA256 d8e61eccc6376148a9e2a17116bd88041cc29b0f9b3cd341e128995089f2c493
MD5 575dcf66acc69aee052ffbc4d8482c4f
BLAKE2b-256 0800c0650442ede808f2bddaf5c5beac9809eafb896e565528849d0b4dfae4b3

File details

Details for the file scrab-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: scrab-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 3.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.2

File hashes

Hashes for scrab-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 a23a63ac27e85bfb6f268cbcad37fd2a802058f6c3753768685636bb226b4b46
MD5 fc69e1768696f16eaaf584b6e01e435a
BLAKE2b-256 16399a25ba98f9ec976eceb114ce8f7248d50b8e29293bcf30fea428418282f0
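
A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. This is a minimal sketch; the helper names `sha256_of` and `matches` are hypothetical, and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 equals the published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example, assuming the sdist was saved next to this script:
#     matches("scrab-0.0.2.tar.gz",
#             "d8e61eccc6376148a9e2a17116bd88041cc29b0f9b3cd341e128995089f2c493")
```

pip can also enforce digests at install time through hash-checking mode (`--require-hashes` with a requirements file), which avoids the manual step.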
