scrab - Fuzzy content scraper

A fast, easy-to-use content scraper for topic-centred web pages such as blog posts, news articles and wikis.

The tool uses heuristics to extract the main content of a page and ignore the surrounding noise. No processing rules. No XPath. No configuration.
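To give a feel for what heuristic extraction can look like, here is a minimal sketch of one common approach: split the page into text blocks, then keep only blocks that are long and contain little link text. This is an illustration only, not scrab's actual algorithm; the names `BlockCollector` and `extract_content` and the thresholds `min_chars` and `max_link_density` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags whose text is treated as noise in this sketch (assumed list).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Tags that start or end a text block (assumed list).
BLOCK_TAGS = {"p", "div", "article", "section", "li", "td",
              "h1", "h2", "h3", "blockquote", "pre"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the link text in each one."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_chars) tuples
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self._in_link = 0      # > 0 while inside an <a> element

    def _flush(self):
        # Collapse whitespace and store the block if it is not empty.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()  # store any text left after the last tag


def extract_content(html, min_chars=80, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = [
        text
        for text, link_chars in parser.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
    return "\n\n".join(kept)
```

Menus and sidebars tend to be short and link-heavy, so both filters drop them, while article paragraphs pass. Real pages need more care (nested layouts, short but meaningful paragraphs, headings), which is why the thresholds here should be read as placeholders.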

Installing

pip install scrab

Usage

scrab https://blog.post

Store the extracted content in a file:

scrab https://blog.post > content.txt
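The same redirect can be scripted when several pages need to be saved. The sketch below calls the `scrab` command through `subprocess` and writes its standard output to a file named after the URL. It assumes that `scrab` is on the `PATH`, prints the extracted text to stdout (as the redirect above implies) and exits with a non-zero status on failure; the helper names `output_name` and `scrape_to_file` are hypothetical and not part of the package.

```python
import subprocess
from pathlib import Path
from urllib.parse import urlparse


def output_name(url):
    """Derive a file name such as 'blog.post_2020_my-article.txt' from a URL."""
    parsed = urlparse(url)
    slug = parsed.path.strip("/").replace("/", "_")
    return f"{parsed.netloc}_{slug}.txt" if slug else f"{parsed.netloc}.txt"


def scrape_to_file(url, out_dir="."):
    """Run `scrab <url>` and save its stdout, like `scrab <url> > file`."""
    # check=True raises CalledProcessError on a non-zero exit status
    # (assumption: scrab signals failure that way).
    result = subprocess.run(
        ["scrab", url], capture_output=True, text=True, check=True
    )
    target = Path(out_dir) / output_name(url)
    target.write_text(result.stdout, encoding="utf-8")
    return target
```

For example, `scrape_to_file("https://blog.post")` would write `blog.post.txt` into the current directory.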

ToDo List

  • Add support for lists
  • Add support for scripts
  • Add support for markdown output format
  • Download and save referenced images
  • Extract and embed links

Development

# Lint with flake8
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

# Check with mypy
mypy ./scrab
mypy ./tests

# Run tests
pytest

Publish to PyPI:

rm -rf dist/*
python setup.py sdist bdist_wheel
twine upload dist/*

License

This project is licensed under the MIT License.
