Fast, easy-to-use scraper for content-centered web pages, e.g. blog posts and news.

scrab - Fuzzy content scraper

Fast, easy-to-use content scraper for topic-centred web pages, e.g. blog posts, news and wikis.

The tool uses heuristics to extract main content and ignores surrounding noise. No processing rules. No XPath. No configuration.
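To make the idea of heuristic extraction concrete, here is a minimal sketch of one common approach: keep paragraphs that are long enough and contain little link text. This is an illustration only and not scrab's actual algorithm, which the README does not describe. The names `ParagraphCollector` and `extract_main_text` and the thresholds `min_chars` and `max_link_density` are assumptions made for this example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element and how much of it is link text."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # list of (text, link_chars) tuples
        self._in_p = False
        self._in_a = 0
        self._buf = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Start a fresh paragraph buffer.
            self._in_p = True
            self._buf = []
            self._link_chars = 0
        elif tag == "a" and self._in_p:
            self._in_a += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Normalise whitespace before storing the paragraph.
            text = " ".join("".join(self._buf).split())
            self.paragraphs.append((text, self._link_chars))
            self._in_p = False
            self._in_a = 0
        elif tag == "a" and self._in_a:
            self._in_a -= 1

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)
            if self._in_a:
                self._link_chars += len(data)


def extract_main_text(html, min_chars=40, max_link_density=0.5):
    """Return paragraphs that look like body text (hypothetical heuristic).

    Short paragraphs and paragraphs dominated by link text are treated
    as navigation or other noise and dropped.
    """
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.paragraphs:
        if len(text) < min_chars:
            continue
        if link_chars / len(text) > max_link_density:
            continue
        kept.append(text)
    return "\n\n".join(kept)
```

Real pages need more than this (nested blocks, headings, malformed markup), so treat the thresholds as starting points rather than tuned values.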

Installing

pip install scrab

Usage

scrab https://blog.post

Save the extracted content to a file:

scrab https://blog.post > content.txt
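The same thing can be done from a Python script by calling the command-line tool. The sketch below assumes only what the usage above shows: that `scrab <url>` writes the extracted content to standard output. It does not use any Python API of the package, since none is documented here. The helper names `build_command` and `save_content` are hypothetical.

```python
import subprocess


def build_command(url):
    """Return the argv for the documented invocation: scrab <url>."""
    return ["scrab", url]


def save_content(url, path, run=subprocess.run):
    """Run scrab for `url` and write its standard output to `path`.

    `run` is injectable so the function can be exercised without the
    tool installed; by default it is subprocess.run.
    Returns the number of characters written.
    """
    result = run(build_command(url), capture_output=True, text=True, check=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return len(result.stdout)


# Example call, equivalent to: scrab https://blog.post > content.txt
# save_content("https://blog.post", "content.txt")
```

With `check=True`, a non-zero exit status from the command raises `subprocess.CalledProcessError` instead of silently writing an empty file.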

ToDo List

  • Add support for lists
  • Add support for scripts
  • Add support for markdown output format
  • Download and save referenced images
  • Extract and embed links

Development

# Lint with flake8
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

# Check with mypy
mypy ./scrab
mypy ./tests

# Run tests
pytest

Publish to PyPI:

rm -rf dist/*
python setup.py sdist bdist_wheel
twine upload dist/*

License

This project is licensed under the MIT License.

Download files

Source Distribution

scrab-0.0.5.tar.gz (6.8 kB view details)

Uploaded Source

Built Distributions

scrab-0.0.5-py3.8.egg (13.4 kB view details)

Uploaded Source

scrab-0.0.5-py3-none-any.whl (7.5 kB view details)

Uploaded Python 3

File details

Details for the file scrab-0.0.5.tar.gz.

File metadata

  • Download URL: scrab-0.0.5.tar.gz
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.2

File hashes

Hashes for scrab-0.0.5.tar.gz
Algorithm Hash digest
SHA256 2d7601d2a6641ea2ceede61c47bf25da62fd75166ad3dd7c55e1625cb9a49c57
MD5 3d18f1552e1965d91a949d2b172ab7c4
BLAKE2b-256 244ed0eb5368483b1dd034d38ce0c5a0c82e7549dce7d04e702dfe9751a4c85a
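A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch; the function names `sha256_of` and `matches_published_hash` are made up for this example, and the file path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example, using the SHA256 value listed above:
# matches_published_hash(
#     "scrab-0.0.5.tar.gz",
#     "2d7601d2a6641ea2ceede61c47bf25da62fd75166ad3dd7c55e1625cb9a49c57",
# )
```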

File details

Details for the file scrab-0.0.5-py3.8.egg.

File metadata

  • Download URL: scrab-0.0.5-py3.8.egg
  • Size: 13.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.2

File hashes

Hashes for scrab-0.0.5-py3.8.egg
Algorithm Hash digest
SHA256 7ffcad65173dea7009ea0c52d728cc0f127db82a12b5828bb82a2319c8789f00
MD5 a17aecd2f791e51a653eb8fd47cd90d5
BLAKE2b-256 215489838412ebdffde925871ab1f01c5b15121b9eee8746bdf48db80ca849eb

File details

Details for the file scrab-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: scrab-0.0.5-py3-none-any.whl
  • Size: 7.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.47.0 CPython/3.8.2

File hashes

Hashes for scrab-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 30f064b6c837fec97250b039cb815cdc15344dea81e01b7283141e6c21c2c465
MD5 9e714ebf9683e3a76096b7bea106a6d2
BLAKE2b-256 c06c5fa2daece15bbde343384e9db9b3b2be52006e002113f55671112901080b
