
A minimalistic, recursive web crawling library for Python.

Project description

The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.

Funes the Memorious, Jorge Luis Borges


memorious is a lightweight web scraping toolkit. It supports scrapers that collect structured or unstructured data. Its goals are to:

  • Make crawlers modular and simple tasks reusable

  • Provide utility functions for common tasks such as data storage and HTTP session management

  • Integrate crawlers with the Aleph and FollowTheMoney ecosystem

  • Get out of your way as much as possible

Design

When writing a scraper, you often need to paginate through an index page, download an HTML page for each result, and finally parse that page and insert or update a record in a database.

memorious handles this by managing a set of crawlers, each of which can be composed of multiple stages. Each stage is implemented using a Python function, which can be reused across different crawlers.
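The idea of stages as reusable Python functions can be sketched in plain Python. This is a minimal illustration, not the memorious API: the `(context, data)` signature, the `emit` method, the `SketchContext` class, the `run` helper and the `PIPELINE` layout are all hypothetical names chosen for the example, and the fetch stage fakes the download instead of making an HTTP request.

```python
from collections import deque


class SketchContext:
    """Hypothetical stand-in for the object a stage uses to pass data on."""

    def __init__(self, queue, handlers):
        self._queue = queue
        self._handlers = handlers  # maps a rule name to the next stage name

    def emit(self, data, rule="pass"):
        # Queue the data for whichever stage is wired to this rule, if any.
        next_stage = self._handlers.get(rule)
        if next_stage is not None:
            self._queue.append((next_stage, data))


def seed(context, data):
    # Stage 1: paginate through the index (here: two fixed page numbers).
    for page in range(1, 3):
        context.emit({"url": f"https://example.org/index?page={page}"})


def fetch(context, data):
    # Stage 2: a real stage would download data["url"]; the body is faked.
    context.emit({**data, "body": f"<h1>Record from {data['url']}</h1>"})


RECORDS = []


def store(context, data):
    # Stage 3: stand-in for inserting or updating a database record.
    RECORDS.append(data)


# Assumed wiring: each stage names its function and where its output goes.
PIPELINE = {
    "seed": {"method": seed, "handle": {"pass": "fetch"}},
    "fetch": {"method": fetch, "handle": {"pass": "store"}},
    "store": {"method": store},
}


def run(pipeline, start):
    # Process queued (stage, data) pairs until no stage emits anything more.
    queue = deque([(start, {})])
    while queue:
        name, data = queue.popleft()
        stage = pipeline[name]
        context = SketchContext(queue, stage.get("handle", {}))
        stage["method"](context, data)


run(PIPELINE, "seed")
```

Because each stage only sees its input data and an `emit` call, the same `fetch` or `store` function could be wired into a different pipeline without changes, which is the reuse the paragraph above describes.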

The basic steps of writing a Memorious crawler:

  1. Make a YAML crawler configuration file

  2. Add different stages

  3. Write code for stage operations (optional)

  4. Test, rinse, repeat

Documentation

The documentation for Memorious is available at docs.investigraph.dev/lib/memorious. Feel free to edit the source files in the docs folder and send pull requests for improvements.

To serve the documentation locally, run mkdocs serve

Download files


Source Distribution

memorious4-4.0.0.tar.gz (58.9 kB view details)

Uploaded Source

Built Distribution


memorious4-4.0.0-py3-none-any.whl (76.4 kB view details)

Uploaded Python 3

File details

Details for the file memorious4-4.0.0.tar.gz.

File metadata

  • Download URL: memorious4-4.0.0.tar.gz
  • Size: 58.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.13.5 Linux/6.12.63+deb13-amd64

File hashes

Hashes for memorious4-4.0.0.tar.gz
Algorithm Hash digest
SHA256 987287f4ed0e77f77a0720ad180882f9ab956f47e601fb0f0b7f1d97caa8d20b
MD5 d263b3db673f6bbfb36f2648f26359dd
BLAKE2b-256 f02e6405ab38f8377ea33c5d4ce77907a9ccd45aff4ff7ca7e0ab8bfa40308f4
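A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below is one way to do that; the helper names `sha256_of` and `matches_published_hash` are made up for this example, and the file is assumed to be in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a hash copied from the listing."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("memorious4-4.0.0.tar.gz", "987287f4ed0e77f77a0720ad180882f9ab956f47e601fb0f0b7f1d97caa8d20b")` should return `True` for an intact copy of the source distribution.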


File details

Details for the file memorious4-4.0.0-py3-none-any.whl.

File metadata

  • Download URL: memorious4-4.0.0-py3-none-any.whl
  • Size: 76.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.13.5 Linux/6.12.63+deb13-amd64

File hashes

Hashes for memorious4-4.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7e87066073076b0cd046dab38ffe968e03748aa305e92d1d85876268de3c0e3d
MD5 cf305aa5b5791de895dd8144dfdb0896
BLAKE2b-256 041e909a6546a7ae40fe01b33b3dbd809b3c5fbda15eb691ee26d1c27a36acac

