Software Heritage indexer

Project description

Tools to compute multiple indexes on SWH’s raw contents:

  • content:

    • mimetype

    • fossology-license

    • metadata

  • origin:

    • metadata (intrinsic, using the content indexer; and extrinsic)

An indexer is in charge of:

  • looking up objects

  • extracting information from those objects

  • storing that information in the swh-indexer db

There are multiple indexers working on different object types:

  • content indexer: works with content sha1 hashes

  • revision indexer: works with revision sha1 hashes

  • origin indexer: works with origin identifiers

Indexing procedure:

  • receive a batch of ids

  • retrieve the associated data, depending on the object type

  • compute an index for each object

  • store the result in SWH’s storage
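
The four steps above can be sketched as a small loop. This is a minimal illustration only, not the swh-indexer API: `InMemoryObjects`, `InMemoryIndexStorage`, `index_size` and `run_indexer` are hypothetical names invented for this example, and the "index" computed here (the object length) is a toy placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class InMemoryObjects:
    """Hypothetical stand-in for the archive: maps an id to raw bytes."""

    data: Dict[str, bytes] = field(default_factory=dict)

    def get(self, obj_id: str) -> Optional[bytes]:
        return self.data.get(obj_id)


@dataclass
class InMemoryIndexStorage:
    """Hypothetical stand-in for the place where index results are stored."""

    rows: List[dict] = field(default_factory=list)

    def add(self, results: List[dict]) -> None:
        self.rows.extend(results)


def index_size(obj_id: str, raw: bytes) -> dict:
    """Toy index (an assumption for this sketch): the object length in bytes."""
    return {"id": obj_id, "length": len(raw)}


def run_indexer(
    ids: Iterable[str],
    objects: InMemoryObjects,
    storage: InMemoryIndexStorage,
    compute: Callable[[str, bytes], dict] = index_size,
) -> int:
    """Index one batch and return the number of results stored."""
    results = []
    for obj_id in ids:                        # 1. receive a batch of ids
        raw = objects.get(obj_id)             # 2. retrieve the associated data
        if raw is None:
            continue                          # no data for this id: skip it
        results.append(compute(obj_id, raw))  # 3. compute an index
    storage.add(results)                      # 4. store the results
    return len(results)
```

Passing a different `compute` function is how this sketch would model a different indexer working on the same batch of ids.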

Current content indexers:

  • mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype

  • fossology-license (queue swh_indexer_fossology_license): compute the license

  • metadata: translate files from ecosystem-specific formats to JSON-LD (using the schema.org/CodeMeta vocabulary)
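
To give an idea of what the metadata translation involves, here is a minimal sketch that turns an npm-style `package.json` into a JSON-LD document using CodeMeta terms. It is an assumption-laden illustration, not the mapping shipped with swh-indexer: `FIELD_MAP` and `translate_package_json` are hypothetical names, and only a handful of plain string fields are handled.

```python
import json
from typing import Any, Dict

# JSON-LD context published by the CodeMeta project (version 2.0).
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"

# Hypothetical, deliberately tiny mapping from package.json keys to
# schema.org/CodeMeta property names. A real mapping covers many more fields.
FIELD_MAP = {
    "name": "name",
    "version": "version",
    "description": "description",
    "homepage": "url",
}


def translate_package_json(raw: bytes) -> Dict[str, Any]:
    """Translate the raw bytes of a package.json file into a JSON-LD dict."""
    source = json.loads(raw.decode("utf-8"))
    doc: Dict[str, Any] = {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
    }
    for source_key, target_key in FIELD_MAP.items():
        value = source.get(source_key)
        # Only copy non-empty strings; unknown or missing keys are ignored.
        if isinstance(value, str) and value:
            doc[target_key] = value
    return doc
```

Keys that are absent from the mapping are dropped rather than guessed at, which keeps the output limited to terms the vocabulary actually defines.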

Current origin indexers:

  • metadata: translate files from ecosystem-specific formats to JSON-LD (using the schema.org/CodeMeta and ForgeFed vocabularies)

Custom indexers and metadata mappings can be added as plugins.

Download files

Source Distribution

swh_indexer-4.7.0.tar.gz (196.2 kB view details)

Uploaded Source

Built Distribution

swh_indexer-4.7.0-py3-none-any.whl (242.0 kB view details)

Uploaded Python 3

File details

Details for the file swh_indexer-4.7.0.tar.gz.

File metadata

  • Download URL: swh_indexer-4.7.0.tar.gz
  • Size: 196.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.12

File hashes

Hashes for swh_indexer-4.7.0.tar.gz:

  • SHA256: fbe615a036bc6fd1d590d29a03ae2c4c2954989f1aa655d675da2b7cf7d4aaa0
  • MD5: cff0fb792737d02bf6e1ab8d8a24cae6
  • BLAKE2b-256: f71f125a8bd4447e300393510e12ce9e7b5e612e1acf7c66b319c859d513a216

File details

Details for the file swh_indexer-4.7.0-py3-none-any.whl.

File metadata

  • Download URL: swh_indexer-4.7.0-py3-none-any.whl
  • Size: 242.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.12

File hashes

Hashes for swh_indexer-4.7.0-py3-none-any.whl:

  • SHA256: 25065b62cbfbf05a03af033c039ecadaf4e75f35bace961429913f29846d7570
  • MD5: 97e63794ee12c73afb1203bc5ade0fcd
  • BLAKE2b-256: c2eddc1514cca4f2091d3d0fd29714aec5bfdc21c2f1a9005c3fb20c83df7729
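
A downloaded file can be checked against the SHA256 digests listed above with the Python standard library. The sketch below is one possible way to do it; `sha256_of` and `matches_published` are helper names made up for this example, and the file path is whatever location you saved the download to.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of(path: Union[str, Path], chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Union[str, Path], expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published("swh_indexer-4.7.0.tar.gz", "<SHA256 digest from the list above>")` should return `True` for an intact copy of the source distribution.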
