Software Heritage indexer
Tools to compute multiple indexes on SWH’s raw contents:
content:
mimetype
fossology-license
metadata
origin:
metadata (intrinsic, using the content indexer; and extrinsic)
An indexer is in charge of:
looking up objects
extracting information from those objects
storing that information in the swh-indexer db
There are multiple indexers working on different object types:
content indexer: works with content sha1 hashes
revision indexer: works with revision sha1 hashes
origin indexer: works with origin identifiers
Indexation procedure:
receive a batch of ids
retrieve the associated data, depending on the object type
compute an index for each object
store the results in swh’s storage
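The procedure above can be sketched as a small loop. This is only an illustration, not the swh.indexer API: `run_indexer`, `fetch`, `compute`, and `store` are hypothetical names, and the object store and index database are plain in-memory Python objects.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


def run_indexer(
    ids: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
    compute: Callable[[bytes], Dict[str, Any]],
    store: Callable[[List[Dict[str, Any]]], None],
) -> List[Dict[str, Any]]:
    """Index a batch of object ids (hypothetical helper, for illustration)."""
    results = []
    for obj_id in ids:
        data = fetch(obj_id)  # retrieve the associated data
        if data is None:
            continue  # skip ids with no retrievable object
        results.append({"id": obj_id, **compute(data)})  # compute the index
    store(results)  # persist the whole batch at once
    return results


# In-memory stand-ins for the object storage and the index database.
objects = {"a1": b"#!/bin/sh\necho hi\n", "b2": b"hello"}
index_db: List[Dict[str, Any]] = []

indexed = run_indexer(
    ["a1", "b2", "missing"],
    fetch=objects.get,
    compute=lambda data: {"length": len(data)},  # toy index: content length
    store=index_db.extend,
)
```

Passing `fetch`, `compute`, and `store` as callables keeps the loop independent of the object type, which mirrors the idea that the same procedure is applied to content, revision, and origin ids.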
Current content indexers:
mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype
fossology-license (queue swh_indexer_fossology_license): compute the license
metadata: translate files from ecosystem-specific formats to JSON-LD (using the schema.org/CodeMeta vocabulary)
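As a rough idea of what the metadata translation involves, the sketch below converts a few fields of an npm `package.json` into a CodeMeta-style JSON-LD document. It is an assumption-laden example, not the indexer's actual mapping: `translate_package_json` and `FIELD_MAP` are hypothetical names, and only three fields whose names are the same on both sides are handled.

```python
import json
from typing import Any, Dict

# Context URL of the CodeMeta 2.0 vocabulary.
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"

# Illustrative subset of a package.json -> CodeMeta mapping (assumption:
# a real mapping covers many more fields and normalizes their values).
FIELD_MAP = {
    "name": "name",
    "version": "version",
    "description": "description",
}


def translate_package_json(raw: bytes) -> Dict[str, Any]:
    """Translate raw package.json bytes into a small JSON-LD document."""
    manifest = json.loads(raw)
    doc: Dict[str, Any] = {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
    }
    for source_key, target_key in FIELD_MAP.items():
        if source_key in manifest:
            doc[target_key] = manifest[source_key]  # copy mapped fields only
    return doc


translated = translate_package_json(
    b'{"name": "demo", "version": "1.0.0", "private": true}'
)
print(json.dumps(translated, indent=2))
```

Fields without a mapping (such as `private` here) are dropped rather than copied through, so the output only contains terms from the target vocabulary.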
Current origin indexers:
metadata: translate files from ecosystem-specific formats to JSON-LD (using the schema.org/CodeMeta and ForgeFed vocabularies)
Hashes for swh.indexer-3.6.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 38dbe5203f3362a58ef79d00e229cc0ddc42d562ba2f87fc333c115ec76f2335
MD5 | 74e42d6ac5f5cf9a7875b1a57c427a25
BLAKE2b-256 | 66777c9f62bee7745e11e0bbc2036256b18f9c86b36a38cffcca1a0caf617bf2