Skip to main content

utilities for working with data in the vicinity of Git and git-annex

Project description

DataSALad

PyPI version fury.io Documentation Status Hatch project

This is a pure-Python library with a collection of utilities for working with data in the vicinity of Git and git-annex. While this is a foundational library from and for the DataLad project, its implementations are standalone, and are meant to be equally well usable outside the DataLad system.

A focus of this library is efficient communication with subprocesses, such as Git or git-annex commands, which read and produce data in some format.

Here is a demo of what can be accomplished with this library. The following code queries a remote git-annex repository via a git annex find command running over an SSH connection in batch-mode. The output in JSON-lines format is then itemized and decoded to native Python data types. Both inputs and outputs are iterables with meaningful items, even though at a lower level information is transmitted as an arbitrarily chunked byte stream.

>>> from more_itertools import intersperse
>>> from pprint import pprint
>>> from datasalad.runners import iter_subproc
>>> from datasalad.itertools import (
...     itemize,
...     load_json,
... )

>>> # a bunch of photos we are interested in
>>> interesting = [
...     b'DIY/IMG_20200504_205821.jpg',
...     b'DIY/IMG_20200505_082136.jpg',
... ]

>>> # run `git-annex find` on a remote server in a repository
>>> # that has these photos in the worktree.
>>> with iter_subproc(
...     ['ssh', 'photos@pididdy.local',
...      'git -C "collections" annex find --json --batch'],
...     # the remote process is fed the file names,
...     # and a newline after each one to make git-annex write
...     # a report in JSON-lines format
...     inputs=intersperse(b'\n', interesting),
... ) as remote_annex:
...     # we loop over the output of the remote process.
...     # this is originally a byte stream downloaded in arbitrary
...     # chunks, so we itemize at any newline separator.
...     # each item is then decoded from JSON-lines format to
...     # native datatypes
...     for rec in load_json(itemize(remote_annex, sep=b'\n')):
...         # for this demo we just pretty-print it
...         pprint(rec)
{'backend': 'SHA256E',
 'bytesize': '3357612',
 'error-messages': [],
 'file': 'DIY/IMG_20200504_205821.jpg',
 'hashdirlower': '853/12f/',
 'hashdirmixed': '65/qp/',
 'humansize': '3.36 MB',
 'key': 'SHA256E-s3357612--700a52971714c2707c2de975f6015ca14d1a4cdbbf01e43d73951c45cd58c176.jpg',
 'keyname': '700a52971714c2707c2de975f6015ca14d1a4cdbbf01e43d73951c45cd58c176.jpg',
 'mtime': 'unknown'}
{'backend': 'SHA256E',
 'bytesize': '3284291',
 ...

Developing with datasalad

API stability is important, just as adequate semantic versioning, and informative changelogs.

Public vs internal API

Anything that can be imported directly from any of the sub-packages in datasalad is considered to be part of the public API. Changes to this API determine the versioning, and development is done with the aim to keep this API as stable as possible. This includes signatures and return value behavior.

As an example: from datasalad.runners import iter_git_subproc imports a part of the public API, but from datasalad.runners.git import iter_git_subproc does not.

Use of the internal API

Developers can obviously use parts of the non-public API. However, this should only be done with the understanding that these components may change from one release to another, with no guarantee of transition periods, deprecation warnings, etc.

Developers are advised to never reuse any components with names starting with _ (underscore). Their use should be limited to their individual subpackage.

Contributing

Contributions to this library are welcome! Please see the contributing guidelines for details on scope and style of potential contributions.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datasalad-0.7.0.tar.gz (61.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

datasalad-0.7.0-py3-none-any.whl (65.9 kB view details)

Uploaded Python 3

File details

Details for the file datasalad-0.7.0.tar.gz.

File metadata

  • Download URL: datasalad-0.7.0.tar.gz
  • Upload date:
  • Size: 61.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: Hatch/1.16.2 cpython/3.12.11 HTTPX/0.28.1

File hashes

Hashes for datasalad-0.7.0.tar.gz
Algorithm Hash digest
SHA256 3533808013ee906f77c092ab5b6e51edc26b0513bf8a7e1ee60937eb9f3a5150
MD5 7d632ba09cf7a03de0d19aafc34b75e6
BLAKE2b-256 8c616a55ec63862a8d12431bae44a340e31eaab125788be63a07ed07d0dfab26

See more details on using hashes here.

File details

Details for the file datasalad-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: datasalad-0.7.0-py3-none-any.whl
  • Upload date:
  • Size: 65.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: Hatch/1.16.2 cpython/3.12.11 HTTPX/0.28.1

File hashes

Hashes for datasalad-0.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 87fd074b0c56082f3e2d1194dfc4b6af40b38bffbbb9cbbf9d758de078766bb8
MD5 6febae58f47c1afd93f190d585a3cf35
BLAKE2b-256 cd4e82997e0c41c508bbea101181bebf4901ee496a7048dae3f0eeeaf2cef555

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page