Datashare Tarentula

CLI toolbelt for Datashare.

     /      \
  \  \  ,,  /  /
   '-.`\()/`.-'
  .--_'(  )'_--.
 / /` /`""`\ `\ \
  |  |  ><  |  |
  \  \      /  /
      '.__.'

Usage: tarentula [OPTIONS] COMMAND [ARGS]...

Options:
  --syslog-address      TEXT    localhost   Syslog address
  --syslog-port         INTEGER 514         Syslog port
  --syslog-facility     TEXT    local7      Syslog facility
  --stdout-loglevel     TEXT    ERROR       Change the default log level for stdout error handler
  --help                                    Show this message and exit
  --version                                 Show the installed version of Tarentula

Commands:
  aggregate
  count
  clean-tags-by-query
  download
  export-by-query
  list-metadata
  tagging
  tagging-by-query


Installation

You can install Datashare Tarentula with your favorite package manager:

pip3 install --user tarentula

Or alternatively with Docker:

docker run icij/datashare-tarentula

Usage

Datashare Tarentula comes with basic commands to interact with a Datashare instance (running locally or on a remote server). Primarily focused on bulk actions, it provides both a CLI and a Python API.

Cookbook 👩‍🍳

To learn more about how to use Datashare Tarentula with a list of examples, please refer to the Cookbook.

Count

A command to just count the number of files matching a query.

Usage: tarentula count [OPTIONS]

Options:
  --datashare-url           TEXT        Datashare URL
  --datashare-project       TEXT        Datashare project
  --elasticsearch-url       TEXT        You can additionally pass the Elasticsearch
                                          URL in order to use scrolling capabilities of
                                          Elasticsearch (useful when dealing with a
                                          lot of results)
  --query                   TEXT        The query string to filter documents
  --cookies                 TEXT        Key/value pair to add a cookie to each
                                          request to the API. You can
                                          separate pairs with semicolons: key1=val1;key2=val2;...
  --apikey                  TEXT        Datashare authentication apikey
  --traceback / --no-traceback          Display a traceback in case of error
  --type [Document|NamedEntity]         Type of indexed documents to download
  --help                                Show this message and exit
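
As a rough illustration of how these options combine, the sketch below assembles a `tarentula count` command line from Python. Only the flags listed above are used; the helper name `build_count_command` and the sample values are hypothetical, and the semicolon-joined cookie string simply follows the format given in the `--cookies` help text.

```python
import shlex


def build_count_command(datashare_url, project, query, cookies=None):
    """Return the argument list for a `tarentula count` call (hypothetical helper)."""
    argv = [
        "tarentula", "count",
        "--datashare-url", datashare_url,
        "--datashare-project", project,
        "--query", query,
    ]
    if cookies:
        # Join key/value pairs with semicolons, as the --cookies help describes.
        pairs = ";".join(f"{key}={value}" for key, value in cookies.items())
        argv += ["--cookies", pairs]
    return argv


command = build_count_command(
    "http://localhost:8080",  # sample values, not defaults confirmed for this command
    "local-datashare",
    "*",
    cookies={"key1": "val1", "key2": "val2"},
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where Tarentula is installed.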

Clean Tags by Query

A command that uses Elasticsearch update-by-query feature to batch untag documents directly in the index.

Usage: tarentula clean-tags-by-query [OPTIONS]

Options:
  --datashare-project       TEXT        Datashare project
  --elasticsearch-url       TEXT        Elasticsearch URL which is used to perform
                                          update by query
  --cookies                 TEXT        Key/value pair to add a cookie to each
                                          request to the API. You can
                                          separate pairs with semicolons: key1=val1;key2=val2;...
  --apikey                  TEXT        Datashare authentication apikey
  --traceback / --no-traceback          Display a traceback in case of error
  --wait-for-completion / --no-wait-for-completion
                                        Create an Elasticsearch task to perform the
                                          update asynchronously
  --query                   TEXT        Give a JSON query to filter documents that
                                          will have their tags cleaned. It can be
                                          a file with @path/to/file. Default to all.
  --help                                Show this message and exit
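
The `--query` option accepts either inline JSON or an `@path/to/file` reference. A minimal sketch of that convention is shown below; it is not Tarentula's own parser, the function name `load_query` is hypothetical, and the `match_all` body used for the "all documents" default is an assumption about what "Default to all" means.

```python
import json
from pathlib import Path


def load_query(argument=None):
    """Resolve a --query style argument into a Python dict (illustrative only)."""
    if not argument:
        # Assumption: "Default to all" corresponds to a match_all query.
        return {"query": {"match_all": {}}}
    if argument.startswith("@"):
        # "@path/to/file" means: read the JSON from that file.
        text = Path(argument[1:]).read_text(encoding="utf-8")
    else:
        # Anything else is treated as inline JSON.
        text = argument
    return json.loads(text)


# Inline JSON; the "tags" field name is a placeholder for this example.
inline = load_query('{"query": {"term": {"tags": "Atypidae"}}}')
print(inline)
```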

Download

A command to download all files matching a query.

Usage: tarentula download [OPTIONS]

Options:
  --apikey TEXT                   Datashare authentication apikey
  --datashare-url TEXT            Datashare URL
  --datashare-project TEXT        Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)

  --query TEXT                    The query string to filter documents
  --destination-directory TEXT    Directory documents will be downloaded
  --throttle INTEGER              Request throttling (in ms)
  --cookies TEXT                  Key/value pair to add a cookie to each
                                  request to the API. You can
                                  separate pairs with semicolons: key1=val1;key2=val2;...

  --path-format TEXT              Downloaded document path template
  --scroll TEXT                   Scroll duration
  --source TEXT                   A comma-separated list of fields to include
                                  in the downloaded document from the index

  -f, --from INTEGER              Passed to the search it will bypass the
                                  first n documents
  -l, --limit INTEGER             Limit the total results to return
  --sort-by TEXT                  Field to use to sort results
  --order-by [asc|desc]           Order to use to sort results
  --once / --not-once             Download file only once
  --traceback / --no-traceback    Display a traceback in case of error
  --progressbar / --no-progressbar
                                  Display a progressbar
  --raw-file / --no-raw-file      Download raw file from Datashare
  --type [Document|NamedEntity]   Type of indexed documents to download
  --help                          Show this message and exit.

Export by Query

A command to export all files matching a query.

Usage: tarentula export-by-query [OPTIONS]

Options:
  --apikey TEXT                   Datashare authentication apikey
  --datashare-url TEXT            Datashare URL
  --datashare-project TEXT        Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)

  --query TEXT                    The query string to filter documents
  --output-file TEXT              Path to the CSV file
  --throttle INTEGER              Request throttling (in ms)
  --cookies TEXT                  Key/value pair to add a cookie to each
                                  request to the API. You can
                                  separate pairs with semicolons: key1=val1;key2=val2;...

  --scroll TEXT                   Scroll duration
  --source TEXT                   A comma-separated list of fields to include
                                  in the export

  --sort-by TEXT                  Field to use to sort results
  --order-by [asc|desc]           Order to use to sort results
  --traceback / --no-traceback    Display a traceback in case of error
  --progressbar / --no-progressbar
                                  Display a progressbar
  --type [Document|NamedEntity]   Type of indexed documents to download
  -f, --from INTEGER              Passed to the search it will bypass the
                                  first n documents
  -l, --limit INTEGER             Limit the total results to return
  --size INTEGER                  Size of the scroll request that powers the
                                  operation.

  --query-field / --no-query-field
                                  Add the query to the export CSV
  --help                          Show this message and exit.

Tagging

A command to batch tag documents with a CSV file.

Usage: tarentula tagging [OPTIONS] CSV_PATH

Options:
  --datashare-url       TEXT        http://localhost:8080   Datashare URL
  --datashare-project   TEXT        local-datashare         Datashare project
  --throttle            INTEGER     0                       Request throttling (in ms)
  --cookies             TEXT        _Empty string_          Key/value pair to add a cookie to each request to the API. You can separate pairs with semicolons: key1=val1;key2=val2;...
  --apikey              TEXT        None                    Datashare authentication apikey
  --traceback / --no-traceback                              Display a traceback in case of error
  --progressbar / --no-progressbar                          Display a progressbar
  --help                                                    Show this message and exit

CSV formats

Tagging with a documentId and routing:

tag,documentId,routing
Actinopodidae,l7VnZZEzg2fr960NWWEG,l7VnZZEzg2fr960NWWEG
Antrodiaetidae,DWLOskax28jPQ2CjFrCo
Atracidae,6VE7cVlWszkUd94XeuSd,vZJQpKQYhcI577gJR0aN
Atypidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
Barychelidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
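
A CSV in this layout can be produced with Python's standard `csv` module. The sketch below is only an illustration: the helper name `tagging_csv` is hypothetical, and the rows reuse identifiers from the sample above.

```python
import csv
import io

# Rows reuse tags and identifiers from the sample CSV above.
rows = [
    {"tag": "Actinopodidae", "documentId": "l7VnZZEzg2fr960NWWEG", "routing": "l7VnZZEzg2fr960NWWEG"},
    {"tag": "Atracidae", "documentId": "6VE7cVlWszkUd94XeuSd", "routing": "vZJQpKQYhcI577gJR0aN"},
]


def tagging_csv(rows):
    """Serialize rows into the tag,documentId,routing layout (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["tag", "documentId", "routing"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(tagging_csv(rows))
```

Writing the returned string to a file gives a `CSV_PATH` argument for `tarentula tagging`.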

Tagging with a documentUrl:

tag,documentUrl
Mecicobothriidae,http://localhost:8080/#/d/local-datashare/DbhveTJEwQfJL5Gn3Zgi/DbhveTJEwQfJL5Gn3Zgi
Microstigmatidae,http://localhost:8080/#/d/local-datashare/iuL6GUBpO7nKyfSSFaS0/iuL6GUBpO7nKyfSSFaS0
Migidae,http://localhost:8080/#/d/local-datashare/BmovvXBisWtyyx6o9cuG/BmovvXBisWtyyx6o9cuG
Nemesiidae,http://localhost:8080/#/d/local-datashare/vZJQpKQYhcI577gJR0aN/vZJQpKQYhcI577gJR0aN
Paratropididae,http://localhost:8080/#/d/local-datashare/vYl1C4bsWphUKvXEBDhM/vYl1C4bsWphUKvXEBDhM
Porrhothelidae,http://localhost:8080/#/d/local-datashare/fgCt6JLfHSl160fnsjRp/fgCt6JLfHSl160fnsjRp
Theraphosidae,http://localhost:8080/#/d/local-datashare/WvwVvNjEDQJXkwHISQIu/WvwVvNjEDQJXkwHISQIu
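
Judging from the samples above, a documentUrl appears to carry the project, the document ID and the routing in its fragment (`#/d/<project>/<id>/<routing>`). The sketch below splits a URL on that assumption; it is inferred from the examples only, and `parse_document_url` is a hypothetical name rather than part of Tarentula.

```python
from urllib.parse import urlsplit


def parse_document_url(url):
    """Split a Datashare document URL into its parts (layout inferred from the samples)."""
    # For ".../#/d/local-datashare/<id>/<routing>" the fragment is "/d/local-datashare/<id>/<routing>".
    fragment = urlsplit(url).fragment
    parts = fragment.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "d":
        raise ValueError(f"Unexpected document URL layout: {url}")
    _, project, document_id, routing = parts
    return {"project": project, "documentId": document_id, "routing": routing}


parsed = parse_document_url(
    "http://localhost:8080/#/d/local-datashare/DbhveTJEwQfJL5Gn3Zgi/DbhveTJEwQfJL5Gn3Zgi"
)
print(parsed)
```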

Tagging by Query

A command that uses Elasticsearch update-by-query feature to batch tag documents directly in the index.

To see an example input file, refer to this JSON.

Usage: tarentula tagging-by-query [OPTIONS] JSON_PATH

Options:
  --datashare-project       TEXT        Datashare project
  --elasticsearch-url       TEXT        Elasticsearch URL which is used to perform
                                          update by query
  --throttle                INTEGER     Request throttling (in ms)
  --cookies                 TEXT        Key/value pair to add a cookie to each
                                          request to the API. You can
                                          separate pairs with semicolons: key1=val1;key2=val2;...
  --apikey                  TEXT        Datashare authentication apikey
  --traceback / --no-traceback          Display a traceback in case of error
  --progressbar / --no-progressbar      Display a progressbar
  --wait-for-completion / --no-wait-for-completion
                                        Create an Elasticsearch task to perform the
                                          update asynchronously
  --help                                Show this message and exit

List Metadata

You can list the metadata fields declared in the mapping. With the --count parameter, the command also counts how many times each field occurs in the index. Counting is disabled by default.

The --filter_by parameter restricts the listing to the metadata properties of a specific set of documents. For instance, to list only email-related properties, use: --filter_by "contentType=message/rfc822"

$ tarentula list-metadata --help
Usage: tarentula list-metadata [OPTIONS]

Options:
  --datashare-project TEXT       Datashare project
  --elasticsearch-url TEXT       You can additionally pass the Elasticsearch
                                 URL in order to use scrolling capabilities of
                                 Elasticsearch (useful when dealing with a lot
                                 of results)
  --type [Document|NamedEntity]  Type of indexed documents to get metadata
  --filter_by TEXT               Filter documents by pairs concatenated by
                                 comma of field names and values separated by
                                 =. Example:
                                 "contentType=message/rfc822,contentType=message/rfc822"
  --count / --no-count           Count or not the number of docs for each
                                 property found

  --help                         Show this message and exit.
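
The `--filter_by` value is a comma-separated list of `field=value` pairs. The sketch below shows one way such a string could be split; it is not Tarentula's own parser, `parse_filter_by` is a hypothetical name, and the `extractionLevel` field is used purely as a sample.

```python
def parse_filter_by(value):
    """Split "field=value,field=value" into a list of (field, value) tuples."""
    pairs = []
    for item in value.split(","):
        if not item.strip():
            continue  # Ignore empty chunks such as a trailing comma.
        field, separator, field_value = item.partition("=")
        if not separator:
            raise ValueError(f"Missing '=' in filter: {item!r}")
        pairs.append((field.strip(), field_value.strip()))
    return pairs


filters = parse_filter_by("contentType=message/rfc822,extractionLevel=0")
print(filters)
```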

Aggregate

You can run aggregations on the data: this command exposes part of the Elasticsearch aggregations API. The available operations are:

  • count: groups documents by the distinct values of a given field and counts the documents in each group.
  • nunique: returns the number of unique values of a given field.
  • date_histogram: returns counts of values grouped by month or year for a given date field.
  • sum: returns the sum of the values of a numeric field.
  • min: returns the minimum of the values of a numeric field.
  • max: returns the maximum of the values of a numeric field.
  • avg: returns the average of the values of a numeric field.
  • stats: returns a set of statistics for a given numeric field.
  • string_stats: returns a set of string statistics for a given string field.

$ tarentula aggregate --help
Usage: tarentula aggregate [OPTIONS]

Options:
  --apikey TEXT                   Datashare authentication apikey
  --datashare-url TEXT            Datashare URL
  --datashare-project TEXT        Datashare project
  --elasticsearch-url TEXT        You can additionally pass the Elasticsearch
                                  URL in order to use scrolling capabilities of
                                  Elasticsearch (useful when dealing with a
                                  lot of results)
  --query TEXT                    The query string to filter documents
  --cookies TEXT                  Key/value pair to add a cookie to each
                                  request to the API. You can
                                  separate pairs with semicolons: key1=val1;key2=val2;...
  --traceback / --no-traceback    Display a traceback in case of error
  --type [Document|NamedEntity]   Type of indexed documents to download
  --group_by TEXT                 Field to use to aggregate results
  --operation_field TEXT          Field to run the operation on
  --run [count|nunique|date_histogram|sum|stats|string_stats|min|max|avg]
                                  Operation to run
  --calendar_interval [year|month]
                                  Calendar interval for date histogram
                                  aggregation
  --help                          Show this message and exit.
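
To make the operations more concrete, the sketch below maps each `--run` value to the Elasticsearch aggregation it most plausibly corresponds to. The mapping is an assumption based on the descriptions above, the function takes a single field for simplicity, and the requests Tarentula actually sends may differ. The `aggregation_body` name, the `result` aggregation key and the sample field names are hypothetical.

```python
# Assumed correspondence between --run values and Elasticsearch metric aggregations.
METRICS = {
    "nunique": "cardinality",
    "sum": "sum",
    "min": "min",
    "max": "max",
    "avg": "avg",
    "stats": "stats",
    "string_stats": "string_stats",
}


def aggregation_body(run, field, calendar_interval="year"):
    """Build an Elasticsearch search body for one operation (illustrative only)."""
    if run == "count":
        # Bucket documents by distinct field values and count each bucket.
        aggregation = {"terms": {"field": field}}
    elif run == "date_histogram":
        aggregation = {
            "date_histogram": {"field": field, "calendar_interval": calendar_interval}
        }
    elif run in METRICS:
        aggregation = {METRICS[run]: {"field": field}}
    else:
        raise ValueError(f"Unsupported operation: {run}")
    # size=0 asks Elasticsearch for the aggregation only, without document hits.
    return {"size": 0, "aggs": {"result": aggregation}}


print(aggregation_body("count", "contentType"))
print(aggregation_body("date_histogram", "extractionDate", calendar_interval="month"))
```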

Following your changes

Elasticsearch changes on big datasets can take a very long time. Because we kept curling ES to check that a task was still running, we added a small utility to follow the changes. It draws a live graph of a given ES indicator with a specified filter.

It uses matplotlib and python3-tk.

If you see the following message:

$ graph_es
graph_realtime.py:32: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure

Then you have to install tkinter, i.e. python3-tk for Debian/Ubuntu.

The command has the options below:

$ graph_es --help
Usage: graph_es [OPTIONS]

Options:
  --query               TEXT        Give a JSON query to filter documents. It can be
                                      a file with @path/to/file. Default to all.
  --index               TEXT        Elasticsearch index (default local-datashare)
  --refresh-interval    INTEGER     Graph refresh interval in seconds (default 5s)
  --field               TEXT        Field value to display over time (default "hits.total")
  --elasticsearch-url   TEXT        Elasticsearch URL which is used to perform
                                      update by query (default http://elasticsearch:9200)
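
The `--field` option names a value by a dotted path such as `hits.total`. A minimal sketch of resolving such a path in a response is shown below; the `get_field` helper and the sample response are hypothetical, and the utility itself may read the value differently. Recent Elasticsearch versions may also return `hits.total` as an object rather than a plain number.

```python
def get_field(response, dotted_path):
    """Follow a dotted path such as "hits.total" through nested dictionaries."""
    value = response
    for key in dotted_path.split("."):
        value = value[key]
    return value


# Hand-written sample shaped like a search response; not real output.
response = {"hits": {"total": 1250, "hits": []}}
print(get_field(response, "hits.total"))
```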

Configuration File

Tarentula supports several sources for configuring its behavior, including an INI file and command-line options.

The configuration file is searched for in the following order (the first file found is used and all others are ignored):

  • TARENTULA_CONFIG (environment variable if set)
  • tarentula.ini (in the current directory)
  • ~/.tarentula.ini (in the home directory)
  • /etc/tarentula/tarentula.ini

The file should use the following format (all values below are optional):

[DEFAULT]
apikey = SECRETHALONOPROCTIDAE
datashare_url = http://here:8080
datashare_project = local-datashare

[logger]
syslog_address = 127.0.0.0
syslog_port = 514
syslog_facility = local7
stdout_loglevel = INFO
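
The lookup order and file format described above can be sketched with Python's standard `configparser`. This is an illustration of the documented behavior, not Tarentula's own loading code; the function names `candidate_paths` and `load_config` are hypothetical.

```python
import configparser
import os
from pathlib import Path


def candidate_paths(environ=os.environ, cwd=None, home=None):
    """List configuration file candidates in the documented search order."""
    cwd = Path(cwd) if cwd else Path.cwd()
    home = Path(home) if home else Path.home()
    paths = []
    if environ.get("TARENTULA_CONFIG"):
        # The environment variable, when set, is checked first.
        paths.append(Path(environ["TARENTULA_CONFIG"]))
    paths += [
        cwd / "tarentula.ini",
        home / ".tarentula.ini",
        Path("/etc/tarentula/tarentula.ini"),
    ]
    return paths


def load_config(paths):
    """Parse the first existing file and ignore the remaining candidates."""
    parser = configparser.ConfigParser()
    for path in paths:
        if path.is_file():
            parser.read(path, encoding="utf-8")
            break
    return parser
```

For example, `load_config(candidate_paths())["DEFAULT"].get("datashare_url")` returns the configured URL, or `None` when no file sets it. With `configparser`, values in `[DEFAULT]` are also visible from other sections such as `[logger]`.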

Testing

To test this tool, you must have Datashare and Elasticsearch running on your development machine.

After installing Datashare, run it with a test project and user:

datashare -p test-datashare -u test

In a separate terminal, install the development dependencies:

make install

Finally, run the tests:

make test

Releasing

The releasing process uses Poetry to manage versions, and a GitHub Actions workflow to publish both the Python package to PyPI and the multi-arch Docker image to Docker Hub whenever a GitHub release is published.

Each step below assumes you are on master with a clean working tree and that tests pass (make test).

1. Bump the version

Pick the semver level that matches your changes and run the corresponding target. This bumps pyproject.toml, creates a commit, and tags the new version locally:

make bump-patch   # backwards-compatible bug fixes
make bump-minor   # backwards-compatible features
make bump-major   # breaking changes

On success, the target prints the next steps with the new tag filled in.
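
For reference, the effect of each semver level on a version number can be sketched as follows. This only illustrates the semver rule; the actual bump is performed by Poetry through the make targets, and the `bump` function here is hypothetical.

```python
def bump(version, level):
    """Return the next version for a "major", "minor" or "patch" bump."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"  # Breaking changes reset minor and patch.
    if level == "minor":
        return f"{major}.{minor + 1}.0"  # New features reset patch.
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown level: {level}")


for level in ("patch", "minor", "major"):
    print(level, bump("4.5.0", level))
```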

2. Push the commit and tag

git push --follow-tags

This pushes the release commit along with the newly created tag to GitHub.

3. Create a GitHub release

Use the GitHub CLI to create a release with auto-generated notes from the commit history:

gh release create "$(git describe --tags --abbrev=0)" --generate-notes

Alternatively, open the new release page and select the tag manually.

Publishing the release triggers the Release workflow, which builds and publishes the package to PyPI and the multi-arch Docker image to Docker Hub. Watch the workflow run on the Actions tab to make sure both jobs succeed.

Manual fallback

If the CI workflow is unavailable, you can publish from your machine. This requires being a maintainer of the PyPI project and a member of the ICIJ organization on Docker Hub, with credentials configured locally.

Publish to PyPI:

make distribute

Build and push the multi-arch Docker image (run make docker-setup-multiarch once to configure buildx, see the Docker documentation):

make docker-publish
