Cli toolbelt for Datashare.
Project description
Datashare Tarentula
Cli toolbelt for Datashare.
/ \
\ \ ,, / /
'-.`\()/`.-'
.--_'( )'_--.
/ /` /`""`\ `\ \
| | >< | |
\ \ / /
'.__.'
Usage: tarentula [OPTIONS] COMMAND [ARGS]...
Options:
--syslog-address TEXT localhost Syslog address
--syslog-port INTEGER 514 Syslog port
--syslog-facility TEXT local7 Syslog facility
--stdout-loglevel TEXT ERROR Change the default log level for stdout error handler
--help Show this message and exit
Commands:
tagging
tagging-by-query
clean-tags-by-query
download
Tagging
A command to batch tag documents with a CSV file.
Usage: tarentula tagging [OPTIONS] CSV_PATH
Options:
--datashare-url TEXT http://localhost:8080 Datashare URL
--datashare-project TEXT local-datashare Datashare project
--throttle INTEGER 0 Request throttling (in ms)
--cookies TEXT _Empty string_ Key/value pair to add a cookie to each request to the API. You can separate semicolons: key1=val1;key2=val2;...
--apikey TEXT None Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar Display a progressbar
--help Show this message and exit
CSV formats
Tagging with a documentId
and routing
:
tag,documentId,routing
Actinopodidae,l7VnZZEzg2fr960NWWEG,l7VnZZEzg2fr960NWWEG
Antrodiaetidae,DWLOskax28jPQ2CjFrCo
Atracidae,6VE7cVlWszkUd94XeuSd,vZJQpKQYhcI577gJR0aN
Atypidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
Barychelidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
Tagging with a documentUrl
:
tag,documentUrl
Mecicobothriidae,http://localhost:8080/#/d/local-datashare/DbhveTJEwQfJL5Gn3Zgi/DbhveTJEwQfJL5Gn3Zgi
Microstigmatidae,http://localhost:8080/#/d/local-datashare/iuL6GUBpO7nKyfSSFaS0/iuL6GUBpO7nKyfSSFaS0
Migidae,http://localhost:8080/#/d/local-datashare/BmovvXBisWtyyx6o9cuG/BmovvXBisWtyyx6o9cuG
Nemesiidae,http://localhost:8080/#/d/local-datashare/vZJQpKQYhcI577gJR0aN/vZJQpKQYhcI577gJR0aN
Paratropididae,http://localhost:8080/#/d/local-datashare/vYl1C4bsWphUKvXEBDhM/vYl1C4bsWphUKvXEBDhM
Porrhothelidae,http://localhost:8080/#/d/local-datashare/fgCt6JLfHSl160fnsjRp/fgCt6JLfHSl160fnsjRp
Theraphosidae,http://localhost:8080/#/d/local-datashare/WvwVvNjEDQJXkwHISQIu/WvwVvNjEDQJXkwHISQIu
Tagging by Query
A command that uses Elasticsearch update-by-query
feature to batch tag documents directly in the index.
To see an example of input file, refer to this JSON.
Usage: tarentula tagging-by-query [OPTIONS] JSON_PATH
Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query
--throttle INTEGER Request throttling (in ms)
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can
separatesemicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar
Display a progressbar
--wait-for-completion / --no-wait-for-completion
Create a Elasticsearch task to perform the
updateasynchronously
--help Show this message and exit.
Clean Tags by Query
A command that uses Elasticsearch update-by-query
feature to batch untag documents directly in the index.
Usage: tarentula clean-tags-by-query [OPTIONS]
Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can
separatesemicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--wait-for-completion / --no-wait-for-completion
Create a Elasticsearch task to perform the
updateasynchronously
--query TEXT Give a JSON query to filter documents that
will have their tags cleaned. It can be
afile with @path/to/file. Default to all.
--help Show this message and exit.
Download
A command to download all files matching a query.
Usage: tarentula download [OPTIONS]
Options:
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use scrollingcapabilities of
Elasticsearch (useful when dealing with a
lot of results)
--query TEXT The query string to filter documents
--destination-directory TEXT Directory documents will be downloaded
--throttle INTEGER Request throttling (in ms)
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can
separatesemicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--path-format TEXT Downloaded document path template
--scroll TEXT Scroll duration
--source TEXT A comma-separated list of field to include
in the downloaded document from the index
--once / --not-once Download file only once
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar
Display a progressbar
--raw-file / --no-raw-file Download raw file from Datashare
--type [Document|NamedEntity] Type of indexed documents to download
--help Show this message and exit.
Testing
To test this tool, you must have Datashare and Elasticsearch running on your development machine.
You can adapt the tests to your dev environment. For example in a dockerized env, with the following variables :
export TEST_ELASTICSEARCH_URL=http://elasticsearch:9200
export TEST_DATASHARE_URL=http://datashare:8080
the tests will use http://elasticsearch:9200
and http://datashare:8080
After you installed Datashare, just run it with a test project/user:
datashare -p test-datashare -u test
In a separate terminal, install the development dependencies:
sudo apt install pipenv
make install
Finally, run the test
make test
following your changes
When running Elasticsearch changes on big datasets, it could take a very long time. As we were curling ES to see if the task was still running well, we added a small utility to follow the changes. It makes a live graph of a provided ES indicator with a specified filter.
It uses mathplotlib and python3-tk.
If you see the following message :
$ graph_es
graph_realtime.py:32: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure
Then you have to install tkinter, i.e. python3-tk for Debian/Ubuntu.
The command has the options below:
$ graph_es --help
Usage: graph_es [OPTIONS]
Options:
--query TEXT Give a JSON query to filter documents. It can be
a file with @path/to/file. Default to all.
--index TEXT Elasticsearch index (default local-datashare)
--refresh-interval INTEGER Graph refresh interval in seconds (default 5s)
--field TEXT Field value to display over time (default "hits.total")
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query (default http://elasticsearch:9200)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for tarentula-2.3.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | ccae789f001998e75f4448fed0cfd3e6890ddc4216dc45bd9c3690cc4c87b54f |
|
MD5 | a1baa15a5c25377f38639474450abd18 |
|
BLAKE2b-256 | 394c515a91df15c48c4b152cf37c7479d7cee0c41f210e8a09bd9a1981d363f6 |