A package to access sciencedata.dk
sddk
This is a Python package for writing and reading files to/from sciencedata.dk. It is especially designed for working with shared folders. It relies mainly on the Python requests library.
sciencedata.dk is a project managed by DEiC (Danish e-Infrastructure Cooperation) that aims to offer a robust data storage, data management and data publication solution for researchers in Denmark and abroad (see docs and dev for more info). The storage is accessible through (1) the web interface, (2) WebDAV clients or (3) an API relying on the HTTP protocol. One of the strengths of sciencedata.dk is that it currently supports institutional login from 2976 research and educational institutions around the globe (using WAYF). That makes it well suited for international research collaboration.
The main functionality of the package is uploading a Python object (str, dict, list, dataframe or figure) as a file to a preselected personal or shared folder and getting it back into Python as the original Python object. It uses the sciencedata.dk API in combination with the Python requests library.
Install and import
To install and import the package within your Python environment (e.g. a Jupyter notebook), run:
!pip install sddk # to be updated, use flag "--ignore-installed"
import sddk ### import all functions
Session configuration
To run the main configuration function below, you need to know the following:
- your sciencedata.dk username (e.g. "123456@au.dk" or "kase@zcu.cz"),
- your sciencedata.dk password (has to be previously configured in the sciencedata.dk web interface).
If you want to access a shared folder, you also need:
- the name of the shared folder you want to access (e.g. "our_shared_folder"),
- the username of the owner of the folder (if it is not yours).
(Do not worry, you will be asked to input these values interactively while running the function.)
To configure a personal session, run:
conf = sddk.configure()
Configuration with root in shared folder
To configure a session pointing to a shared folder, run:
conf = sddk.configure("our_shared_folder", "owner_username@au.dk")
Running this function creates a tuple variable conf containing two objects:
- s: a requests session authorized by your username and password
- sddk_url: default URL address (endpoint) for your requests
conf is later used as input for write_file() and read_file().
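As a rough illustration of how the two parts of conf fit together, the sketch below unpacks a stand-in tuple and joins the endpoint with a relative path. The endpoint value is an assumption inferred from the success message printed by write_file(), the session is replaced by None so the code runs offline, and build_target_url is a hypothetical helper, not a function of the package.

```python
# Stand-in for the tuple returned by sddk.configure(). The real first element
# is an authorized requests session; None is used here so the sketch runs offline.
session = None
sddk_url = "https://sciencedata.dk/files/"  # assumed endpoint of a personal session
conf = (session, sddk_url)


def build_target_url(path_and_filename, conf):
    """Hypothetical helper: join the session endpoint with a relative path."""
    _, base_url = conf  # conf is (session, endpoint)
    return base_url.rstrip("/") + "/" + path_and_filename.lstrip("/")


print(build_target_url("test_string.txt", conf))
```

Keeping the session and the endpoint in one variable means a single argument carries everything the reading and writing functions need.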
write_file()
The most important components of the package are two functions: write_file(path_and_filename, python_object, conf) and read_file(path_and_filename, type_of_object, conf).
So far these functions can be used with several different types of Python objects: str, list, dictionary, pandas' dataframe and matplotlib's figure. These can be written as .txt, .json or .png files, based simply on the filename ending chosen by the user. Here are simple instances of these Python objects to play with:
### Python "str" object
string_object = "string content"
### Python "list" object
list_object = ['a', 'b', 'c', 'd']
### Python "dictionary" object
dict_object = {"a" : 1, "b" : 2, "c":3 }
### Pandas dataframe object
import pandas as pd
dataframe_object = pd.DataFrame([("a1", "b1", "c1"), ("a2", "b2", "c2")], columns=["a", "b", "c"])
### Matplotlib figure object
import matplotlib.pyplot as plt
figure_object = plt.figure() # generate object
plt.plot(range(10)) # fill it by plotted values
### (the same also works for plotly figures)
The simplest example is writing a string object into a text file located in our home folder (remember that, since the configuration, this home folder is contained within the sddk_url variable):
sddk.write_file("test_string.txt", string_object, conf)
If everything is fine, you will receive the following message:
> Your <class 'str'> object has been succefully written as "https://sciencedata.dk/files/test_string.txt"
However, there are a couple of things which might go wrong: you can choose an unsupported Python object, a non-existent path or an unsupported file format. The function captures some of these cases. For instance, once you run sddk.write_file("nonexistent_folder/filename.wtf", string_object, conf), you will be interactively asked for corrections. First, the function checks whether the path is correct. Second, once the path is corrected to an existing folder (here it is "personal_folder"), the function inspects whether the filename has a known ending (i.e. txt, json, feather, or png). If not, it asks you interactively for a correction. Third, it checks whether the folder already contains a file of the same name (to avoid unintended overwriting), and if yes, asks you what to do. Finally, it prints out where you can find your file and what type of object it encapsulates.
>>> The path is not valid. Try different path and filename: textfile.wtf
>>> Unsupported file format. Type either "txt", "json", or "png": txt
>>> A file with the same name ("textfile.txt") already exists in this location.
Press Enter to overwrite it or choose different path and filename: textfile2.txt
>>> Your <class 'str'> object has been succefully written as "https://sciencedata.dk/files/textfile2.txt"
The same function works with dictionaries, lists, Matplotlib's figures and especially Pandas' dataframes. Pandas' dataframe is my favorite. I send 1GB+ dataframes there and back as json or feather files on a daily basis. See the examples below.
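Returning to the checks that write_file() performs before writing: the sequence (path, filename ending, name collision) can be pictured with a small non-interactive sketch. It is only an illustration under stated assumptions. check_target, the folder set and the file set are hypothetical, and the real function asks for corrections interactively instead of returning a list.

```python
import posixpath

SUPPORTED_ENDINGS = {"txt", "json", "feather", "png"}  # endings named in the text above


def check_target(path_and_filename, existing_folders, existing_files):
    """Hypothetical, non-interactive version of the three checks.

    Returns a list of problems; an empty list means the write could proceed.
    """
    problems = []
    folder, filename = posixpath.split(path_and_filename)
    # 1. The folder part must exist ("" stands for the session root).
    if folder and folder not in existing_folders:
        problems.append("invalid path")
    # 2. The filename ending must be one of the supported formats.
    ending = posixpath.splitext(filename)[1].lstrip(".").lower()
    if ending not in SUPPORTED_ENDINGS:
        problems.append("unsupported file format")
    # 3. Flag a file of the same name to avoid unintended overwriting.
    if path_and_filename in existing_files:
        problems.append("file already exists")
    return problems


folders = {"personal_folder"}
files = {"textfile.txt"}
print(check_target("nonexistent_folder/filename.wtf", folders, files))
print(check_target("textfile.txt", folders, files))
print(check_target("textfile2.txt", folders, files))
```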
read_file()
On the other side, we have the function sddk.read_file(path_and_filename, object_type, conf), which enables us to read our files back into Python as chosen Python objects. Currently, the function can read only text files as strings, and json files as either dictionaries, lists or Pandas' dataframes. You have to specify the type of object as the second argument; the values are either "str", "list", "dict" or "df" within quotation marks, as in these examples:
string_object = sddk.read_file("test_string.txt", "str", conf)
string_object
>>> 'string content'
list_object = sddk.read_file("simple_list.json", "list", conf)
list_object
>>> ['a', 'b', 'c', 'd']
dict_object = sddk.read_file("simple_dict.json", "dict", conf)
dict_object
>>> {'a': 1, 'b': 2, 'c': 3}
dataframe_object = sddk.read_file("simple_df.json", "df", conf)
dataframe_object
>>> a b c
0 a1 b1 c1
1 a2 b2 c2
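The reason a list or a dictionary survives the trip through a .json file can be seen with the standard library alone. The sketch below is not the package's code and involves no upload; it only shows that JSON text parses back into equal Python objects.

```python
import json

list_object = ["a", "b", "c", "d"]
dict_object = {"a": 1, "b": 2, "c": 3}

# Serialize to JSON text, which is what a .json file stores.
list_as_json = json.dumps(list_object)
dict_as_json = json.dumps(dict_object)

# Parsing the text returns plain Python objects again.
list_back = json.loads(list_as_json)
dict_back = json.loads(dict_as_json)

print(type(list_back).__name__, list_back)
print(type(dict_back).__name__, dict_back)
```

A dataframe saved as JSON also parses into plain lists or dictionaries, which is one plausible reason the object type has to be given explicitly when reading.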
Examples
pandas.DataFrame to .json and back
import pandas as pd
dataframe_object = pd.DataFrame([("a1", "b1", "c1"), ("a2", "b2", "c2")], columns=["a", "b", "c"])
dataframe_object
| | a | b | c |
---|---|---|---|
0 | a1 | b1 | c1 |
1 | a2 | b2 | c2 |
sddk.write_file("simple_dataframe.json", dataframe_object, conf)
> Your <class 'pandas.core.frame.DataFrame'> object has been succefully written as "https://sciencedata.dk/files/simple_dataframe.json"
sddk.read_file("simple_dataframe.json", "df", conf)
| | a | b | c |
---|---|---|---|
0 | a1 | b1 | c1 |
1 | a2 | b2 | c2 |
Reading a larger file from a public folder
%%time
EDH_sample = sddk.read_file("EDH_sample.json", "df", "8fe7d59de1eafe5f8eaebc0044534606")
EDH_sample.head(5)
# this is an example usage of public folder, see below for explanation.
| | diplomatic_text | literature | trismegistos_uri | id | findspot_ancient | not_before | type_of_inscription | work_status | edh_geography_uri | not_after | ... | external_image_uris | religion | fotos | geography | military | social_economic_legal_history | coordinates | text_cleaned | origdate_text | objecttype |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | D M / NONIAE P F OPTATAE / ET C IVLIO ARTEMONI... | AE 1983, 0192.; M. Annecchino, Puteoli 4/5, 19... | https://www.trismegistos.org/text/251193 | HD000001 | Cumae, bei | 0071 | epitaph | provisional | https://edh-www.adw.uni-heidelberg.de/edh/geog... | 0130 | ... | None | None | None | None | None | None | 40.8471577,14.0550756 | Dis Manibus Noniae Publi filiae Optatae et Cai... | 71 AD – 130 AD | [Tafel, 257] |
1 | C SEXTIVS PARIS / QVI VIXIT / ANNIS LXX | AE 1983, 0080. (A); A. Ferrua, RAL 36, 1981, 1... | https://www.trismegistos.org/text/265631 | HD000002 | Roma | 0051 | epitaph | no image | https://edh-www.adw.uni-heidelberg.de/edh/geog... | 0200 | ... | None | None | None | None | None | None | 41.895466,12.482324 | Caius Sextius Paris qui vixit annis LXX ... | 51 AD – 200 AD | [Tafel, 257] |
2 | [ ]VMMIO [ ] / [ ]ISENNA[ ] / [ ] XV[ ] / [ ] / [ | AE 1983, 0518. (B); J. González, ZPE 52, 1983,... | https://www.trismegistos.org/text/220675 | HD000003 | None | 0131 | honorific inscription | provisional | https://edh-www.adw.uni-heidelberg.de/edh/geog... | 0170 | ... | None | None | None | None | None | None | 37.37281,-6.04589 | Publio Mummio Publi filio Galeria Sisennae Rut... | 131 AD – 170 AD | [Statuenbasis, 57] |
3 | [ ]AVS[ ]LLA / M PORCI NIGRI SER / DOMINAE VEN... | AE 1983, 0533. (B); A.U. Stylow, Gerión 1, 198... | https://www.trismegistos.org/text/222102 | HD000004 | Ipolcobulcula | 0151 | votive inscription | checked with photo | https://edh-www.adw.uni-heidelberg.de/edh/geog... | 0200 | ... | [http://cil-old.bbaw.de/test06/bilder/datenban... | names of pagan deities | None | None | None | None | 37.4442,-4.27471 | AVSLLA Marci Porci Nigri serva dominae Veneri ... | 151 AD – 200 AD | [Altar, 29] |
4 | [ ] L SVCCESSVS / [ ] L L IRENAEVS / [ ] C L T... | AE 1983, 0078. (B); A. Ferrua, RAL 36, 1981, 1... | https://www.trismegistos.org/text/265629 | HD000005 | Roma | 0001 | epitaph | no image | https://edh-www.adw.uni-heidelberg.de/edh/geog... | 0200 | ... | None | None | None | None | None | None | 41.895466,12.482324 | libertus Successus Luci libertus Irenaeus C... | 1 AD – 200 AD | [Stele, 250] |
5 rows × 40 columns
pandas.DataFrame to .feather and back
This might cause issues because of the way pandas implements pyarrow and feather. To work with feather, check that you have installed a correct version of the pyarrow package:
import pyarrow
pyarrow.__version__
You need 0.17.1 or higher. Google colab comes with 0.14.1 by default, so you have to upgrade:
!pip install pyarrow --upgrade
and restart your runtime.
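If you prefer to check the version requirement in code, a minimal sketch follows. It assumes a purely numeric dotted version string such as "0.17.1"; version_tuple and is_recent_enough are hypothetical helpers, and versions with suffixes (e.g. release candidates) would need a proper version parser.

```python
def version_tuple(version):
    """Turn a plain dotted version such as "0.17.1" into (0, 17, 1)."""
    return tuple(int(part) for part in version.split("."))


def is_recent_enough(installed, required="0.17.1"):
    """Compare versions numerically, so that "0.9.0" sorts below "0.17.1"."""
    return version_tuple(installed) >= version_tuple(required)


print(is_recent_enough("0.14.1"))
print(is_recent_enough("0.17.1"))
```

In a notebook, the installed version could be passed in as is_recent_enough(pyarrow.__version__).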
Originally, sddk 1.9-2.4 specified the requirement pyarrow>=0.17.1, but it produced a lot of conflicts during installation on Google Colab, since there are many other packages requiring pyarrow==0.14.1. Therefore, pyarrow is currently bypassed.
sddk.write_file("simple_dataframe.feather", dataframe_object, conf)
> Your <class 'pandas.core.frame.DataFrame'> object has been succefully written as "https://sciencedata.dk/files/simple_dataframe.feather"
sddk.read_file("simple_dataframe.feather", "df", conf)
| | a | b | c |
---|---|---|---|
0 | a1 | b1 | c1 |
1 | a2 | b2 | c2 |
Reading a larger file from a public folder
%%time
EDH_sample = sddk.read_file("EDH_sample.feather", "df", "8fe7d59de1eafe5f8eaebc0044534606")
EDH_sample.head(5)
| | diplomatic_text | literature | trismegistos_uri | id | findspot_ancient | not_before | type_of_inscription | work_status | edh_geography_uri | not_after | ... | external_image_uris | religion | fotos | geography | military | social_economic_legal_history | coordinates | text_cleaned | origdate_text | objecttype |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | D M / NONIAE P F OPTATAE / ET C IVLIO ARTEMONI... | AE 1983, 0192.; M. Annecchino, Puteoli 4/5, 19... | https://www.trismegistos.org/text/251193 | HD000001 | Cumae, bei | 0071 | epitaph | provisional | https://edh-www.adw.uni-heidelberg.de/edh/geog... | 0130 | ... | NaN | None | NaN | None | None | None | 40.8471577,14.0550756 | Dis Manibus Noniae Publi filiae Optatae et Cai... | 71 AD – 130 AD | NaN |
1 | C SEXTIVS PARIS / QVI VIXIT / ANNIS LXX | AE 1983, 0080. (A); A. Ferrua, RAL 36, 1981, 1... | https://www.trismegistos.org/text/265631 | HD000002 | Roma | 0051 | epitaph | no image | https://edh-www.adw.uni-heidelberg.de/edh/geog... | 0200 | ... | NaN | None | NaN | None | None | None | 41.895466,12.482324 | Caius Sextius Paris qui vixit annis LXX ... | 51 AD – 200 AD | NaN |
2 | [ ]VMMIO [ ] / [ ]ISENNA[ ] / [ ] XV[ ] / [ ] / [ | AE 1983, 0518. (B); J. González, ZPE 52, 1983,... | https://www.trismegistos.org/text/220675 | HD000003 | None | 0131 | honorific inscription | provisional | https://edh-www.adw.uni-heidelberg.de/edh/geog... | 0170 | ... | NaN | None | NaN | None | None | None | 37.37281,-6.04589 | Publio Mummio Publi filio Galeria Sisennae Rut... | 131 AD – 170 AD | NaN |
3 | [ ]AVS[ ]LLA / M PORCI NIGRI SER / DOMINAE VEN... | AE 1983, 0533. (B); A.U. Stylow, Gerión 1, 198... | https://www.trismegistos.org/text/222102 | HD000004 | Ipolcobulcula | 0151 | votive inscription | checked with photo | https://edh-www.adw.uni-heidelberg.de/edh/geog... | 0200 | ... | NaN | names of pagan deities | NaN | None | None | None | 37.4442,-4.27471 | AVSLLA Marci Porci Nigri serva dominae Veneri ... | 151 AD – 200 AD | NaN |
4 | [ ] L SVCCESSVS / [ ] L L IRENAEVS / [ ] C L T... | AE 1983, 0078. (B); A. Ferrua, RAL 36, 1981, 1... | https://www.trismegistos.org/text/265629 | HD000005 | Roma | 0001 | epitaph | no image | https://edh-www.adw.uni-heidelberg.de/edh/geog... | 0200 | ... | NaN | None | NaN | None | None | None | 41.895466,12.482324 | libertus Successus Luci libertus Irenaeus C... | 1 AD – 200 AD | NaN |
5 rows × 40 columns
sddk.write_file("EDH_sample.feather", EDH_sample, conf)
> Your <class 'pandas.core.frame.DataFrame'> object has been succefully written as "https://sciencedata.dk/files/EDH_sample.feather"
pandas.DataFrame to .csv and back
import pandas as pd
dataframe_object = pd.DataFrame([("a1", "b1", "c1"), ("a2", "b2", "c2")], columns=["a", "b", "c"])
dataframe_object
| | a | b | c |
---|---|---|---|
0 | a1 | b1 | c1 |
1 | a2 | b2 | c2 |
sddk.write_file("simple_dataframe.csv", dataframe_object, conf)
> Your <class 'pandas.core.frame.DataFrame'> object has been succefully written as "https://sciencedata.dk/files/simple_dataframe.csv"
sddk.read_file("simple_dataframe.csv", "df", conf)
| | a | b | c |
---|---|---|---|
0 | a1 | b1 | c1 |
1 | a2 | b2 | c2 |
list_filenames()
This function enables you to list all files within a directory. You can specify the directory, the type of file you are interested in and the conf variable. For instance, the call below returns all JSON files within your main directory.
sddk.list_filenames(filetype="json", conf=conf)
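The idea of filtering by file type can be pictured locally with a plain list of names. The sketch below does not contact sciencedata.dk and makes no claim about how list_filenames() is implemented or what it returns; filter_by_filetype and the sample names are hypothetical.

```python
def filter_by_filetype(filenames, filetype):
    """Hypothetical helper: keep names whose ending matches filetype (e.g. "json")."""
    suffix = "." + filetype.lower().lstrip(".")
    return [name for name in filenames if name.lower().endswith(suffix)]


names = ["simple_dataframe.json", "test_string.txt", "simple_dict.json", "figure.png"]
print(filter_by_filetype(names, "json"))
```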
Personal, shared and public folders
Shared in and out
One of the main strengths of sciencedata.dk is its collaborative features, namely the way you can manage its shared and public folders.
Shared folders always have one of two forms: either (1) a shared folder you share with some users or (2) a shared folder someone else shares with you.
Each shared folder has its owner. The folders are located in their owner's personal space and can be easily accessed by them from there like any other personal folder.
However, in the case of shared folders you do not own (i.e. which were shared with you by someone else) you also need to know the username of their owner.
One of the key features of the sddk package is that it enables you to access both types of shared folders using exactly the same command, regardless of whether you are their owner or not. This means that all members of a team accessing a folder owned and shared by one member can use the same code. The function just checks both options and chooses what works.
For instance, a project member with username member1@inst.org created a folder in his personal space called team_folder, uploaded there a file called textfile.txt, and shared the folder with his teammates with usernames member2@inst.org and member3@inst.org. All of them can now access the file using the same series of commands:
# configure session with access to the shared folder:
conf = sddk.configure("team_folder", "member1@inst.org")
# read the file located in this shared folder:
sddk.read_file("textfile.txt", "str", conf)
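The "check both options and choose what works" behaviour can be sketched as a simple fallback over candidate locations. Everything concrete here is an assumption made for illustration: the two example.org URLs are placeholders rather than real sciencedata.dk endpoints, and the probe is a stand-in for an actual HTTP request.

```python
def first_working(candidates, probe):
    """Return the first candidate for which probe(candidate) is true, else None."""
    for candidate in candidates:
        if probe(candidate):
            return candidate
    return None


# Placeholder endpoints: the folder as seen by its owner and as seen by a teammate.
owner_view = "https://example.org/owner-view/team_folder/"
member_view = "https://example.org/member-view/team_folder/"

# Stand-in for a real HTTP check; in this sketch only the member view "responds".
reachable = {member_view}
chosen = first_working([owner_view, member_view], lambda url: url in reachable)
print(chosen)
```

Because the same candidate list is tried for every user, the owner and the teammates can run identical code and each ends up with the location that works for them.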
Public files and folders
Sciencedata.dk also enables you to create public files and folders. These files and folders can be accessed using the sddk.read_file() function even without a sciencedata.dk account. You just have to know the share link code of the file or folder. To read a public file, you can use:
public_file_code = "3e0a55a4182de313e04523360cecd015"
gospels_cleaned = sddk.read_file("https://sciencedata.dk/public/" + public_file_code, "dict")
To read a specific file within a public folder, you can use the code below, i.e. you can replace the conf parameter with the sharing code of the public folder.
public_folder_code = "31b393e2afe1ee96ce81869c7efe18cb"
c_aristotelicum = sddk.read_file("https://sciencedata.dk/public/" + public_folder_code + "/c_aristotelicum.json", "df", public_folder_code)
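The public addresses used in the two calls above follow one pattern, which the sketch below collects into a single helper. public_url is a hypothetical name introduced here for illustration; only the URL prefix and the share codes come from the examples above.

```python
PUBLIC_ROOT = "https://sciencedata.dk/public/"  # prefix used in the examples above


def public_url(share_code, filename=None):
    """Hypothetical helper: URL of a public file, or of a file inside a public folder."""
    url = PUBLIC_ROOT + share_code
    if filename is not None:
        url += "/" + filename
    return url


print(public_url("3e0a55a4182de313e04523360cecd015"))
print(public_url("31b393e2afe1ee96ce81869c7efe18cb", "c_aristotelicum.json"))
```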
Credit
The package is continuously developed and maintained by Vojtěch Kaše as a part of the digital collaborative research workflow of the SDAM project at Aarhus University, Denmark. To cite this package, use:
Version history
- 2.9 - .eps file format for matplotlib figures support (plotly works only with .png)
- 2.8.2 - plotly support
- 2.7 - resolving issues #1 (reading public json files) & #2 (beautifulsoup import)
- 2.6 - pyarrow avoided
- 2.5 - pyarrow version changed back to unspecified
- 2.4 - json encoding bug removed
- 2.3 - json encoding
- 2.2 - setup.py update
- 2.1 - README.md update
- 2.0 - tested with .txt, .json, .feather and .png
- 1.9 - supports public files and folders; supports .feather file format (utf-8 enforced)
- 1.8 - list_filenames() function and configure() alias added
- 1.7 - figures
- 1.6.1 - bug
- 1.6 - enables writing dataframes as csv
- 1.5 - reads individually shared files without necessary configuration
- 1.4 - json package dependency
- 1.3 - conf corrected
- 1.2 - conf variable added
- 1.1 - a simple correction
- 1.0 - functions write_file() and read_file() added
- 0.1.2 - redirection added
- 0.1.1 - added shared folder owner argument to the main configuration function; migration from test.pypi to real pypi
- 0.0.8 - shared folders reading&writing for ordinary users finally functional
- 0.0.7 - configuration of individual session by default
- 0.0.6 - first functional configuration