Collect data from popular data repositories with ease.
PyCurator
Making data extraction and curation as easy as py.
PyCurator allows users to easily query research repositories without the trouble of reading through API
documentation. Data curation is now as easy as $ pycurator. Whether you want the ease of clicking
some buttons and getting the data or the flexibility of modifying the query format, PyCurator provides a simple
UI for quickly retrieving data, built on top of an extensible collection of Web and API scraper classes.
Supported Repositories
PyCurator currently supports the following repositories in the capacities listed. Authentication is required only for Kaggle, though it may provide runtime benefits for Dryad, as rate limiting is relaxed for authenticated users.
Repository | Authentication |
---|---|
Dataverse | |
Dryad | Dryad |
Figshare | |
Kaggle | Kaggle |
OpenML | |
Papers With Code | |
Zenodo | |
If there's a repository that you would like to see added to the list, check out the Contributions section.
Installation and use
Installation
Dependencies are provided in the requirements.txt file.
It is recommended to create a virtual environment to ensure there is no conflict with the packages
in your current work space.
PyCurator requires a Python version >= 3.10.
To install and run PyCurator, paste the following commands into your terminal:
git clone https://github.com/michaelbaluja/PyCurator.git
cd PyCurator
python -m pip install -e .
pycurator
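Since PyCurator requires Python 3.10 or newer, it can help to confirm the interpreter version before installing. The snippet below is a minimal sketch of such a check. The function name `meets_minimum` is made up for illustration and is not part of PyCurator.

```python
import sys

MINIMUM_PYTHON = (3, 10)  # minimum version stated in this README


def meets_minimum(version=None, minimum=MINIMUM_PYTHON):
    """Return True when `version` is at least `minimum`.

    `version` defaults to the running interpreter. This helper is
    hypothetical and only illustrates the version requirement.
    """
    if version is None:
        version = sys.version_info
    # Compare only the (major, minor) parts.
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if meets_minimum():
        print("Python version is new enough for PyCurator.")
    else:
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Python {found} found, but PyCurator needs 3.10 or newer.")
```

Running this with the interpreter inside your virtual environment confirms that the environment itself, and not only the system Python, meets the requirement.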
Use
Repository Selection
After running the commands above, you will be met with the landing page, which contains licensing, funding, and
copyright information. Clicking Continue will bring you to the repository selection page.
Parameter Selection
Clicking on one of the repositories will bring up the respective parameters used for querying the API and saving your results. Parameters will vary depending on repository selected.
The parameters are outlined below:
Parameter | Description |
---|---|
Save Directory | Location to save results. Defaults to "/data/{repo_name}/{search_term}_{search_type}.json" within the PyCurator /data sub-directory. |
Search Terms | Search term(s) to query. Terms should be separated with a comma, and multi-word terms should be wrapped in quotes. |
Search Types | Type of objects to query. |
After all required parameters are provided, the Run button is activated.
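To make the Search Terms and Save Directory conventions concrete, here is a minimal sketch of how a comma-separated term string with quoted multi-word terms could be split, and how the default file name pattern could be filled in. This is not PyCurator's own code. The helper names, the relative `data` base directory, and the example values `"zenodo"` and `"datasets"` are assumptions chosen for illustration.

```python
import csv
from pathlib import Path


def parse_search_terms(raw):
    """Split a comma-separated string into terms.

    Multi-word terms wrapped in double quotes are kept whole.
    Hypothetical helper, not PyCurator's actual parser.
    """
    # skipinitialspace lets a quote that follows ", " still act as a quote.
    rows = csv.reader([raw], skipinitialspace=True)
    terms = next(rows, [])
    return [term.strip() for term in terms if term.strip()]


def default_save_path(repo_name, search_term, search_type, base_dir="data"):
    """Fill in the {repo_name}/{search_term}_{search_type}.json pattern.

    `base_dir` is an assumed stand-in for the PyCurator data sub-directory.
    """
    return Path(base_dir) / repo_name / f"{search_term}_{search_type}.json"


if __name__ == "__main__":
    # Example values are illustrative only.
    for term in parse_search_terms('climate, "machine learning", genomics'):
        print(default_save_path("zenodo", term, "datasets"))
```

Using the standard `csv` module avoids hand-written quote handling, so a term such as "machine learning" stays a single query instead of being split at the space or treated as two terms.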
Run Page
The run page provides high-level status updates in the main window. These include the beginning and end of processes, rate-limiting issues, runtime completion, and saving confirmation. Below the main window are real-time status updates for the specific query being completed, as well as a progress bar for the high-level task. During tasks that have a fixed duration, such as metadata querying or some web scraping, a fixed-length progress bar shows how far the task has progressed. During tasks that have an indeterminate duration, a cycling progress bar indicates that work is continuing.
At the bottom are the navigation buttons. To avoid unnecessary queries, the Back button is unresponsive
during runtime, but is activated after completion. The Stop button is used to interrupt runtime and stop querying.
After runtime completion or interruption, the Stop button is replaced by the Exit button, allowing you to
safely terminate the program.
Contributions
Bugs
Please note that as of Spring 2022, PyCurator is still undergoing active development. For any bugs or problems that you come across, open an issue that details the problem that you're experiencing.
Extension
Know of an API that you think should be included in PyCurator? Create a Pull Request outlining the API and why you think it would be beneficial, and make sure to follow the format set out through the existing Scraper classes.
Funding and acknowledgements
The initial development of this program was funded by the Librarians Association of The University of California (LAUC) and UC San Diego Library Research Data Curation Program (RDCP).
Thank you to Matt Peters, Dan LaSusa, John Chen, Joshua Weimer, and Amy Ly for their feedback during testing of early iterations of PyCurator.