NCBI DNA Database Builder and Dataset Loader for ML pipelines (Enigma2).
EnigmaDB
Dataset generation pipeline for Enigma2, built on the NCBI database. It also ships helpers such as a Dataset class to retrieve sequences and create batches for training ML models.
Prerequisites
Before setting up EnigmaDB, ensure that you have the following prerequisites installed:
- Python 3.8 or higher
- pip (Python package installer)
Dependencies
Installation
1. From PyPI
pip install enigmadatabase
2. Clone the Repo
git clone https://github.com/delveopers/EnigmaDataset.git
cd EnigmaDataset
Documentation & Usage
Data gathering pipeline
from EnigmaDB import Database, EntrezQueries
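EMAIL = "you@example.com" # placeholder: contact email sent to NCBI
API_KEY = "your-ncbi-api-key" # placeholder: your NCBI API key
queries = EntrezQueries() # get queries
db = Database(topics=queries(), out_dir="./data/raw", email=EMAIL, api_key=API_KEY, retmax=1500, max_rate=10) # set parameters
db.build(with_index=False) # start building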
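EMAIL and API_KEY in the snippet above are the credentials passed to NCBI. One way to keep them out of source code is to read them from environment variables. This is a minimal sketch: the helper `load_ncbi_credentials` and the variable names `NCBI_EMAIL` and `NCBI_API_KEY` are assumptions made for illustration and are not part of EnigmaDB.

```python
import os


def load_ncbi_credentials(env=None):
    """Return (email, api_key) read from environment variables.

    NCBI_EMAIL and NCBI_API_KEY are hypothetical names chosen for this
    sketch; EnigmaDB itself does not define or read them.
    """
    env = os.environ if env is None else env
    email = env.get("NCBI_EMAIL")
    api_key = env.get("NCBI_API_KEY")
    if not email:
        raise RuntimeError("Set NCBI_EMAIL before building the database")
    if not api_key:
        raise RuntimeError("Set NCBI_API_KEY before building the database")
    return email, api_key


# Usage with the pipeline above (requires enigmadatabase to be installed):
# EMAIL, API_KEY = load_ncbi_credentials()
# db = Database(topics=queries(), out_dir="./data/raw", email=EMAIL,
#               api_key=API_KEY, retmax=1500, max_rate=10)
```

Failing early with a clear message avoids starting a long download run with missing credentials.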
Creating Indexes
from EnigmaDB import create_index
create_index("./data/raw") # add path to data
Converting formats
from EnigmaDB import convert_fasta
convert_fasta(input_dir="./data/raw", output_dir="./data/parquet", mode='parquet') # write Parquet files
convert_fasta(input_dir="./data/raw", output_dir="./data/csv", mode='csv') # write CSV files
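To make the conversion step concrete, here is a standard-library sketch of what turning FASTA text into CSV rows can look like. It is not the implementation of `convert_fasta`: the function names `fasta_to_rows` and `rows_to_csv` and the column names `id` and `sequence` are assumptions for illustration only.

```python
import csv
import io


def fasta_to_rows(text):
    """Parse FASTA text into a list of (record_id, sequence) tuples."""
    rows = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                rows.append((header, "".join(chunks)))
            parts = line[1:].split()
            # Keep only the first whitespace-delimited token as the id
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later
            chunks.append(line)
    if header is not None:
        rows.append((header, "".join(chunks)))
    return rows


def rows_to_csv(rows):
    """Serialize (record_id, sequence) tuples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "sequence"])
    writer.writerows(rows)
    return buffer.getvalue()


sample = ">seq1 Homo sapiens\nACGT\nTTAA\n>seq2\nGGCC\n"
print(rows_to_csv(fasta_to_rows(sample)))
```

Wrapped sequence lines are joined into one string per record, so each record becomes a single row in the tabular output.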
For more technical information, refer to the documentation in docs/ (Database.md and Dataset.md).
Project Structure
├── docs/
│   ├── Database.md
│   └── Dataset.md
├── src/
│   ├── __init__.py
│   ├── _database.py # ``Database`` class for downloading data from NCBI
│   ├── _dataset.py # ``Dataset`` a dataloader class for enigma2
│   └── _queries.py # contains queries for DB pipeline
├── README.md
├── setup.py
├── pyproject.toml
└── requirements.txt # List of Python dependencies
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository.
- Create a feature branch:
git checkout -b feature-name
- Commit your changes:
git commit -m "Add feature"
- Push to the branch:
git push origin feature-name
- Create a pull request.
Please make sure to update tests as appropriate.
License
MIT License. Check out License for more info.