NCBI DNA Database Builder and Dataset Loader for ML pipelines (Enigma2).

EnigmaDB

Dataset generation pipeline for Enigma2, built on the NCBI database. It also includes helpers such as a Dataset class to retrieve data and create batches for training ML models.

Prerequisites

Before setting up EnigmaDB, ensure that you have the following prerequisites installed:

  1. Python 3.8 or higher
  2. pip (Python package installer)

Dependencies

Installation

1. From PyPI

pip install enigmadatabase

2. Clone the Repo

git clone https://github.com/delveopers/EnigmaDataset.git
cd EnigmaDataset

Documentation & Usage

Data gathering pipeline

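from EnigmaDB import Database, EntrezQueries

EMAIL = "you@example.com"       # placeholder: contact email passed to NCBI
API_KEY = "your-ncbi-api-key"   # placeholder: your NCBI API key

queries = EntrezQueries()   # get queries

db = Database(topics=queries(), out_dir="./data/raw", email=EMAIL, api_key=API_KEY, retmax=1500, max_rate=10)   # set parameters
db.build(with_index=False)  # start building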
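
The `email` and `api_key` arguments have to come from somewhere, and keeping them out of source files is usually safer. The following is a minimal sketch of one way to do that, not part of EnigmaDB: the helper name `load_ncbi_credentials` and the environment variable names `NCBI_EMAIL` and `NCBI_API_KEY` are assumptions chosen for illustration.

```python
import os


def load_ncbi_credentials(env=None):
    """Return (email, api_key) read from environment variables.

    NCBI_EMAIL and NCBI_API_KEY are hypothetical variable names chosen
    for this sketch; EnigmaDB itself does not define them.
    """
    env = os.environ if env is None else env
    email = env.get("NCBI_EMAIL")
    api_key = env.get("NCBI_API_KEY")  # None when the variable is unset
    if not email:
        raise RuntimeError("Set NCBI_EMAIL before building the database")
    return email, api_key
```

The returned pair can then be passed as `email=` and `api_key=` when constructing `Database`. Whether `Database` accepts `api_key=None` is not stated in this README, so check Database.md before relying on that.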

Creating Indexes

from EnigmaDB import create_index

create_index("./data/raw")    # add path to data

Converting versions

from EnigmaDB import convert_fasta

convert_fasta(input_dir="./data/raw", output_dir="./data/parquet", mode='parquet')  # for parquet
convert_fasta(input_dir="./data/raw", output_dir="./data/csv", mode='csv')  # for csv
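
To make the conversion concrete, here is a minimal standard-library sketch of turning FASTA records into CSV rows. It is an illustration of the idea only, not how `convert_fasta` is implemented: the helper names `read_fasta` and `fasta_to_csv` and the `id,sequence` column layout are assumptions.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap; join them later
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_csv(fasta_handle, csv_handle):
    """Write one CSV row per FASTA record and return the record count."""
    writer = csv.writer(csv_handle)
    writer.writerow(["id", "sequence"])  # assumed column layout
    count = 0
    for header, sequence in read_fasta(fasta_handle):
        fields = header.split()
        record_id = fields[0] if fields else ""
        writer.writerow([record_id, sequence])
        count += 1
    return count


# Usage with in-memory streams
source = io.StringIO(">seq1 example\nACGT\nTT\n>seq2\nGG\n")
target = io.StringIO()
written = fasta_to_csv(source, target)
```

The same functions work with real files opened in text mode; when writing CSV to a file, open it with `newline=""` as the `csv` module documentation recommends.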

For more technical information, refer to the documentation:

  1. Database.md
  2. Dataset.md

Project Structure

├── docs/
│   ├── Database.md
│   ├── Dataset.md
├── src/
│   ├── __init__.py
│   ├── _database.py    # ``Database`` class for downloading data from NCBI
│   ├── _dataset.py     # ``Dataset`` dataloader class for Enigma2
│   ├── _queries.py     # contains queries for the DB pipeline
├── README.md
├── setup.py
├── pyproject.toml
├── requirements.txt    # list of Python dependencies

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository.

  2. Create a feature branch:

git checkout -b feature-name

  3. Commit your changes:

git commit -m "Add feature"

  4. Push to the branch:

git push origin feature-name

  5. Create a pull request.

Please make sure to update tests as appropriate.

License

MIT License. See the License file for more information.
