NCBI DNA Database Builder and Dataset Loader for ML pipelines (Enigma2).

EnigmaDB

Dataset generation pipeline for Enigma2, built on the NCBI database. It also includes helpers such as a Dataset class to retrieve data and create batches for training ML models.

Prerequisites

Before setting up EnigmaDB, ensure that you have the following prerequisites installed:

  1. Python 3.8 or higher
  2. pip (Python package installer)

Dependencies

Installation

1. From PyPI

pip install enigmadb

2. Clone the Repo

git clone https://github.com/delveopers/EnigmaDataset.git
cd EnigmaDataset

Documentation & Usage

Data gathering pipeline

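from EnigmaDB import Database, EntrezQueries

EMAIL = "you@example.com"      # placeholder: contact email passed to NCBI
API_KEY = "your-ncbi-api-key"  # placeholder: your own NCBI API key

queries = EntrezQueries()   # get queries

db = Database(topics=queries(), out_dir="./data/raw", email=EMAIL, api_key=API_KEY, retmax=1500, max_rate=10)   # set parameters
db.build(with_index=False)  # start building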
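
The max_rate=10 argument above suggests a cap on requests per second (NCBI permits up to 10 requests per second when an API key is supplied). How EnigmaDB enforces this internally is not described here, so the following is only a minimal sketch of one common approach, spacing calls evenly. The RateLimiter class and its clock and sleep parameters are hypothetical names for illustration, not part of EnigmaDB.

```python
import time


class RateLimiter:
    """Allow at most `max_rate` calls per second by spacing them evenly.

    Hypothetical helper for illustration; not EnigmaDB's implementation.
    """

    def __init__(self, max_rate, clock=time.monotonic, sleep=time.sleep):
        if max_rate <= 0:
            raise ValueError("max_rate must be positive")
        self.min_interval = 1.0 / max_rate
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_allowed = float("-inf")  # earliest start of the next call

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following call one interval after this one starts.
        self._next_allowed = max(now + delay, self._clock()) + self.min_interval
        return delay
```

A download loop would call limiter.wait() immediately before each request, so that ten requests take roughly one second in total.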

Creating Indexes

from EnigmaDB import create_index

create_index("./data/raw")    # add path to data
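
The on-disk format that create_index produces is documented in Database.md rather than here. To show the general idea of indexing FASTA data, the sketch below maps each record ID to the byte offset of its header line, so a single record can be read without scanning the whole file. build_fasta_index and read_record are hypothetical helpers written for this illustration and do not reflect EnigmaDB's actual index layout.

```python
def build_fasta_index(path):
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    offset = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                # The record ID is the first whitespace-delimited token.
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii", "replace")] = offset
            offset += len(line)
    return index


def read_record(path, offset):
    """Return (header, sequence) for the record starting at `offset`."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        header = handle.readline().decode("ascii", "replace").strip()[1:]
        chunks = []
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.strip().decode("ascii", "replace"))
    return header, "".join(chunks)
```

With such an index, read_record(path, index["some_id"]) seeks straight to the record, which matters when the raw files are large.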

Converting versions

from EnigmaDB import convert_fasta

convert_fasta(input_dir="./data/raw", output_dir="./data/parquet", mode='parquet')  # for parquet
convert_fasta(input_dir="./data/raw", output_dir="./data/csv", mode='csv')  # for csv
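
For readers unfamiliar with what such a conversion involves, here is a minimal standard-library sketch that turns one FASTA file into a CSV with one row per record. It is an illustration only: the column names (id, sequence, length) and the functions parse_fasta and fasta_to_csv are assumptions made for this example, and the columns convert_fasta actually writes may differ.

```python
import csv


def parse_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            chunks = []
        elif record_id is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if record_id is not None:
        yield record_id, "".join(chunks)


def fasta_to_csv(fasta_path, csv_path):
    """Write one CSV row per FASTA record; return the number of records."""
    count = 0
    with open(fasta_path, "r", encoding="utf-8") as src, \
         open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["id", "sequence", "length"])  # assumed column names
        for record_id, sequence in parse_fasta(src):
            writer.writerow([record_id, sequence, len(sequence)])
            count += 1
    return count
```

Because parse_fasta is a generator, records are streamed one at a time and the whole file never has to fit in memory.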

For more technical information, refer to the documentation:

  1. Database.md
  2. Dataset.md

Project Structure

├── docs/
│   ├── Database.md
│   └── Dataset.md
├── src/
│   ├── __init__.py
│   ├── _database.py    # ``Database`` class for downloading data from NCBI
│   ├── _dataset.py     # ``Dataset`` a dataloader class for enigma2
│   └── _queries.py     # contains queries for DB pipeline
├── README.md
├── setup.py
├── pyproject.toml
└── requirements.txt    # List of Python dependencies

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository.

  2. Create a feature branch:

git checkout -b feature-name

  3. Commit your changes:

git commit -m "Add feature"

  4. Push to the branch:

git push origin feature-name

  5. Create a pull request.

Please make sure to update tests as appropriate.

License

MIT License. Check out License for more info.
