Use biological data to generate SQL database schema
Project description
BioMetaDB v0.1.3
Installation
pip3 install wheel && pip3 install BioMetaDB
Such dependencies are best maintained within a separate Python virtual environment.
About
BioMetaDB is a data-management package that automatically generates database schemas, minimizing the amount of temporary files used and handling migrations and updates smoothly. This SQLAlchemy-based wrapper package allows researchers and data engineers to quickly generate Python classes and objects for fast data-searching, abstracting searches to a SQL-based API and allowing end-users to focus on query building for analysis. Finally, this schema is specifically useful for tsv-based output, such as output from annotation suites, and will allow users to assign output from these files to existing data records and files quickly and efficiently.
Database projects are organized into file-systems:
database/
classes/
db_table_1.json
db_table_2.json
config/
database.ini
db/
db_table_1/
data_file_1.faa
data_file_2.fna
data_file_3.fasta
db_table_2/
data_file_1.fastq
data_file_2.fq
migrations/
########.migrations.mgt
File systems created by BioMetaDB are accessible in Python scripts:
from BioMetaDB import get_table
table = get_table("/path/to/database", table_name="db_table_1")
# Alternatively,
# table = get_table("/path/to/database", alias="table_1")
table.query() # Returns all values
table.query("n50 >= 100000") # Simple filter query
table.query("n50 >= 100000 AND n75 <= 1000000") # More complex filter query
table.query("_id LIKE '%genome%'") # Has genome in name of record
# Additional SQLAlchemy operations...
BioMetaDB returns objects that directly inherit from SQLAlchemy data types, meaning that (most, if not all) functions that are available to SQLAlchemy objects are available here.
Why bother?
Data retention
Data is commonly output as .tsv or .csv files, and gathering data from multiple .tsv files (e.g. determining the most statistically likely annotation, combining output from multiple software suites, etc) can be a tedious process that frequently can require the creation of multiple temporary files. This software suite minimizes this problem by using the column names from .tsv files as the column names in the database table.
# For a .tsv file formatted as:
File_name Column.1 column_2
data_file_1.faa 25 4.7
data_file_2.fna 100 12.5
# dbdm would generate database objects with attributes accessible as:
table.query()
table[0]._id # Returns data_file_1.faa
table[0].column_1 # Returns 25
table[1].column_2 # Returns 12.5
getattr(table[1], "_id") # Returns data_file_2.fna
Notice that column names are automatically adjusted to match typical SQL requirements. This system translates column output from .tsv files into an SQL schema, saving data files to a project-level directory and automatically handling data storage, which allows for more centralized data tracking.
Additional functionality
For a complete summary of the functionality afforded by using the classes Record and RecordList, see the complete documentation to come.
-
Record
- all objects generated by the database.__repr()__
: returns a "pretty-print" summary of the recordfull_path()
: returns full path to data filewrite_temp()
: writes entire contents of data file to current directory as temp fileclear_temp()
: erases temporary data filerecords()
: calls SeqIO.parse() from the Bio package (optimized for larger files)
-
RecordList
- returned byget_table()
, a quasi list/dict data type for multiple records.__repr()__
: returns a "pretty-print" summary of the tablequery("query string")
: runs query through SQL databasecolumns()
: returns a dict of column names for the database tablecolumns_summary()
: returns a "pretty-print" summary of all of the column names in the database tablesummarize()
: returns a "pretty-print" summary (averages and std. devs) of the items in the listfind_column("search string")
: returns list of columns that are "like" the passed search_strsave()
: updates database with newly set valueswrite_records("out.fna")
: save all sequences in db view (e.g. result ofquery()
) to filewrite_tsv("out.tsv")
: save all data for current db view (e.g. result ofquery()
) to file- "quasi list/dict data type"
- Can iterate over a
RecordList
object - Can get
len()
ofRecordList
- e.g. how many records were returned byquery()
- Can return
keys()
,values()
, anditems()
- Can return records by index or by id
- Can iterate over a
update(self, data=None, directory_name="None", silent=False, integrity_cancel=False)
- Creates new database table schema and updates current values in database by referencing a an
UpdateData
data object
- Creates new database table schema and updates current values in database by referencing a an
-
UpdateData
- build a dataset with simple variable/attribute assignments- See the blog post for more info
Usage Best Practices
SQLAlchemy
- BioMetaDB uses this powerful library to generate and use databases, in this case SQLite3.
- SQLAlchemy is a powerful Python package that makes working with SQL easy and fairly uniform between relational database management systems (RdatabaseMSs).
- Read more about SQLAlchemy, its functionality, and its limitations: https://docs.sqlalchemy.org/en/latest/orm/tutorial.html
Data file integrity
- BioMetaDB uses .tsv files to generate names and data types of columns in the RdatabaseMS tables.
- Before running this software, ensure that data file column names are adequately formatted, with no trailing
\n
or\t
in the file.- Columns named
Column.1
andcolumn_1
will both override tocolumn_1
.
- Columns named
- Data types are automatically predicted based on a randomly selected value within the column.
- Ensure that all values in a given column store the same kind of data (e.g. Integer, Float, VARCHAR, etc).
- BioMetaDB matches names of data files (e.g. file_1.fasta) to record ids in the .tsv files, so make sure that they match.
Modifying project structure or source code
- BioMetaDB relies on the integrity of this file system that it creates when
dbdm INIT
is first called. It is ill-advised to manually delete files or records from the database or project, respectively. Updating records usingtable.save()
is the preferred method for updating database table values within code, but records themselves should only be removed usingdbdm DELETE
- Certain portions of the source code can be modified to better fit a user's needs.
BioMetaDB/Models/functions.py
holds the classRecord
, which is the parent class of all database objects generated usingget_table()
. Any additional functions added to this class will be available to your records.BioMetaDB/DataStructures/record_list.pyx
holds the classRecordList
, which is a simple container class for handling database records. Currently, this class functions as a quasi-dictionary-and-list data structure.BioMetaDB/Accessories/update_data.py
provides a dynamic data structure for storing information as object attributes and writing the result to file.
dbdm
dbdm is the primary method for making large-scale changes to the database schema or data files. Use this program when first creating your database or when making large updates, such as when adding/removing tables or updating/removing values or columns.
usage: dbdm.py [-h] [-n DB_NAME] [-t TABLE_NAME] [-d DIRECTORY_NAME]
[-f DATA_FILE] [-l LIST_FILE] [-c CONFIG_FILE] [-a ALIAS] [-s]
[-v VIEW] [-q QUERY] [-u UNIQUE] [-i] [-w WRITE] [-x WRITE_TSV]
[-p PATH] [-r]
program
dbdm: Manage BioMetaDB project
Available Programs:
CREATE: Create a new table in an existing database, optionally populate using data files
(Req: --config_file --table_name --directory_name --data_file --alias --silent --integrity_cancel)
DELETE: Delete list of ids from database tables, remove associated files
(Req: --config_file --table_name --list_file --alias --silent --integrity_cancel)
FIX: Repairs errors in DB structure using .fix file
(Req: --config_file --data_file --silent --integrity_cancel)
INIT: Initialize database with starting table, fasta directory, and/or data files
(Req: --db_name --table_name --directory_name --data_file --alias --silent --integrity_cancel)
INTEGRITY: Queries project database and structure to generate .fix file for possible issues
(Req: --config_file --table_name --alias --silent)
MOVE: Move project to new location
(Req: --config_file --path --integrity_cancel --silent)
REMOVE: Remove table and all associated data from database
(Req: --config_file --table_name --alias --silent --integrity_cancel)
REMOVECOL: Remove column list (including data) from table
(Req: --config_file --table_name --list_file --alias --silent --integrity_cancel)
SUMMARIZE: Summarize project and query data. Write records or metadata to file
(Req: --config_file --view --query --table_name --alias --write --write_tsv --unique --truncate)
UPDATE: Update values in existing table or add new sequences
(Req: --config_file --table_name --directory_name --data_file --alias --silent --integrity_cancel)
positional arguments:
program Program to run
optional arguments:
-h, --help show this help message and exit
-n DB_NAME, --db_name DB_NAME
Name of database
-t TABLE_NAME, --table_name TABLE_NAME
Name of database table
-d DIRECTORY_NAME, --directory_name DIRECTORY_NAME
Directory path with bio data (fasta or fastq)
-f DATA_FILE, --data_file DATA_FILE
.tsv or .csv file to add, or .fix file to integrate
-l LIST_FILE, --list_file LIST_FILE
File with list of items, typically ids or column names
-c CONFIG_FILE, --config_file CONFIG_FILE
/path/to/BioMetaDB-project-directory - Can omit if no other BioMetaDB project present
-a ALIAS, --alias ALIAS
Provide alias for locating and creating table class
-s, --silent Silence all standard output (Standard error still displays to screen)
-v VIEW, --view VIEW View (c)olumns or (t)ables with SUMMARIZE
-q QUERY, --query QUERY
evaluation ~> genome; function -> gen; eval >> fxn; eval ~> fxn -> gen;
-u UNIQUE, --unique UNIQUE
View unique values of column using SUMMARIZE
-i, --integrity_cancel
Cancel integrity check
-w WRITE, --write WRITE
Write fastx records from SUMMARIZE to outfile
-x WRITE_TSV, --write_tsv WRITE_TSV
Write table record metadata from SUMMARIZE to outfile
-p PATH, --path PATH New path for moving project in MOVE command
-r, --truncate Return only ID of annotation and not the complete description, excluding Prokka, default False
The typical workflow for initializing BioMetaDB is straightforward. Many options exist for updating, adding to,
and removing from the existing project structure. You may also choose to assign an alternate name to a table within a
database in the form of an alias. Depending on the number of entries involved, these commands can result in voluminous
output, which can either be silenced by passing -s
as parameters in your command, or by redirecting standard output
to a text file.
After every major change to the project, dbdm
will run an integrity check. This default can be cancelled by passing -i
with each command. However, integrity checks are optimized and useful, so users are advised not to cancel standard checks.
Currently, the following issues are identified in a tab-delimited .fix
file with each integrity check:
- PROJECT INVALID_PATH: working_dir path stored in config file is invalid
- RECORD BAD_TYPE: data type for record could not be determined
- RECORD BAD_LOCATION: stored file path for record does not exist
- FILE BAD_RECORD: file exists in project directory with no assigned database id
More information on using this .fix
file to repair issues can be found in the generated .fix
file, in the Accessory Scripts
section in this README, and in the provided Example/
directory.
INIT
INIT is used to create a new database project for the first time. This will create the project-level directory and all subdirectories, will copy data files to the new project system, and will commit all information from a provided .tsv file as the database table class schema. Notice that original file permission are maintained for any files copied to this project.
- Required flags
- --db_name (-n): Name to assign database (for example, Metagenomes)
- --table_name (-t): Name to give table in database (for example, Genomic)
- Optional flags
- --directory_name (-d): Path to directory with files to add to the database.
- --data_file (-f): Path to .tsv file to use to generate database table schema
- --alias (-a): Alias (short name) for table in database
- --silent (-s): Silence standard output
- --integrity_cancel (-i): Cancel integrity check
- Example
dbdm INIT -n database -t db_table_1 -d /path/to/data/directory/ -f /path/to/data_file.tsv -a table_1
- This command will generate the following project structure and assign an alias name to the table:
database/
classes/
db_table_1.json # Generated from data_file.tsv
config/
database.ini
db/
db_table_1/
# Files from /path/to/data/directory/
migrations/
At this point, the project is available to use in scripts in the manner mentioned in the About section.
UPDATE
UPDATE is used to add columns to an existing table and to add or modify large numbers of entries in the database.
This command creates and stores a deep copy of the existing database, updates the existing columns to include new
columns from a provided .tsv file, and adds new data files from a provided directory. The deep copy is saved in the
project-level migrations
folder.
- Required flags
- --config_file (-c): Path to project directory
- --table_name (-t): Name of table to update
- Optional flags
- --directory_name (-d):- Path to directory with files to add to the database.
- --data_file (-f): Path to .tsv file to use to generate database table schema
- --alias (-a): Alias (short name) for table in database
- --silent (-s): Silence standard output
- --integrity_cancel (-i): Cancel integrity check
- Example
dbdm UPDATE -c /path/to/database -t db_table_1 -f /path/to/data_file.tsv -d /path/to/data_dir/
- This command will update the table named
db_table_1
, which is part of the database in the projectdatabase
. This command will add new column data from the file/path/to/data_file.tsv
and will add new data files from the directory/path/to/data_dir/
. A copy of the database is stored in the project-directorymigrations
, named using the date that the command was run.
REMOVECOL
REMOVECOL is used to remove columns from a table within the database. All associated data is deleted, and the table is regenerated using existing data.
- Required flags
- --config_file (-c): Path to project directory
- --table_name (-t): Name of table to update
- --list_file (-l): Path to file with list of column names to remove from database table
- Optional flags
- --alias (-a): Alias (short name) for table in database
- --silent (-s): Silence standard output
- --integrity_cancel (-i): Cancel integrity check
- Example
dbdm REMOVECOL -c /path/to/database -t db_table_1 -l /path/to/list_to_remove.list
- This command will remove all columns provided in the newline-separated file
list_to_remove.list
fromdb_table_1
within the projectdatabase
.
- Possible exceptions
- ListFileNotProvidedError: List file not provided, pass path to a list
DELETE
DELETE removes records, and their associated files, from a table in the database.
- Required flags
- --config_file (-c): Path to project directory
- --table_name (-t): Name of table to update
- --list_file (-l): Path to file with list of record ids to remove from database table
- Optional flags
- --alias (-a): Alias (short name) for table in database
- --silent (-s): Silence standard output
- --integrity_cancel (-i): Cancel integrity check
- Example
dbdm REMOVECOL -c /path/to/database -t db_table_1 -l /path/to/list_to_remove.list
- This command will remove all record ids provided in the newline-separated file
list_to_remove.list
fromdb_table_1
within the projectdatabase
.
- Possible exceptions
- ListFileNotProvidedError: List file not provided, pass path to a list
CREATE
CREATE is used to add a new table to an existing project structure and database. This command updates an existing project structure.
- Required flags
- --config_file (-c): Path to project directory
- --table_name (-t): Name of table to create
- Optional flags
- --directory_name (-d): Path to directory with files to add to the database.
- --data_file (-f): Path to .tsv file to use to generate database table schema
- --alias (-a): Alias (short name) for table in database
- --silent (-s): Silence standard output
- --integrity_cancel (-i): Cancel integrity check
- Example
dbdm CREATE -c /path/to/database -t db_table_2 -d /path/to/data/directory/ -f /path/to/data_file.tsv -a table_2
- This command will update the project structure
database
to include the new table:
database/
classes/
db_table_1.json
db_table_2.json # Generated from data_file.tsv
config/
database.ini
db/
db_table_1/
db_table_2/
# Files from /path/to/data/directory/
migrations/
REMOVE
REMOVE is used to remove a table from an existing database, including all associated records and files. As this
command cannot be undone, you are recommended to run UPDATE
with only the required flags to store a migration copy of
your data as a .tsv file.
- Required flags
- --config_file (-c): Path to project directory
- --table_name (-t): Name of table to remove
- Optional flags
- --alias (-a): Alias (short name) for table in database
- --silent (-s): Silence standard output
- --integrity_cancel (-i): Cancel integrity check
- Example
dbdm REMOVE -c /path/to/database -t db_table_2
- This command will remove the table
db_table_2
from the projectdatabase
.
MOVE
MOVE is used to move an entire project to a different location on the user's filesystem.
- Required flags
- --config_file (-c): Path to project directory
- --path (-p): New project path (must exist)
- Optional flags
- --silent (-s): Silence standard output
- --integrity_cancel (-i): Cancel integrity check
SUMMARIZE
SUMMARIZE is used to display information about the project or a table within the project. This command can also run queries to the database and display summary data on only these selected records. See this blog for more info on query statements.
- Required flags
- --config_file (-c): Path to project directory
- Optional flags
- --table_name (-t): Name of table to summarize
- --alias (-a): Alias (short name) for table in database
- --view (-v): View (c)olumns or (t)ables
- --truncate (-r): Return only ID of annotation and not the complete description, excluding Prokka
- --query (-q): Attach SQL query to summarize select records only. Must be combined with
-t
or-a
flags - --write (-w): Write records in SQL query, or entire table, to directory. Must pass with
-t
or-a
flags - --write_tsv (-x): Write metadata for record in table to file. Must pass with
-t
or-a
flags
- Examples
dbdm SUMMARIZE -c /path/to/database
- This command will summarize all tables in the database. Per table, this command displays the number of records as well as averages and standard deviations for each column.
dbdm SUMMARIZE -t table_name -w output
ordbdm SUMMARIZE -t table_name -x out.tsv
- The first command writes all records in the table
table_name
tooutput/
, whereas the second command writes record metadata fromtable_name
toout.tsv
. dbdm SUMMARIZE -t table_name -v c
ordbdm SUMMARIZE -v t
- These commands display the columns in
table_name
and the names of all tables in the project, respectively. - See
Examples/README.md
for using the query option. - See the blog post for more information.
INTEGRITY
INTEGRITY checks the project structure and individual records for missing data files, improperly determined data types,
and errors in the project structure. INTEGRITY generates a .fix
file that lists all issues with default settings to
either do nothing or delete the record that caused the issue. Users can manually edit the file, or use
BioMetaDB/AccessoryScripts/edit_fix_file.py
to quickly set values for many records. See the header in a generated .fix
file for instructions on setting values.
- Required flags
- --config_file (-c): Path to project directory
- Example
dbdm INTEGRITY -c /path/to/database
- This command will identify all integrity issues in the project
database
and will generate a########.###.fix
file.
FIX
FIX uses the output .fix
file generated in INTEGRITY to repair database issues. See the header in a generated .fix
file for instructions on setting values.
- Required flags
- --config_file (-c): Path to project directory
- --data_file (-f): Path to
.fix
file generated bydbdm INTEGRITY
.
- Optional flags
- --silent (-s): Silence standard output
- --integrity_cancel (-i): Cancel integrity check
- Example
dbdm FIX -c /path/to/database -f /path/to/########.###.fix
- This command will use
########.###.fix
to change values in records indatabase
.
Other Information
Accessory scripts
BioMetaBD/AccessoryScripts/edit_fix_file.py
- Allows users to make multiple changes to a
.fix
file with a single command. - This script will look for issues that match all user-passed filtering criteria and will replace the default fix value
with the user-passed value. Users can reference other fields in ISSUE (see the header in a generated
.fix
file). - Issues are written as:
ISSUE DATA_TYPE: ID[: LOCATION] ISSUE_TYPE FIX_TYPE[ FIX_DATA] ---
- Required flags
- --match (-m): Comma-separated list of strings that must be found in issue to make edit
- --replace (-r): Comma-separated list of strings with FIX_TYPE and value to use
- --fix_file (-f):
.fix
file output bydbdm INTEGRITY
- Example
edit_fix_file.py -m RECORD,BAD_TYPE,table_name -r SET,fasta -f /path/to/########.###.fix
- This command will replace the default fix value for issues with descriptions matching
RECORD
,BAD_TYPE
, andtable_name
. This fix value will be replaced withSET fasta
.
- Another (more complex) example
edit_fix_file.py -m RECORD,BAD_LOCATION,genomic -r "FILE,/path/to/<ID>.fna" -f /path/to/########.###.fix
- This command will replace the file locations for records with issues in the genomic table. It will replace
the default fix value with a file location whose value is created using the ID field of that particular issue.
Notice that the argument passed with
-r
is surrounded by double-quotes.
- Allows users to make multiple changes to a
Contact
This project was designed and written by Christopher Neely. Email me anytime at christopher.neely1200@gmail.com
.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file BioMetaDB-0.1.3.0.tar.gz
.
File metadata
- Download URL: BioMetaDB-0.1.3.0.tar.gz
- Upload date:
- Size: 733.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9edd19585106d4a31ee04fad3b6576a3f4557c3e1b9efd1fe5d8c1136c486271 |
|
MD5 | 307122677d807b41379c30594452d7ce |
|
BLAKE2b-256 | dd9bfbd18a41fa7f4c6f3af9437dca5f380c9ec80809d88a6d7eac09e076f354 |