Package for parsing a tab-delimited data file and loading the lines into SQLite3 database
Project description
tsv2sqlite
Package for parsing a tab-delimited data file and loading the lines into SQLite3 database
Installation
pip install tsv2sqlite
tsv2sqlite
The software will automatically assign datatype TEXT for all columns. The target SQLite3 database file name can be specified on invocation via --database_file. The default database file name will be the basename of the input file with the filename extension replaced with .sqlite.
The target table name can be specified on invocation via --table_name. The default table name will be the basename of the input file with the filename extension removed.
The configuration file can be used to specify the datatype for each of the columns. The configuration file can be used to specify which columns should be excluded.
Configure column datatypes
table_schema:
sample:
CHROM: str
POS: int
ID: str
REF: str
ALT: str
QUAL: int
FILTER: str
INFO: str
Configure column exclusion list
ignore_columns:
- FILTER
- INFO
add-header-row
The software will write a copy of the input tab-delimited file that does not have a header row. The inserted header row will have column names: - col_1 - col_2 - col_3 - etc.
Edit configuration file
Edit your configuration file e.g.: tsv2sqlite/conf/config.yaml.
No column mapping mode
Run tsv2sqlite with --no_column_mapping option.
tsv2sqlite --infile ~/projects/tsv2sqlite/sample.tsv --config_file ~/projects/tsv2sqlite/tsv2sqlite/conf/config.yaml --no_column_mapping
--outdir was not specified and therefore was set to '/tmp/tsv2sqlite/2023-12-22-102420'
Created output directory '/tmp/tsv2sqlite/2023-12-22-102420'
--logfile was not specified and therefore was set to '/tmp/tsv2sqlite/2023-12-22-102420/main.log'
--database_file was not specified and therefore was set to '/home/sundaram/projects/tsv2sqlite/sample.tsv.sqlite3'
The log file is '/tmp/tsv2sqlite/2023-12-22-102420/main.log'
Execution of '/home/sundaram/projects/tsv2sqlite/venv/lib/python3.10/site-packages/tsv2sqlite/main.py' completed
Observe the SQLite3 database file created.
(venv) ➜ tsv2sqlite git:(main) ✗ ls -ltr sample.tsv.sqlite3
-rw-r--r-- 1 sundaram sundaram 12288 Dec 22 10:24 sample.tsv.sqlite3
Connect to the SQLite3 database.
(venv) ➜ tsv2sqlite git:(main) ✗ sqlite3 sample.tsv.sqlite3
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite>
Observe the two tables created.
sqlite> .schema
CREATE TABLE provenance (
bytesize INTEGER NOT NULL,
abspath TEXT NOT NULL,
md5checksum TEXT NOT NULL,
date_created TEXT NOT NULL
);
CREATE TABLE sample (CHROM TEXT,
POS TEXT,
ID TEXT,
REF TEXT,
ALT TEXT,
QUAL TEXT,
line_number INTEGER NOT NULL);
sqlite>
Execute queries against both tables.
sqlite> select * from provenance;
776|/home/sundaram/projects/tsv2sqlite/sample.tsv|786b82b2414d3acf7af34c068e358759|2023-11-12 21:31:17.307717
sqlite> select * from sample;
1|12345|rs567|A|G|50|3
2|56789|rs890|T|C|44|4
3|98765|rs123|G|T|60|5
1|34567|rs456|C|A|55|6
sqlite>
Column mapping mode
Column mapping was implemented to address a particular use case and may be deprecated in the near future.
To use the exported script for parsing a tab-delimited file to be loaded into a SQLite3 database instance:
tsv2sqlite --infile sample.tsv
--config_file was not specified and therefore was set to '/home/sundaram/projects/tsv2sqlite/venv/lib/python3.10/site-packages/tsv2sqlite/conf/config.yaml'
--outdir was not specified and therefore was set to '/tmp/tsv2sqlite/2023-11-13-053537'
Created output directory '/tmp/tsv2sqlite/2023-11-13-053537'
--logfile was not specified and therefore was set to '/tmp/tsv2sqlite/2023-11-13-053537/main.log'
--database_file was not specified and therefore was set to 'sample.tsv.sqlite3'
The log file is '/tmp/tsv2sqlite/2023-11-13-053537/main.log'
Execution of '/home/sundaram/projects/tsv2sqlite/venv/lib/python3.10/site-packages/tsv2sqlite/main.py' completed
Execute SQL queries
Connect to the SQLite3 database:
sqlite3 sample.tsv.sqlite3
Inspect the file metadata:
sqlite> .schema provenance
CREATE TABLE provenance (
bytesize INTEGER NOT NULL,
abspath TEXT NOT NULL,
md5checksum TEXT NOT NULL,
date_created TEXT NOT NULL
);
sqlite> select * from provenance;
776|/home/sundaram/projects/tsv2sqlite/sample.tsv|786b82b2414d3acf7af34c068e358759|2023-11-12 21:31:17.307717
Inspect the column mappings:
sqlite> .schema columnmaps
CREATE TABLE columnmaps (
column_num INTEGER NOT NULL PRIMARY KEY,
name TEXT NOT NULL,
norm_name TEXT NOT NULL,
UNIQUE (name, norm_name)
);
sqlite>
sqlite> select * from columnmaps;
0|#CHROM|_CHROM
1|POS|POS
2|ID|ID
3|REF|REF
4|ALT|ALT
5|QUAL|QUAL
sqlite>
Inspect the records table:
sqlite> .schema records
CREATE TABLE records (
line_num INTEGER NOT NULL,
column_num INTEGER NOT NULL,
value TEXT NOT NULL,
UNIQUE (line_num, column_num),
FOREIGN KEY (column_num) REFERENCES columnmaps (column_num)
);
sqlite> select * from records;
3|0|1
3|1|12345
3|2|rs567
3|3|A
3|4|G
3|5|50
4|0|2
4|1|56789
4|2|rs890
4|3|T
4|4|C
4|5|44
5|0|3
5|1|98765
5|2|rs123
5|3|G
5|4|T
5|5|60
6|0|1
6|1|34567
6|2|rs456
6|3|C
6|4|A
6|5|55
sqlite>
Get the ID values for each line:
sqlite> select r_id.value, r_id.line_num
from records r_id, columnmaps c_id
where c_id.name = "ID"
and c_id.column_num = r_id.column_num;
rs567|3
rs890|4
rs123|5
rs456|6
sqlite>
Get the ID, REF and line number:
sqlite> select r_id.value, r_ref.value, r_ref.line_num
from records r_id, records r_ref, columnmaps c_id, columnmaps c_ref
where c_ref.name = "REF"
and c_id.name = "ID"
and c_id.column_num = r_id.column_num
and c_ref.column_num = r_ref.column_num
and r_id.line_num = r_ref.line_num;
rs567|A|3
rs890|T|4
rs123|G|5
rs456|C|6
sqlite>
History
0.1.0 (2023-11-12)
First release on PyPI.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tsv2sqlite-0.5.1.tar.gz.
File metadata
- Download URL: tsv2sqlite-0.5.1.tar.gz
- Upload date:
- Size: 19.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b108c0fa41da05d753346466d62ebc51dbcbac2056b08e9f8dc9fa784fa8f26e
|
|
| MD5 |
e03fc1d27480a48122a1ae1b13372074
|
|
| BLAKE2b-256 |
245f39206a584ed984c12fc509cac980d0893d59af9d611c756f4f2f27450136
|
File details
Details for the file tsv2sqlite-0.5.1-py2.py3-none-any.whl.
File metadata
- Download URL: tsv2sqlite-0.5.1-py2.py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ae059f300662edd7a940ea441f2f118a35952cebc1ef7e8dda298d067b804632
|
|
| MD5 |
1b3af20e60a3682551065e599879d34d
|
|
| BLAKE2b-256 |
36c722187199eda39a9472d98aef8a5436c05b21bd6d18e81a2bb8d6b14bd6ea
|