Skip to main content

Package for parsing a tab-delimited data file and loading the lines into SQLite3 database

Project description

tsv2sqlite

Package for parsing a tab-delimited data file and loading the lines into SQLite3 database

Installation

pip install tsv2sqlite

tsv2sqlite

The software will automatically assign datatype TEXT for all columns. The target SQLite3 database file name can be specified on invocation via --database_file. The default database file name will be the basename of the input file with the filename extension replaced with .sqlite.

The target table name can be specified on invocation via --table_name. The default table name will be the basename of the input file with the filename extension removed.

The configuration file can be used to specify the datatype for each of the columns. The configuration file can be used to specify which columns should be excluded.

Configure column datatypes

table_schema:
  sample:
    CHROM: str
    POS: int
    ID: str
    REF: str
    ALT: str
    QUAL: int
    FILTER: str
    INFO: str

Configure column exclusion list

ignore_columns:
  - FILTER
  - INFO

add-header-row

The software will write a copy of the input tab-delimited file that does not have a header row. The inserted header row will have column names: - col_1 - col_2 - col_3 - etc.

Edit configuration file

Edit your configuration file e.g.: tsv2sqlite/conf/config.yaml.

No column mapping mode

Run tsv2sqlite with --no_column_mapping option.

tsv2sqlite --infile ~/projects/tsv2sqlite/sample.tsv --config_file ~/projects/tsv2sqlite/tsv2sqlite/conf/config.yaml --no_column_mapping
--outdir was not specified and therefore was set to '/tmp/tsv2sqlite/2023-12-22-102420'
Created output directory '/tmp/tsv2sqlite/2023-12-22-102420'
--logfile was not specified and therefore was set to '/tmp/tsv2sqlite/2023-12-22-102420/main.log'
--database_file was not specified and therefore was set to '/home/sundaram/projects/tsv2sqlite/sample.tsv.sqlite3'
The log file is '/tmp/tsv2sqlite/2023-12-22-102420/main.log'
Execution of '/home/sundaram/projects/tsv2sqlite/venv/lib/python3.10/site-packages/tsv2sqlite/main.py' completed

Observe the SQLite3 database file created.

(venv)   tsv2sqlite git:(main)  ls -ltr sample.tsv.sqlite3
-rw-r--r-- 1 sundaram sundaram 12288 Dec 22 10:24 sample.tsv.sqlite3

Connect to the SQLite3 database.

(venv)   tsv2sqlite git:(main)  sqlite3 sample.tsv.sqlite3
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite>

Observe the two tables created.

sqlite> .schema
CREATE TABLE provenance (
            bytesize INTEGER NOT NULL,
            abspath TEXT NOT NULL,
            md5checksum TEXT NOT NULL,
            date_created TEXT NOT NULL
        );
CREATE TABLE sample (CHROM TEXT,
POS TEXT,
ID TEXT,
REF TEXT,
ALT TEXT,
QUAL TEXT,
line_number INTEGER NOT NULL);
sqlite>

Execute queries against both tables.

sqlite> select * from provenance;
776|/home/sundaram/projects/tsv2sqlite/sample.tsv|786b82b2414d3acf7af34c068e358759|2023-11-12 21:31:17.307717
sqlite> select * from sample;
1|12345|rs567|A|G|50|3
2|56789|rs890|T|C|44|4
3|98765|rs123|G|T|60|5
1|34567|rs456|C|A|55|6
sqlite>

Column mapping mode

Column mapping was implemented to address a particular use case and may be deprecated in the near future.

To use the exported script for parsing a tab-delimited file to be loaded into a SQLite3 database instance:

 tsv2sqlite --infile sample.tsv
--config_file was not specified and therefore was set to '/home/sundaram/projects/tsv2sqlite/venv/lib/python3.10/site-packages/tsv2sqlite/conf/config.yaml'
--outdir was not specified and therefore was set to '/tmp/tsv2sqlite/2023-11-13-053537'
Created output directory '/tmp/tsv2sqlite/2023-11-13-053537'
--logfile was not specified and therefore was set to '/tmp/tsv2sqlite/2023-11-13-053537/main.log'
--database_file was not specified and therefore was set to 'sample.tsv.sqlite3'
The log file is '/tmp/tsv2sqlite/2023-11-13-053537/main.log'
Execution of '/home/sundaram/projects/tsv2sqlite/venv/lib/python3.10/site-packages/tsv2sqlite/main.py' completed

Execute SQL queries

Connect to the SQLite3 database:

sqlite3 sample.tsv.sqlite3

Inspect the file metadata:

sqlite> .schema provenance
CREATE TABLE provenance (
            bytesize INTEGER NOT NULL,
            abspath TEXT NOT NULL,
            md5checksum TEXT NOT NULL,
            date_created TEXT NOT NULL
        );
sqlite> select * from provenance;
776|/home/sundaram/projects/tsv2sqlite/sample.tsv|786b82b2414d3acf7af34c068e358759|2023-11-12 21:31:17.307717

Inspect the column mappings:

sqlite> .schema columnmaps
CREATE TABLE columnmaps (
            column_num INTEGER NOT NULL PRIMARY KEY,
            name TEXT NOT NULL,
            norm_name TEXT NOT NULL,
            UNIQUE (name, norm_name)
        );
sqlite>

sqlite> select * from columnmaps;
0|#CHROM|_CHROM
1|POS|POS
2|ID|ID
3|REF|REF
4|ALT|ALT
5|QUAL|QUAL
sqlite>

Inspect the records table:

sqlite> .schema records
CREATE TABLE records (
            line_num INTEGER NOT NULL,
            column_num INTEGER NOT NULL,
            value TEXT NOT NULL,
            UNIQUE (line_num, column_num),
            FOREIGN KEY (column_num) REFERENCES columnmaps (column_num)
        );
sqlite> select * from records;
3|0|1
3|1|12345
3|2|rs567
3|3|A
3|4|G
3|5|50
4|0|2
4|1|56789
4|2|rs890
4|3|T
4|4|C
4|5|44
5|0|3
5|1|98765
5|2|rs123
5|3|G
5|4|T
5|5|60
6|0|1
6|1|34567
6|2|rs456
6|3|C
6|4|A
6|5|55
sqlite>

Get the ID values for each line:

sqlite> select r_id.value, r_id.line_num
from records r_id, columnmaps c_id
where c_id.name = "ID"
and c_id.column_num = r_id.column_num;
rs567|3
rs890|4
rs123|5
rs456|6
sqlite>

Get the ID, REF and line number:

sqlite> select r_id.value, r_ref.value, r_ref.line_num
from records r_id, records r_ref, columnmaps c_id, columnmaps c_ref
where c_ref.name = "REF"
and c_id.name = "ID"
and c_id.column_num = r_id.column_num
and c_ref.column_num = r_ref.column_num
and r_id.line_num = r_ref.line_num;
rs567|A|3
rs890|T|4
rs123|G|5
rs456|C|6
sqlite>

History

0.1.0 (2023-11-12)

  • First release on PyPI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tsv2sqlite-0.5.1.tar.gz (19.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tsv2sqlite-0.5.1-py2.py3-none-any.whl (16.6 kB view details)

Uploaded Python 2Python 3

File details

Details for the file tsv2sqlite-0.5.1.tar.gz.

File metadata

  • Download URL: tsv2sqlite-0.5.1.tar.gz
  • Upload date:
  • Size: 19.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for tsv2sqlite-0.5.1.tar.gz
Algorithm Hash digest
SHA256 b108c0fa41da05d753346466d62ebc51dbcbac2056b08e9f8dc9fa784fa8f26e
MD5 e03fc1d27480a48122a1ae1b13372074
BLAKE2b-256 245f39206a584ed984c12fc509cac980d0893d59af9d611c756f4f2f27450136

See more details on using hashes here.

File details

Details for the file tsv2sqlite-0.5.1-py2.py3-none-any.whl.

File metadata

  • Download URL: tsv2sqlite-0.5.1-py2.py3-none-any.whl
  • Upload date:
  • Size: 16.6 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for tsv2sqlite-0.5.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 ae059f300662edd7a940ea441f2f118a35952cebc1ef7e8dda298d067b804632
MD5 1b3af20e60a3682551065e599879d34d
BLAKE2b-256 36c722187199eda39a9472d98aef8a5436c05b21bd6d18e81a2bb8d6b14bd6ea

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page