Utility scripts to work with ffindex database files.

ffdb

Scripts for manipulating ffindex databases.

FFINDEX is a really neat way to work with many (millions of) homogeneous files: it avoids filesystem penalties and lets you run tasks on these files in parallel. From my perspective, it lacks some utilities that would make it great for running pipelines, especially checkpointing (e.g. for long-running tasks) and fold/reduce-like tasks (e.g. collecting many csv files into one final csv file).

These scripts are really just to help make up for those shortfalls. They aren't necessarily performant (we don't do any fancy memory mapping etc.), so if you can find another tool to do it, let me know!

Usage

ffdb is implemented as a single executable, ffdb, with multiple subcommands.

ffdb split

Splits an existing ffindex database into a number of partitions.

ffdb split \
  --size 10000 \
  --basename "subdb_{index}.{ext}" \
  my.ffdata \
  my.ffindex

This would create files subdb_0.ffdata, subdb_0.ffindex, subdb_1.ffdata, ... with each subdb containing 10000 files from the original database.
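To make the partitioning concrete, here is a minimal Python sketch of the idea rather than ffdb's actual code. The function names are hypothetical, and it assumes that the basename template is filled in the way Python's str.format would fill it and that the final partition simply holds whatever is left over.

```python
from itertools import islice


def partition_names(names, size):
    """Yield successive lists of at most `size` document names."""
    it = iter(names)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def partition_filenames(basename, n_partitions):
    """Expand the basename template into (ffdata, ffindex) filename pairs.

    Assumption: {index} counts from 0 and {ext} is "ffdata" or "ffindex".
    """
    return [
        (
            basename.format(index=i, ext="ffdata"),
            basename.format(index=i, ext="ffindex"),
        )
        for i in range(n_partitions)
    ]


# 25 hypothetical documents split with a size of 10.
names = [f"doc{i}" for i in range(25)]
chunks = list(partition_names(names, 10))
files = partition_filenames("subdb_{index}.{ext}", len(chunks))
```

Under these assumptions, 25 documents with a size of 10 give three partitions, the last one holding the remaining 5.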

ffdb combine

Collects multiple ffindex databases into a single one.

ffdb combine \
  -d out.ffdata \
  -i out.ffindex \
  subdb_*.{ffdata,ffindex}

This will combine all subdbs matching the shell expansion into a single database. The order of the .ffdata and .ffindex files to be combined is important: provide a (space-separated) list of .ffdata files followed by a list of .ffindex files, with both lists in the same order.

It is designed so that the combination glob/brace expansion pattern used in the example will work correctly.

Otherwise you could write them out explicitly...

ffdb combine \
  -d out.ffdata \
  -i out.ffindex \
  subdb_0.ffdata subdb_1.ffdata subdb_2.ffdata \
  subdb_0.ffindex subdb_1.ffindex subdb_2.ffindex
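The glob/brace pattern produces this ordering because the shell performs brace expansion before filename expansion: subdb_*.{ffdata,ffindex} first becomes subdb_*.ffdata subdb_*.ffindex, and each glob then expands in sorted order. The Python sketch below shows one way such an argument list could be paired up. It is an illustration only; pair_databases is a hypothetical helper, and how ffdb itself parses its arguments is not shown here.

```python
def pair_databases(paths):
    """Pair the first half of `paths` (.ffdata) with the second half (.ffindex)."""
    if len(paths) % 2 != 0:
        raise ValueError("expected the same number of .ffdata and .ffindex files")

    half = len(paths) // 2
    data, index = paths[:half], paths[half:]

    # Guard against interleaved or swapped arguments.
    for d, i in zip(data, index):
        if not d.endswith(".ffdata") or not i.endswith(".ffindex"):
            raise ValueError(f"unexpected ordering: {d!r} paired with {i!r}")

    return list(zip(data, index))


# The argument list as it appears after shell expansion.
args = [
    "subdb_0.ffdata", "subdb_1.ffdata", "subdb_2.ffdata",
    "subdb_0.ffindex", "subdb_1.ffindex", "subdb_2.ffindex",
]
pairs = pair_databases(args)
```

Each data file ends up matched with the index file in the same position of the second list, which is why the two lists need to be in the same order.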

ffdb fasta

Creates an ffindex database from a fasta file with each document having a specified number of fasta records in it. Note that ffindex does have a tool to read a fasta in with one sequence per database document.

ffdb fasta \
  -d out.ffdata \
  -i out.ffindex \
  --size 10000 \
  my.fasta

This would create a new database where each "file" within the database has 10000 sequences (except the last, which will have the remainder).
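The grouping can be sketched in Python as follows. This is a simplified illustration and not ffdb's implementation: both function names are hypothetical, and it assumes a record is a header line starting with ">" plus every line up to the next header.

```python
def read_fasta_records(lines):
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def chunk_records(records, size):
    """Yield documents containing `size` records each; the last may be smaller."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == size:
            yield "".join(chunk)
            chunk = []
    if chunk:
        yield "".join(chunk)


# Three records grouped two per document.
fasta = ">a\nAC\nGT\n>b\nTT\n>c\nGG\n".splitlines(keepends=True)
docs = list(chunk_records(read_fasta_records(fasta), 2))
```

Here the first document holds records a and b, and the second holds the remaining record c.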

ffdb collect

Collects multiple documents in an ffindex database into a single file. Essentially it just filters out null-bytes and makes sure there's a newline between documents. It can also optionally skip the first few lines of each document, e.g. for collecting csv files.

ffdb collect \
  --trim 1 \
  csv.ffdata \
  csv.ffindex \
> out.csv

This essentially just cats all files together excluding the first line from each file. Collect can also take multiple databases using the same glob pattern as used in ffdb combine.

ffdb collect many_gffs_*.{ffdata,ffindex} > out.gff3
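As a rough illustration of the behaviour described above, the Python sketch below works on documents already loaded as bytes. The collect function here is a hypothetical stand-in, not the real command, and it takes "skip the first few lines of each document" literally.

```python
def collect(documents, trim=0):
    """Concatenate documents, dropping null bytes and the first `trim` lines of each."""
    out = []
    for doc in documents:
        text = doc.replace(b"\0", b"")
        lines = text.splitlines(keepends=True)[trim:]
        body = b"".join(lines)
        # Make sure the next document starts on a fresh line.
        if body and not body.endswith(b"\n"):
            body += b"\n"
        out.append(body)
    return b"".join(out)


# Two small csv documents, each carrying a header row and a null byte.
docs = [b"id,score\na,1\n\0", b"id,score\nb,2\0"]
merged = collect(docs, trim=1)
```

Under this reading the header row is dropped from every document, including the first, so the collected output contains data rows only.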

ffdb join_concat

Joins multiple ffindex databases, concatenating documents based on shared names in the index. It is a full outer join, meaning that documents present in the first database but not the second are included, and vice versa. It can join multiple databases at once. Documents are concatenated with a single newline separator.

ffdb join_concat \
  -d joined.ffdata \
  -i joined.ffindex \
  msa.ffdata enriched_msa.ffdata \
  msa.ffindex enriched_msa.ffindex

My primary use-case for this was for working with MMSeqs results, where I wanted to enrich a profile and get the full multiple sequence alignment (including the profile MSA). mmseqs result2msa only prints the consensus of the profile.

Ordering for multiple databases is the same as for ffdb combine.
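The join semantics can be sketched in Python with plain dictionaries standing in for databases, mapping each index name to its document. This is an assumption-laden illustration (the function and the sample data are hypothetical), not the way ffdb reads or writes its files.

```python
def join_concat(*databases):
    """Full outer join of name -> document mappings.

    Documents sharing a name are concatenated in argument order,
    separated by a single newline; unmatched documents are kept as-is.
    """
    joined = {}
    for db in databases:
        for name, doc in db.items():
            if name in joined:
                joined[name] = joined[name] + b"\n" + doc
            else:
                joined[name] = doc
    return joined


# Hypothetical contents: "p2" appears in both databases.
msa = {"p1": b">p1\nACGT", "p2": b">p2\nTTTT"}
enriched = {"p2": b">x\nTTTA", "p3": b">p3\nGGGG"}
joined = join_concat(msa, enriched)
```

In this example p1 and p3 pass through unchanged, while p2 becomes the first database's document followed by a newline and the second database's document.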


