Indexer for GZIP specially built for DLIO Profiler.

DISCLAIMER

This repo is a fork of the original zindex repo at https://github.com/mattgodbolt/zindex, modified to work cohesively with DFTracer (https://github.com/LLNL/dftracer.git).

zindex creates and queries an index on a compressed, line-based text file in a time- and space-efficient way.

The itch I had

I have many multi-gigabyte gzipped text log files and I'd like to be able to find data in them by an index. There's a key on each line that a simple regex can pull out. However, finding a particular record requires zgrep, which takes ages because it has to read through gigabytes of preceding data to reach each record.

Enter zindex, which builds an index and also stores decompression checkpoints along the way, allowing lightning-fast random access. Pulling out single lines, either by line number or by an index entry, is then almost instant, even for huge files. The indices themselves are small too, typically ~10% of the compressed file size for a simple unique numeric index.
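To illustrate why checkpoints make random access cheap, here is a minimal Python sketch of the general idea. It is not zindex's implementation: zindex persists its checkpoints in the index file, whereas this sketch keeps in-memory snapshots of a zlib decompressor. The function names, chunk sizes, and sample data are all assumptions made for the illustration.

```python
import gzip
import zlib

GZIP_WBITS = 31  # tells zlib to expect a gzip header and trailer


def build_checkpoints(compressed, chunk=256, every=10_000):
    """Decompress once, snapshotting the decompressor roughly every `every` output bytes.

    Returns a list of (uncompressed_offset, compressed_offset, snapshot) tuples.
    """
    d = zlib.decompressobj(wbits=GZIP_WBITS)
    checkpoints = [(0, 0, None)]  # None means "start from a fresh decompressor"
    out_pos = 0
    for in_pos in range(0, len(compressed), chunk):
        out_pos += len(d.decompress(compressed[in_pos:in_pos + chunk]))
        if not d.eof and out_pos - checkpoints[-1][0] >= every:
            checkpoints.append((out_pos, in_pos + chunk, d.copy()))
    return checkpoints


def read_at(compressed, checkpoints, offset, length, chunk=256):
    """Return `length` uncompressed bytes starting at `offset`, resuming from a checkpoint."""
    # Checkpoints are in increasing order, so the last eligible one is the nearest.
    out_pos, in_pos, snapshot = [c for c in checkpoints if c[0] <= offset][-1]
    d = snapshot.copy() if snapshot is not None else zlib.decompressobj(wbits=GZIP_WBITS)
    skip = offset - out_pos  # bytes to discard between the checkpoint and the target
    buf = bytearray()
    while len(buf) < skip + length and in_pos < len(compressed):
        buf += d.decompress(compressed[in_pos:in_pos + chunk])
        in_pos += chunk
    return bytes(buf[skip:skip + length])


# Hypothetical sample data: 50,000 log lines compressed into one gzip stream.
data = b"".join(b"id:%d some log payload\n" % i for i in range(50_000))
compressed = gzip.compress(data)
checkpoints = build_checkpoints(compressed)
chunk_of_text = read_at(compressed, checkpoints, 600_000, 40)
```

With checkpoints in place, a read only decompresses from the nearest earlier checkpoint rather than from the start of the file, which is the property the paragraph above relies on.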

Creating an index

zindex needs to be told what part of each line constitutes the index. This can be done by a regular expression, by field, or by piping each line through an external program.

By default, zindex writes the index to file.gz.zindex when asked to index file.gz.

Example: create an index on lines matching a numeric regular expression. The capture group indicates the part that's to be indexed, and the options show each line has a unique, numeric index.

$ zindex file.gz --regex 'id:([0-9]+)' --numeric --unique
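Before indexing a large file, it can help to check what the capture group would pull out of a few lines. The Python sketch below does that for the pattern above. The sample lines are made up, and zindex's regex engine is not Python's `re`, so treat this as a sanity check that should agree for a simple pattern like this one rather than a guarantee.

```python
import re

# Same pattern as passed to --regex above; group 1 is the part to be indexed.
KEY_PATTERN = re.compile(r"id:([0-9]+)")


def extract_key(line):
    """Return the captured key as an int, or None if the line has no match."""
    match = KEY_PATTERN.search(line)
    return int(match.group(1)) if match else None


# Hypothetical log lines, for illustration only.
sample_lines = [
    "2024-01-01 id:1023 status=ok",
    "2024-01-01 heartbeat",
    "2024-01-02 id:4443 status=fail",
]
keys = [extract_key(line) for line in sample_lines]
```

Here the second line has no `id:` key, so it yields no key in this sketch.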

Example: create an index on the second field of a CSV file:

$ zindex file.gz --delimiter , --field 2
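As a rough picture of what field selection means, the Python sketch below splits a line on the delimiter and picks a field. It assumes fields are numbered from 1, which is what this README's configuration section suggests when it calls "fieldNum": 1 the first field. It uses a plain split and makes no claim about how zindex treats quoted CSV fields; the helper name is hypothetical.

```python
def extract_field(line, delimiter=",", field=2):
    """Return the 1-based `field` of `line`, or None if the line has too few fields."""
    parts = line.rstrip("\n").split(delimiter)
    return parts[field - 1] if len(parts) >= field else None


# Hypothetical CSV rows, for illustration only.
second = extract_field("alice,42,london\n")
missing = extract_field("only-one-field\n")
```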

Example: create an index on a JSON field orderId.id in any of the items in the document root's actions array (requires jq). The jq query creates an array of all the orderId.ids, then joins them with a space to ensure each individual line piped to jq creates a single line of output, with multiple matches separated by spaces (which is the default separator).

$ zindex file.gz --pipe "jq --raw-output --unbuffered '[.actions[].orderId.id] | join(\" \")'"
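For readers less familiar with jq, the following Python sketch mirrors what the query above is described as doing: collect every orderId.id under actions and join them with spaces, so each input line yields exactly one output line. The function name and sample document are assumptions, and edge cases such as missing fields are not handled the way jq would handle them.

```python
import json


def order_ids(line):
    """Return all actions[].orderId.id values from one JSON line, joined by spaces."""
    doc = json.loads(line)
    return " ".join(str(action["orderId"]["id"]) for action in doc["actions"])


# Hypothetical input line, for illustration only.
line = '{"actions": [{"orderId": {"id": "A1"}}, {"orderId": {"id": "B2"}}]}'
joined = order_ids(line)
```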

Multiple indices, and configuration of index creation through a JSON configuration file, are also supported; see below.

Querying the index

The zq program is used to query an index. It's given the name of the compressed file and a list of queries. For example:

$ zq file.gz 1023 4443 554

It's also possible to output by line number, so to print lines 1 and 1000 from a file:

$ zq file.gz --line 1 1000
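If zq is driven from a script, a small wrapper can build the command line. The Python sketch below only uses options that appear in this README (`--line`, and `-i` from the multiple-indices section). The helper names are hypothetical, and placing `-i` directly after the file name is an assumption about option ordering, not something this README confirms.

```python
import subprocess


def zq_command(path, queries, index=None, by_line=False):
    """Build an argument list for zq from a file name and a list of queries."""
    cmd = ["zq", path]
    if index is not None:
        cmd += ["-i", index]  # assumed position; the README only names the option
    if by_line:
        cmd.append("--line")
    cmd += [str(q) for q in queries]
    return cmd


def run_zq(path, queries, **options):
    """Run zq (which must be on PATH) and return its output as a list of lines."""
    result = subprocess.run(
        zq_command(path, queries, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


by_key = zq_command("file.gz", [1023, 4443, 554])
by_line = zq_command("file.gz", [1, 1000], by_line=True)
```

Passing the arguments as a list avoids shell quoting problems when keys contain spaces or special characters.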

Building from source

zindex uses CMake for its basic building (though it has a bootstrapping Makefile) and requires a C++11-compatible compiler (GCC 4.8 or above, or clang 3.4 or above). It also requires zlib. With the relevant compiler available, building ought to be as simple as:

$ git clone https://github.com/mattgodbolt/zindex.git
$ cd zindex
$ make

Binaries are left in build/Release.

Additionally a static binary can be built if you're happy to dip your toe into CMake:

$ cd path/to/build/directory
$ cmake path/to/zindex/checkout/dir -DStatic:BOOL=On -DCMAKE_BUILD_TYPE=Release
$ make

Multiple indices

To support more than one index, or for easier configuration than all the command-line flags that might be needed, there is a JSON configuration format. Pass the --config <yourconfigfile>.json option and put something like this in the configuration file:

{ 
    "indexes": [
        {
            "type": "field",
            "delimiter": "\t",
            "fieldNum": 1
        },
        {
            "name": "secondary",
            "type": "field",
            "delimiter": "\t",
            "fieldNum": 2
        }
    ]
}

This creates two indices, one on the first field and one on the second field, as delimited by tabs. One can then specify which index to query with the -i <index> option of zq.
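When there are many indices, the configuration file can be generated rather than written by hand. The Python sketch below reproduces the two-index example above using only the keys shown there ("indexes", "name", "type", "delimiter", "fieldNum"). The helper name is hypothetical, and no other index types or keys are assumed.

```python
import json


def field_index(field_num, delimiter="\t", name=None):
    """Describe one field-based index; `name` is omitted for the unnamed index."""
    entry = {"type": "field", "delimiter": delimiter, "fieldNum": field_num}
    if name is not None:
        entry = {"name": name, **entry}
    return entry


config = {"indexes": [field_index(1), field_index(2, name="secondary")]}
config_text = json.dumps(config, indent=4)
```

Writing `config_text` to a file and passing that file with --config should then be equivalent to the hand-written example.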

Issues and feature requests

See the issue tracker for TODOs and known bugs. Please raise bugs there, and feel free to submit suggestions there also.

Feel free to contact me if you prefer email over bug trackers.

Download files

Source distribution

  • zindex_py-0.0.7.tar.gz (1.7 MB)

Built distributions

  • zindex_py-0.0.7-cp311-cp311-manylinux_2_39_x86_64.whl (1.5 MB): CPython 3.11, manylinux, glibc 2.39+, x86-64
  • zindex_py-0.0.7-cp311-cp311-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.11, manylinux, glibc 2.34+, x86-64
  • zindex_py-0.0.7-cp310-cp310-manylinux_2_39_x86_64.whl (1.5 MB): CPython 3.10, manylinux, glibc 2.39+, x86-64
  • zindex_py-0.0.7-cp310-cp310-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.10, manylinux, glibc 2.34+, x86-64
  • zindex_py-0.0.7-cp39-cp39-manylinux_2_28_x86_64.whl (2.1 MB): CPython 3.9, manylinux, glibc 2.28+, x86-64
