
Indexer for GZIP specially built for DLIO Profiler.


DISCLAIMER

This repo is a fork of the original zindex repo at https://github.com/mattgodbolt/zindex. The fork modifies it to work cohesively with DLIO Profiler (https://github.com/hariharan-devarajan/dlio-profiler).

zindex creates and queries an index on a compressed, line-based text file in a time- and space-efficient way.

The itch I had

I have many multi-gigabyte gzipped text log files and I'd like to be able to find data in them by an index. There's a key on each line that a simple regex can pull out. However, finding a particular record requires zgrep, which takes ages because it has to decompress gigabytes of preceding data to reach each record.

Enter zindex, which builds an index and also stores decompression checkpoints along the way, allowing lightning-fast random access. Pulling out single lines, either by line number or by an index entry, is then almost instant, even for huge files. The indices themselves are small too, typically ~10% of the compressed file size for a simple unique numeric index.
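The checkpoint idea can be illustrated with a small Python sketch. It is not how zindex is implemented: it keeps snapshots of a zlib decompressor in memory rather than persisting anything to an index file, and the names build_checkpoints, read_at and CHUNK are made up for this example. It only shows why a saved decompression state lets a reader resume close to the wanted offset instead of starting from the first byte.

```python
import bisect
import gzip
import zlib

CHUNK = 1024  # compressed bytes fed between checkpoints (arbitrary for this sketch)


def build_checkpoints(compressed, chunk=CHUNK):
    """Decompress once, recording (uncompressed_pos, compressed_pos, snapshot)."""
    d = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
    checkpoints = [(0, 0, d.copy())]
    out_pos = 0
    for in_pos in range(0, len(compressed), chunk):
        out_pos += len(d.decompress(compressed[in_pos:in_pos + chunk]))
        if d.eof:
            break  # end of the gzip stream; nothing left to checkpoint
        checkpoints.append((out_pos, in_pos + chunk, d.copy()))
    return checkpoints


def read_at(compressed, checkpoints, offset, length, chunk=CHUNK):
    """Return `length` uncompressed bytes starting at uncompressed `offset`."""
    starts = [c[0] for c in checkpoints]
    i = bisect.bisect_right(starts, offset) - 1  # last checkpoint at or before offset
    out_pos, in_pos, snapshot = checkpoints[i]
    d = snapshot.copy()  # copy so the stored snapshot stays reusable
    skip = offset - out_pos  # bytes to discard between checkpoint and target
    buf = bytearray()
    while len(buf) < skip + length and in_pos < len(compressed) and not d.eof:
        buf += d.decompress(compressed[in_pos:in_pos + chunk])
        in_pos += chunk
    return bytes(buf[skip:skip + length])


# Demo data: a gzipped "log" with one numeric id per line.
data = b"".join(b"id:%d payload\n" % i for i in range(20000))
compressed = gzip.compress(data)
checkpoints = build_checkpoints(compressed)
print(read_at(compressed, checkpoints, 100000, 40))
```

Each read decompresses at most one chunk's worth of skipped data plus the requested bytes, which is the trade-off checkpoint spacing controls: more checkpoints mean faster reads and a larger index.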

Creating an index

zindex needs to be told what part of each line constitutes the index. This can be done by a regular expression, by field, or by piping each line through an external program.

By default zindex creates an index of file.gz.zindex when asked to index file.gz.

Example: create an index on lines matching a numeric regular expression. The capture group indicates the part that's to be indexed, and the options tell zindex that each line has a unique, numeric index.

$ zindex file.gz --regex 'id:([0-9]+)' --numeric --unique
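To make the role of the capture group concrete, here is a rough Python sketch of the same extraction. The function build_regex_index and its numeric and unique parameters are hypothetical stand-ins for the command-line options, and the choices to skip non-matching lines and to reject duplicate keys are assumptions of the sketch, not documented zindex behaviour.

```python
import re


def build_regex_index(lines, pattern, numeric=False, unique=False):
    """Map each captured key to the 1-based numbers of the lines containing it."""
    rx = re.compile(pattern)
    index = {}
    for line_no, line in enumerate(lines, start=1):
        match = rx.search(line)
        if match is None:
            continue  # assumption: lines without a key are not indexed
        # Only the first capture group becomes the key, not the whole match.
        key = int(match.group(1)) if numeric else match.group(1)
        if unique and key in index:
            raise ValueError(f"duplicate key {key!r} on line {line_no}")
        index.setdefault(key, []).append(line_no)
    return index


log = ["start", "id:1023 ok", "id:554 retry", "noise", "id:4443 ok"]
index = build_regex_index(log, r"id:([0-9]+)", numeric=True, unique=True)
print(index)
```

Treating keys as integers rather than strings is what makes a numeric index compact and lets it sort numerically.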

Example: create an index on the second field of a CSV file:

$ zindex file.gz --delimiter , --field 2
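Field-based keys can be pictured as a plain split on the delimiter. The helper field_key below is hypothetical, and it deliberately ignores CSV quoting; whether zindex treats quoted delimiters specially is not stated here, so take this only as a sketch of 1-based field selection.

```python
def field_key(line, delimiter=",", field=2):
    """Return the 1-based `field` of `line`, or None if the line has too few fields."""
    parts = line.rstrip("\n").split(delimiter)
    return parts[field - 1] if 0 < field <= len(parts) else None


rows = ["alice,1001,paid\n", "bob,1002,open\n", "short\n"]
keys = [field_key(row) for row in rows]
print(keys)
```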

Example: create an index on a JSON field orderId.id in any of the items in the document root's actions array (requires jq). The jq query creates an array of all the orderId.ids, then joins them with a space to ensure each individual line piped to jq creates a single line of output, with multiple matches separated by spaces (which is the default separator).

$ zindex file.gz --pipe "jq --raw-output --unbuffered '[.actions[].orderId.id] | join(\" \")'"
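For readers less familiar with jq, the following Python sketch mirrors what that filter emits for one input line and how the space-separated output turns into several keys. The function order_ids and the sample document are invented for illustration; only the shape of the data (an actions array whose items carry orderId.id) comes from the example above.

```python
import json


def order_ids(line):
    """Python equivalent of the jq filter: [.actions[].orderId.id] | join(" ")"""
    doc = json.loads(line)
    return " ".join(str(action["orderId"]["id"]) for action in doc["actions"])


line = '{"actions": [{"orderId": {"id": 17}}, {"orderId": {"id": 42}}]}'
output = order_ids(line)  # exactly one line of output per input line
keys = output.split()     # split on whitespace: one key per matched id
print(output, keys)
```

The one-line-in, one-line-out property is the important part: it is what keeps the external program's output aligned with the lines being indexed.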

Multiple indices, and configuration of index creation through a JSON configuration file, are also supported; see below.

Querying the index

The zq program is used to query an index. It's given the name of the compressed file and a list of queries. For example:

$ zq file.gz 1023 4443 554

It's also possible to output by line number, so to print lines 1 and 1000 from a file:

$ zq file.gz --line 1 1000
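Lookup by line number boils down to knowing where each line starts. The sketch below works on already-uncompressed bytes to isolate that part; line_offsets and get_line are hypothetical names, and whether zindex stores a table shaped like this is an assumption. In the real tool the offsets would refer to positions in the uncompressed stream, reached via a decompression checkpoint.

```python
def line_offsets(data):
    """Return the byte offset at which each line of `data` starts."""
    offsets = [0]
    pos = data.find(b"\n")
    while pos != -1 and pos + 1 < len(data):
        offsets.append(pos + 1)  # the next line starts right after the newline
        pos = data.find(b"\n", pos + 1)
    return offsets


def get_line(data, offsets, line_no):
    """Fetch a 1-based line by slicing between two recorded offsets."""
    start = offsets[line_no - 1]
    end = offsets[line_no] if line_no < len(offsets) else len(data)
    return data[start:end].rstrip(b"\n")


data = b"".join(b"row %d\n" % n for n in range(1, 1001))
offsets = line_offsets(data)
print(get_line(data, offsets, 1), get_line(data, offsets, 1000))
```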

Building from source

zindex uses CMake for its basic building (though it also has a bootstrapping Makefile), and requires a C++11-compatible compiler (GCC 4.8 or above, or clang 3.4 or above). It also requires zlib. With the relevant compiler available, building ought to be as simple as:

$ git clone https://github.com/mattgodbolt/zindex.git
$ cd zindex
$ make

Binaries are left in build/Release.

Additionally, a static binary can be built if you're happy to dip your toe into CMake:

$ cd path/to/build/directory
$ cmake path/to/zindex/checkout/dir -DStatic:BOOL=On -DCMAKE_BUILD_TYPE=Release
$ make

Multiple indices

To support more than one index, or for easier configuration than all the command-line flags that might be needed, there is a JSON configuration format. Pass the --config <yourconfigfile>.json option and put something like this in the configuration file:

{ 
    "indexes": [
        {
            "type": "field",
            "delimiter": "\t",
            "fieldNum": 1
        },
        {
            "name": "secondary",
            "type": "field",
            "delimiter": "\t",
            "fieldNum": 2
        }
    ]
}

This creates two indices, one on the first field and one on the second field, as delimited by tabs. One can then specify which index to query with the -i <index> option of zq.
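If the configuration is generated by a script, letting a JSON library handle the escaping of the tab delimiter avoids hand-editing mistakes. The sketch below builds the same document in Python and lists the index names it defines. The helper index_names is hypothetical, and the label "default" for an index without a name is an assumption of this sketch rather than something stated above.

```python
import json

config = {
    "indexes": [
        {"type": "field", "delimiter": "\t", "fieldNum": 1},
        {"name": "secondary", "type": "field", "delimiter": "\t", "fieldNum": 2},
    ]
}

# json.dumps writes the tab character as the JSON escape \t.
text = json.dumps(config, indent=4)


def index_names(config_text, unnamed="default"):
    """List index names in order; `unnamed` is an assumed label for nameless entries."""
    return [entry.get("name", unnamed) for entry in json.loads(config_text)["indexes"]]


print(text)
print(index_names(text))
```

The resulting text can be saved to a file and passed with --config.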

Issues and feature requests

See the issue tracker for TODOs and known bugs. Please raise bugs there, and feel free to submit suggestions there also.

Feel free to contact me if you prefer email over bug trackers.

Download files

Source distribution:

zindex_py-0.0.4.tar.gz (1.6 MB)

Built distributions:

zindex_py-0.0.4-cp310-cp310-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.10, manylinux (glibc 2.34+), x86-64

zindex_py-0.0.4-cp39-cp39-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.9, manylinux (glibc 2.34+), x86-64

zindex_py-0.0.4-cp38-cp38-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.8, manylinux (glibc 2.34+), x86-64

zindex_py-0.0.4-cp37-cp37m-manylinux_2_34_x86_64.whl (1.4 MB): CPython 3.7m, manylinux (glibc 2.34+), x86-64
