Skip to main content

Print an Elasticsearch inverted index as a CSV table or JSON object.

Project description

# inelastic
Print an Elasticsearch inverted index as a CSV table or JSON object.


`inelastic` builds an approximation of how an inverted index would look like for a particular index and document field, using the [Multi termvectors API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-termvectors.html) on all stored documents.

## Example

Having the following index:
```
PUT /tweets
{
"mappings": {
"_doc": {
"properties": {
"content": {
"type": "text"
}
}
}
}
}
```

with the following documents:
```
POST /tweets/_doc/_bulk
{ "index": { "_id": 1 }}
{ "content": "This is my first tweet." }
{ "index": { "_id": 2 }}
{ "content": "Most Elasticsearch examples use tweets." }
{ "index": { "_id": 3 }}
{ "content": "This is an example." }
{ "index": { "_id": 4 }}
{ "content": "Adding some more tweets." }
{ "index": { "_id": 5 }}
{ "content": "Adding more and more tweets." }
```

`inelastic` could be used as follows (combined with the `column` command):

```bash
$ ./inelastic.py -i tweets -f content | column -t -s ,
```

Which would output:
```
term freq doc_count d0 d1 d2
adding 2 2 4 5
an 1 1 3
and 1 1 5
elasticsearch 1 1 2
example 1 1 3
examples 1 1 2
first 1 1 1
is 2 2 1 3
more 3 2 4 5
most 1 1 2
my 1 1 1
some 1 1 4
this 2 2 1 3
tweet 1 1 1
tweets 3 3 2 4 5
use 1 1 2
```

The `freq` field specifies the total amount of times the term appears in all documents, and the `doc_count` field specifies how many documents contain the term at least once. The `d0`, `d1`... fields list the IDs for documents containing the term.

The chosen document field's type must be `text` or `keyword`.

## Installation
Having `git` and Python 3 installed, use:
```bash
$ git clone https://github.com/federicotdn/inelastic.git
$ cd inelastic
$ pip3 install -r requirements.txt # Optionally, do this inside a virtual environment
$ chmod +x inelastic.py
```

Inelastic is compatible with Elasticsearch version `6.0` and later.

## Usage
These are the arguments `inelastic` accepts:
- `-i` (`--index`): Index name (**required**).
- `-f` (`--field`): Document field name from which to generate inverted index (**required**).
- `-l` (`--id-field`): Document field to use as ID when printing results (*default: _id*).
- `-o` (`--output`): Output format, `json` or `csv` (*default: `csv`*).
- `-p` (`--port`): Elasticsearch host port (*default: 9200*).
- `-e` (`--host`): Elasticsearch host address (*default: localhost*).
- `-d` (`--doctype`): Document type (*default: _doc*).
- `-v` (`--verbose`): Print debug information (*default: false*).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

inelastic-0.1.0.tar.gz (8.3 kB view details)

Uploaded Source

File details

Details for the file inelastic-0.1.0.tar.gz.

File metadata

  • Download URL: inelastic-0.1.0.tar.gz
  • Upload date:
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/20.7.0 requests-toolbelt/0.8.0 tqdm/4.24.0 CPython/3.5.2

File hashes

Hashes for inelastic-0.1.0.tar.gz
Algorithm Hash digest
SHA256 631fcf08851332b3d2a30656f5d1ca3078f8f8ae4c3a32678b4d6617eb238584
MD5 0dede9b98698358b46334269fc570408
BLAKE2b-256 a7801400838f372171fb546d8ae5dc9c621c4290a6f5a5e5537ae2c5f50f5ecd

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page