# inelastic
[![Build Status](https://travis-ci.org/federicotdn/inelastic.svg)](https://travis-ci.org/federicotdn/inelastic)
[![Version](https://img.shields.io/pypi/v/inelastic.svg?style=flat)](https://pypi.python.org/pypi/inelastic)

Print an Elasticsearch inverted index as a CSV table or JSON object.

`inelastic` builds an approximation of what an inverted index would look like for a particular index and document field, using the [Multi termvectors API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-termvectors.html) on all stored documents.
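
As a rough illustration of the idea (not necessarily how `inelastic` implements it), the sketch below folds a Multi termvectors style response into a term-to-postings mapping. The response keys (`docs`, `term_vectors`, `terms`, `term_freq`) follow the Elasticsearch documentation; the helper name `build_inverted_index` and the hand-written sample response are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(response, field):
    """Fold an mtermvectors-style response into {term: {doc_id: term_freq}}.

    Hypothetical helper for illustration; it is not part of inelastic.
    """
    index = defaultdict(dict)
    for doc in response.get("docs", []):
        # Documents without term vectors for `field` contribute nothing.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, info in terms.items():
            index[term][doc["_id"]] = info["term_freq"]
    return dict(index)


# Hand-written sample shaped like a Multi termvectors response (assumed data).
sample = {
    "docs": [
        {"_id": "4", "term_vectors": {"content": {"terms": {
            "adding": {"term_freq": 1}, "some": {"term_freq": 1},
            "more": {"term_freq": 1}, "tweets": {"term_freq": 1}}}}},
        {"_id": "5", "term_vectors": {"content": {"terms": {
            "adding": {"term_freq": 1}, "more": {"term_freq": 2},
            "and": {"term_freq": 1}, "tweets": {"term_freq": 1}}}}},
    ]
}

index = build_inverted_index(sample, "content")
for term in sorted(index):
    postings = index[term]
    # term, total frequency, number of documents, document IDs
    print(term, sum(postings.values()), len(postings), sorted(postings))
```

Against a live cluster, the `response` dictionary would come from the Multi termvectors endpoint rather than from a literal.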

## Installation
To install `inelastic`, run the following command:
```bash
$ pip3 install --upgrade inelastic
```

`inelastic` is compatible with Elasticsearch versions `6.0.0` and later.

## Example

Given the following index:
```
PUT /tweets
{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text"
        }
      }
    }
  }
}
```

with the following documents:
```
POST /tweets/_doc/_bulk
{ "index": { "_id": 1 }}
{ "content": "This is my first tweet." }
{ "index": { "_id": 2 }}
{ "content": "Most Elasticsearch examples use tweets." }
{ "index": { "_id": 3 }}
{ "content": "This is an example." }
{ "index": { "_id": 4 }}
{ "content": "Adding some more tweets." }
{ "index": { "_id": 5 }}
{ "content": "Adding more and more tweets." }
```

`inelastic` can be used as follows (combined with the `column` command):

```bash
$ inelastic -i tweets -f content | column -t -s ,
```

This would output:
```
term           freq  doc_count  d0  d1  d2
adding         2     2          4   5
an             1     1          3
and            1     1          5
elasticsearch  1     1          2
example        1     1          3
examples       1     1          2
first          1     1          1
is             2     2          1   3
more           3     2          4   5
most           1     1          2
my             1     1          1
some           1     1          4
this           2     2          1   3
tweet          1     1          1
tweets         3     3          2   4   5
use            1     1          2

The `freq` field specifies the total number of times the term appears across all documents, and the `doc_count` field specifies how many documents contain the term at least once. The `d0`, `d1`... fields list the IDs of the documents containing the term.
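
To make the difference between `freq` and `doc_count` concrete, here is a minimal pure-Python sketch that recomputes both for the five example documents. It assumes a simplified tokenizer (lowercase, split on non-alphanumeric characters) as a stand-in for the Elasticsearch analyzer, so results for real indices may differ; the `tokenize` helper is hypothetical.

```python
import re
from collections import Counter, defaultdict

# The example documents, keyed by ID.
docs = {
    1: "This is my first tweet.",
    2: "Most Elasticsearch examples use tweets.",
    3: "This is an example.",
    4: "Adding some more tweets.",
    5: "Adding more and more tweets.",
}


def tokenize(text):
    """Simplified stand-in for an analyzer (assumption, not the real one)."""
    return re.findall(r"[a-z0-9]+", text.lower())


# postings[term][doc_id] is the number of times term occurs in that document.
postings = defaultdict(Counter)
for doc_id, text in docs.items():
    for token in tokenize(text):
        postings[token][doc_id] += 1

rows = []
for term in sorted(postings):
    counts = postings[term]
    freq = sum(counts.values())  # total occurrences across all documents
    doc_count = len(counts)      # number of documents containing the term
    rows.append((term, freq, doc_count, sorted(counts)))

for row in rows:
    print(row)
```

For instance, `more` occurs once in document 4 and twice in document 5, so its `freq` is 3 while its `doc_count` is 2.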

The chosen document field's type must be `text` or `keyword`.

## Usage
These are the arguments `inelastic` accepts:
- `-i` (`--index`): Index name (**required**).
- `-f` (`--field`): Document field name from which to generate the inverted index (**required**).
- `-l` (`--id-field`): Document field to use as ID when printing results (*default: _id*).
- `-o` (`--output`): Output format, `json` or `csv` (*default: `csv`*).
- `-p` (`--port`): Elasticsearch host port (*default: 9200*).
- `-e` (`--host`): Elasticsearch host address (*default: localhost*).
- `-d` (`--doctype`): Document type (*default: _doc*).
- `-v` (`--verbose`): Print debug information (*default: false*).
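
The CSV output can also be consumed from a script. The sketch below parses it with Python's standard `csv` module, assuming the layout implied by the example above: a header row followed by `term,freq,doc_count` and a variable number of document IDs. The function name `parse_inverted_index_csv` and the sample text are hypothetical, and the exact padding of the ID columns in real output is not confirmed here.

```python
import csv
import io

# Assumed sample of the CSV output; real output may pad the ID columns.
sample_csv = """term,freq,doc_count,d0,d1,d2
adding,2,2,4,5
more,3,2,4,5
tweets,3,3,2,4,5
"""


def parse_inverted_index_csv(text):
    """Parse CSV text into {term: {"freq", "doc_count", "ids"}}."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row, if any
    entries = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        term, freq, doc_count, *ids = row
        entries[term] = {
            "freq": int(freq),
            "doc_count": int(doc_count),
            # Drop empty cells in case shorter rows are padded.
            "ids": [doc_id for doc_id in ids if doc_id],
        }
    return entries


entries = parse_inverted_index_csv(sample_csv)
print(entries["tweets"])
```

In practice the text would come from a file produced with a shell redirect such as `inelastic -i tweets -f content > index.csv`.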

## Scripting
The `inelastic` module exposes the `InvertedIndex` class, which can be used in custom Python scripts:
```python
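from inelastic import InvertedIndex
from elasticsearch import Elasticsearch

# Client created with its default connection settings
es = Elasticsearch()
ii = InvertedIndex(search_size=250, scroll_time='10s')

# Read the 'content' field of every document in the 'tweets' index
n_docs, errors = ii.read_index(es, 'tweets', 'content')

print('# docs: {}, # errors: {}'.format(n_docs, errors))

# Each entry is a (term, IndexEntry) tuple
for entry in ii.to_list():
    print(entry)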
```

When run, the previous script will output:
```
# docs: 5, # errors: 0
('adding', <IndexEntry IDs: ['4', '5']>)
('an', <IndexEntry IDs: ['3']>)
('and', <IndexEntry IDs: ['5']>)
('elasticsearch', <IndexEntry IDs: ['2']>)
('example', <IndexEntry IDs: ['3']>)
('examples', <IndexEntry IDs: ['2']>)
('first', <IndexEntry IDs: ['1']>)
('is', <IndexEntry IDs: ['1', '3']>)
('more', <IndexEntry IDs: ['4', '5']>)
('most', <IndexEntry IDs: ['2']>)
('my', <IndexEntry IDs: ['1']>)
('some', <IndexEntry IDs: ['4']>)
('this', <IndexEntry IDs: ['1', '3']>)
('tweet', <IndexEntry IDs: ['1']>)
('tweets', <IndexEntry IDs: ['2', '4', '5']>)
('use', <IndexEntry IDs: ['2']>)
```
