# inelastic
[![Build Status](https://travis-ci.org/federicotdn/inelastic.svg)](https://travis-ci.org/federicotdn/inelastic)
[![Version](https://img.shields.io/pypi/v/inelastic.svg?style=flat)](https://pypi.python.org/pypi/inelastic)

Print an Elasticsearch inverted index as a CSV table or JSON object.

`inelastic` builds an approximation of what an inverted index would look like for a particular index and document field, using the [Multi termvectors API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-termvectors.html) on all stored documents.
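To make the idea concrete, here is a minimal sketch of how per-document term vectors can be folded into an inverted index. It is not `inelastic`'s actual implementation: the function name `build_inverted_index` and the hand-written `sample_response` are assumptions for illustration. The sample only mimics the general shape of a Multi termvectors response (`docs`, then `term_vectors`, then the field name, then `terms` with a `term_freq` each).

```python
from collections import defaultdict


def build_inverted_index(mtermvectors_response, field):
    """Map each term to {doc_id: term_freq} for the given field (hypothetical helper)."""
    index = defaultdict(dict)
    for doc in mtermvectors_response.get("docs", []):
        # Documents that carry no term vectors for the field are skipped.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, info in terms.items():
            index[term][doc["_id"]] = info["term_freq"]
    return dict(index)


# Hand-written sample shaped like a Multi termvectors response (illustrative only).
sample_response = {
    "docs": [
        {"_id": "1", "term_vectors": {"content": {"terms": {
            "this": {"term_freq": 1}, "is": {"term_freq": 1},
            "my": {"term_freq": 1}, "first": {"term_freq": 1},
            "tweet": {"term_freq": 1}}}}},
        {"_id": "5", "term_vectors": {"content": {"terms": {
            "adding": {"term_freq": 1}, "more": {"term_freq": 2},
            "and": {"term_freq": 1}, "tweets": {"term_freq": 1}}}}},
    ]
}

index = build_inverted_index(sample_response, "content")
for term in sorted(index):
    print(term, index[term])
```

Each term ends up pointing at the documents that contain it, which is the inversion of the per-document view that the term vectors provide.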

## Installation
To install `inelastic`, run the following command:
```bash
$ pip3 install --upgrade inelastic
```

`inelastic` is compatible with Elasticsearch versions `6.0.0` and later.

## Example

Given the following index:
```
PUT /tweets
{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text"
        }
      }
    }
  }
}
```

with the following documents:
```
POST /tweets/_doc/_bulk
{ "index": { "_id": 1 }}
{ "content": "This is my first tweet." }
{ "index": { "_id": 2 }}
{ "content": "Most Elasticsearch examples use tweets." }
{ "index": { "_id": 3 }}
{ "content": "This is an example." }
{ "index": { "_id": 4 }}
{ "content": "Adding some more tweets." }
{ "index": { "_id": 5 }}
{ "content": "Adding more and more tweets." }
```

`inelastic` could be used as follows (combined with the `column` command):

```bash
$ inelastic -i tweets -f content | column -t -s ,
```

Which would output:
```
term           freq  doc_count  d0  d1  d2
adding         2     2          4   5
an             1     1          3
and            1     1          5
elasticsearch  1     1          2
example        1     1          3
examples       1     1          2
first          1     1          1
is             2     2          1   3
more           3     2          4   5
most           1     1          2
my             1     1          1
some           1     1          4
this           2     2          1   3
tweet          1     1          1
tweets         3     3          2   4   5
use            1     1          2

The `freq` field specifies the total number of times the term appears across all documents, and the `doc_count` field specifies how many documents contain the term at least once. The `d0`, `d1`... fields list the IDs of the documents containing the term.
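The relationship between `freq`, `doc_count` and the ID columns can be checked by hand with a short script. The sketch below recomputes them for the five sample tweets without contacting Elasticsearch. The `tokenize` function is an assumption: a rough stand-in for the analyzer (lowercase, split on non-alphanumeric characters), so its output could differ from a real index on more complex text.

```python
import re
from collections import Counter, defaultdict

# The sample documents from the example, keyed by document ID.
docs = {
    "1": "This is my first tweet.",
    "2": "Most Elasticsearch examples use tweets.",
    "3": "This is an example.",
    "4": "Adding some more tweets.",
    "5": "Adding more and more tweets.",
}


def tokenize(text):
    # Rough approximation of an analyzer: lowercase, keep alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


# term -> Counter({doc_id: occurrences of the term in that document})
postings = defaultdict(Counter)
for doc_id, text in docs.items():
    for token in tokenize(text):
        postings[token][doc_id] += 1

rows = []
for term in sorted(postings):
    per_doc = postings[term]
    freq = sum(per_doc.values())  # total occurrences across all documents
    doc_count = len(per_doc)      # documents containing the term at least once
    rows.append((term, freq, doc_count, sorted(per_doc)))

# Print in the same column order as the CSV output: term,freq,doc_count,d0,d1,...
for term, freq, doc_count, ids in rows:
    print(",".join([term, str(freq), str(doc_count), *ids]))
```

For these documents the counts match the table above: `more` occurs three times but in only two documents (once in document 4 and twice in document 5), so its `freq` is 3 and its `doc_count` is 2.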

The chosen document field's type must be `text` or `keyword`.

## Usage
These are the arguments `inelastic` accepts:
- `-i` (`--index`): Index name (**required**).
- `-f` (`--field`): Document field name from which to generate the inverted index (**required**).
- `-l` (`--id-field`): Document field to use as ID when printing results (*default: `_id`*).
- `-o` (`--output`): Output format, `json` or `csv` (*default: `csv`*).
- `-p` (`--port`): Elasticsearch host port (*default: `9200`*).
- `-e` (`--host`): Elasticsearch host address (*default: `localhost`*).
- `-d` (`--doctype`): Document type (*default: `_doc`*).
- `-v` (`--verbose`): Print debug information (*default: `false`*).

## Scripting
The `inelastic` module exposes the `InvertedIndex` class, which can be used in custom Python scripts:
```python
from inelastic import InvertedIndex
from elasticsearch import Elasticsearch

es = Elasticsearch()
ii = InvertedIndex(search_size=250, scroll_time='10s')

n_docs, errors = ii.read_index(es, 'tweets', 'content')

print('# docs: {}, # errors: {}'.format(n_docs, errors))

for entry in ii.to_list():
    print(entry)
```

When run, the previous script will output:
```
# docs: 5, # errors: 0
('adding', <IndexEntry IDs: ['4', '5']>)
('an', <IndexEntry IDs: ['3']>)
('and', <IndexEntry IDs: ['5']>)
('elasticsearch', <IndexEntry IDs: ['2']>)
('example', <IndexEntry IDs: ['3']>)
('examples', <IndexEntry IDs: ['2']>)
('first', <IndexEntry IDs: ['1']>)
('is', <IndexEntry IDs: ['1', '3']>)
('more', <IndexEntry IDs: ['4', '5']>)
('most', <IndexEntry IDs: ['2']>)
('my', <IndexEntry IDs: ['1']>)
('some', <IndexEntry IDs: ['4']>)
('this', <IndexEntry IDs: ['1', '3']>)
('tweet', <IndexEntry IDs: ['1']>)
('tweets', <IndexEntry IDs: ['2', '4', '5']>)
('use', <IndexEntry IDs: ['2']>)
```
