A Pandastic Elasticsearch client for data analysis.

## Pandasticsearch = Elasticsearch + Pandas DataFrame

Pandasticsearch is a lightweight Elasticsearch client for data analysis. It converts query results into
[Pandas](http://pandas.pydata.org) DataFrame objects, which gives direct access to Elasticsearch
analysis results such as multi-level nested aggregations. Elasticsearch excels at real-time indexing,
search, and data analysis, but the results returned by its REST API still require processing before
data scientists can analyze them.
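To make the "still require processing" point concrete, here is a minimal sketch of flattening a nested aggregation response into flat rows by hand, using only the standard library. The response dictionary and the aggregation names (`by_gender`, `avg_age`) are made up for illustration, and `flatten_buckets` is a hypothetical helper, not part of Pandasticsearch.

```python
# Hypothetical response for a terms aggregation on "gender" with a nested
# avg metric on "age". The aggregation names are assumptions for illustration.
response = {
    "aggregations": {
        "by_gender": {
            "buckets": [
                {"key": "male", "doc_count": 2, "avg_age": {"value": 12.0}},
                {"key": "female", "doc_count": 1, "avg_age": {"value": 12.0}},
            ]
        }
    }
}


def flatten_buckets(aggregations, parent=None):
    """Yield one flat dict per innermost bucket of a nested aggregation."""
    parent = parent or {}
    for name, agg in aggregations.items():
        if not isinstance(agg, dict) or "buckets" not in agg:
            continue  # only bucket aggregations are expanded at this level
        for bucket in agg["buckets"]:
            row = dict(parent)
            row[name] = bucket["key"]
            row["doc_count"] = bucket["doc_count"]
            nested = {}
            for key, value in bucket.items():
                if isinstance(value, dict):
                    if "buckets" in value:
                        nested[key] = value          # deeper bucket level
                    elif "value" in value:
                        row[key] = value["value"]    # single-value metric
            if nested:
                yield from flatten_buckets(nested, row)
            else:
                yield row


rows = list(flatten_buckets(response["aggregations"]))
```

Each resulting row is a plain dictionary, so the list can be handed to `pandas.DataFrame(rows)` if Pandas is installed. This is the kind of boilerplate the library is meant to remove.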

To install:

```
pip3 install pandasticsearch
```

## Connect to ES

### High Level API

A `DataFrame` object accesses Elasticsearch through a high-level API, similar to [elasticsearch-dsl-py](https://github.com/elastic/elasticsearch-dsl-py).

It is type-safe, easy to use, and Pandas-flavored.

```python
# create a DataFrame object
>>> from pandasticsearch import DataFrame
>>> df = DataFrame.from_es('http://localhost:9200', index='people')
>>> df.columns
['name', 'age', 'gender']
>>> df.printSchema()
company
|-- employee
  |-- name: {'index': 'not_analyzed', 'type': 'string'}
  |-- age: {'type': 'integer'}
  |-- gender: {'index': 'not_analyzed', 'type': 'string'}

# filter
>>> df.filter(df['age'] < 25).collect()
[Row(age=12,gender='female',name='Alice'), Row(age=11,gender='male',name='Bob'), Row(age=13,gender='male',name='Leo')]

# projection
>>> df.filter(df['age'] < 25).select('name', 'age').collect()
[Row(age=12,name='Alice'), Row(age=11,name='Bob'), Row(age=13,name='Leo')]

# print the rows into console
>>> df.filter(df['age'] < 25).select('name').show(3)
+------+
| name |
+------+
| Alice|
| Bob  |
| Leo  |
+------+

# aggregation
>>> from pandasticsearch import Avg
>>> df[df['gender'] == 'male'].agg(Avg('age')).collect()
[Row(avg(age)=12)]

# convert to Pandas object for subsequent analysis
>>> df[df['gender'] == 'male'].agg(Avg('age')).to_pandas()
   avg(age)
0        12
```
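For readers wondering how an expression such as `df['age'] < 25` can become an Elasticsearch query, the following is a rough sketch of the general technique: operator overloading that builds query DSL fragments. The `Column` class and `build_search_body` function are hypothetical and written only for this illustration. They are not Pandasticsearch's implementation, and the query the library actually sends may look different.

```python
class Column:
    """Hypothetical stand-in for a DataFrame column (not the library's class)."""

    def __init__(self, name):
        self.name = name

    def __lt__(self, value):
        # column < value  ->  range query with an exclusive upper bound
        return {"range": {self.name: {"lt": value}}}

    def __gt__(self, value):
        # column > value  ->  range query with an exclusive lower bound
        return {"range": {self.name: {"gt": value}}}

    def __eq__(self, value):
        # column == value  ->  exact-match term query
        return {"term": {self.name: value}}


def build_search_body(condition, fields=None):
    """Wrap a condition in a bool filter and optionally restrict returned fields."""
    body = {"query": {"bool": {"filter": [condition]}}}
    if fields:
        body["_source"] = list(fields)  # projection, like select('name', 'age')
    return body


age = Column("age")
body = build_search_body(age < 25, fields=("name", "age"))
```

Here `body` is an ordinary dictionary in Elasticsearch query DSL form, which could be sent as the body of a `_search` request.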


### RestClient

A `RestClient` talks to the default Elasticsearch REST API:

```python
>>> from pandasticsearch import RestClient, Select
>>> client = RestClient('http://localhost:9200', 'recruit/resume/_search')
>>> result_dict = client.post(data={"query": {"match_all": {}}})
>>> Select.from_dict(result_dict)
Select: 3 rows
```

It can also talk to [Elasticsearch-SQL](https://github.com/NLPchina/elasticsearch-sql):

```python
>>> client = RestClient('http://localhost:9200', '_sql')
>>> result_dict = client.post(params={'sql': 'select * from table_name limit 3'})
>>> Select.from_dict(result_dict)
Select: 3 rows
```
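Both examples hand a raw response dictionary to `Select.from_dict`. As a rough idea of the work involved, the sketch below pulls the documents out of a search response by hand. The `hits_to_rows` helper and the sample response are hypothetical and only mirror the usual `hits.hits[]._source` layout of an Elasticsearch search response. They say nothing about how `Select` is implemented.

```python
def hits_to_rows(response):
    """Return one dict per hit, built from each hit's `_source` document."""
    hits = response.get("hits", {}).get("hits", [])
    return [dict(hit.get("_source", {})) for hit in hits]


# Hypothetical search response, trimmed to the fields used here.
response = {
    "took": 1,
    "hits": {
        "total": 3,
        "hits": [
            {"_id": "1", "_source": {"name": "Alice", "age": 12, "gender": "female"}},
            {"_id": "2", "_source": {"name": "Bob", "age": 11, "gender": "male"}},
            {"_id": "3", "_source": {"name": "Leo", "age": 13, "gender": "male"}},
        ],
    },
}

rows = hits_to_rows(response)
# Column names are the union of the keys seen across all rows.
columns = sorted({key for row in rows for key in row})
```

A response with no hits yields an empty list, so the helper can be applied to any search result without extra checks.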

### Use with Another Python Client

Pandasticsearch can also be used with another full-featured Python client:

* [elasticsearch-py](https://github.com/elastic/elasticsearch-py) (Official)
* [pyelasticsearch](https://github.com/pyelasticsearch/pyelasticsearch)
* [pyes](https://github.com/aparo/pyes)

```python
>>> from elasticsearch import Elasticsearch
>>> from pandasticsearch import Select
>>> es = Elasticsearch('http://localhost:9200')
>>> result_dict = es.search(index="recruit", body={"query": {"match_all": {}}})
>>> Select.from_dict(result_dict)
Select: 10 rows
```


## Related Articles

* [Spark and Elasticsearch for real-time data analysis](https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-35-Leau.pdf)


## LICENSE

MIT
