A Pandastic Elasticsearch client for data analysis.

## Pandasticsearch = Elasticsearch + Pandas DataFrame

Pandasticsearch is a lightweight Elasticsearch client built for data analysis. It converts query results into
[Pandas](http://pandas.pydata.org) DataFrame objects, which gives direct access to Elasticsearch's
analysis results, e.g. multi-level nested aggregations. Elasticsearch is strong at real-time indexing,
search and data analysis, but the results returned by its REST API still require processing before
data scientists can analyze them.
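To illustrate the kind of processing meant here, the following is a minimal sketch of flattening a two-level nested terms aggregation into flat rows by hand. It is not Pandasticsearch's implementation; the function name `flatten_buckets`, the aggregation names `by_gender` and `by_age`, and the sample response are all hypothetical, with the response hand-written in the usual `buckets` / `key` / `doc_count` layout.

```python
def flatten_buckets(aggs, parent=None):
    """Turn nested bucket aggregations into a flat list of row dicts."""
    parent = parent or {}
    rows = []
    for agg_name, agg_body in aggs.items():
        for bucket in agg_body.get("buckets", []):
            # Carry the keys of the enclosing buckets down into this row.
            row = dict(parent, **{agg_name: bucket["key"]})
            # A dict-valued entry that has "buckets" is a sub-aggregation.
            sub_aggs = {k: v for k, v in bucket.items()
                        if isinstance(v, dict) and "buckets" in v}
            if sub_aggs:
                rows.extend(flatten_buckets(sub_aggs, row))
            else:
                row["doc_count"] = bucket["doc_count"]
                rows.append(row)
    return rows


# Hand-written sample (an assumption), shaped like a terms-aggregation response.
response = {
    "aggregations": {
        "by_gender": {
            "buckets": [
                {"key": "male", "doc_count": 2,
                 "by_age": {"buckets": [{"key": 11, "doc_count": 1},
                                        {"key": 13, "doc_count": 1}]}},
                {"key": "female", "doc_count": 1,
                 "by_age": {"buckets": [{"key": 12, "doc_count": 1}]}},
            ]
        }
    }
}

rows = flatten_buckets(response["aggregations"])
for row in rows:
    print(row)
```

Each resulting row is a plain dict with one key per aggregation level, so a list of them can be passed straight to `pandas.DataFrame(rows)`.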

To install:

```
pip3 install pandasticsearch
```

## Connect to ES

### High Level API

A `DataFrame` object accesses Elasticsearch through a high-level API, similar to [elasticsearch-dsl-py](https://github.com/elastic/elasticsearch-dsl-py).
It is type-safe, easy to use and Pandas-flavored.

```python
# create a DataFrame object
>>> from pandasticsearch import DataFrame
>>> df = DataFrame.from_es('http://localhost:9200', index='people')
>>> df.columns
['name', 'age', 'gender']
>>> df.printSchema()
company
|-- employee
  |-- name: {'index': 'not_analyzed', 'type': 'string'}
  |-- age: {'type': 'integer'}
  |-- gender: {'index': 'not_analyzed', 'type': 'string'}

# filter
>>> df.filter(df['age'] < 25).collect()
[Row(age=12,gender='female',name='Alice'), Row(age=11,gender='male',name='Bob'), Row(age=13,gender='male',name='Leo')]

# projection
>>> df.filter(df['age'] < 25).select('name', 'age').collect()
[Row(age=12,name='Alice'), Row(age=11,name='Bob'), Row(age=13,name='Leo')]

# print the rows into console
>>> df.filter(df['age'] < 25).select('name').show(3)
+------+
| name |
+------+
| Alice|
| Bob  |
| Leo  |
+------+

# aggregation
>>> from pandasticsearch import Avg
>>> df[df['gender'] == 'male'].agg(Avg('age')).collect()
[Row(avg(age)=12)]

# convert to Pandas object for subsequent analysis
>>> df[df['gender'] == 'male'].agg(Avg('age')).to_pandas()
   avg(age)
0        12
```
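For readers who want to relate the expressions above to raw Elasticsearch requests, here is a rough sketch of a query DSL body that a filter such as `df['age'] < 25` followed by `select('name', 'age')` could correspond to. This is an illustration only, not a description of what Pandasticsearch sends internally; the helpers `range_filter`, `term_filter` and `search_body` are hypothetical, and the `bool`/`filter` form assumes Elasticsearch 2.0 or later.

```python
import json


def range_filter(field, op, value):
    """Build a range query; op is one of 'lt', 'lte', 'gt', 'gte'."""
    if op not in ("lt", "lte", "gt", "gte"):
        raise ValueError("unsupported operator: %s" % op)
    return {"range": {field: {op: value}}}


def term_filter(field, value):
    """Build an exact-match term query, e.g. gender == 'male'."""
    return {"term": {field: value}}


def search_body(filters, fields=None, size=20):
    """Combine filters into a search body, optionally limiting returned fields."""
    body = {"query": {"bool": {"filter": list(filters)}}, "size": size}
    if fields is not None:
        body["_source"] = list(fields)  # projection: only return these fields
    return body


# Roughly: filter age < 25, then project onto name and age.
body = search_body([range_filter("age", "lt", 25)], fields=["name", "age"])
print(json.dumps(body, indent=2))
```

A body like this could be posted to an index's `_search` endpoint with any HTTP or Elasticsearch client.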


### RestClient

A `RestClient` talks to the default Elasticsearch REST API:

```python
>>> from pandasticsearch import RestClient, Select
>>> client = RestClient('http://localhost:9200', 'recruit/resume/_search')
>>> result_dict = client.post(data={"query": {"match_all": {}}})
>>> Select.from_dict(result_dict)
Select: 3 rows
```
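As a point of comparison, the sketch below shows what pulling rows out of a raw `_search` response looks like without a helper such as `Select`. The function `hits_to_rows` and the sample `result_dict` are hypothetical and hand-written; the sample uses the integer `hits.total` form found in Elasticsearch versions before 7.0, and whether `Select.from_dict` works the same way internally is not claimed here.

```python
def hits_to_rows(result_dict):
    """Return one flat dict per hit, built from its `_source` document."""
    hits = result_dict.get("hits", {}).get("hits", [])
    # Keep the document id alongside the source fields.
    return [dict(hit.get("_source", {}), _id=hit.get("_id")) for hit in hits]


# Hand-written sample response (an assumption, not real server output).
result_dict = {
    "took": 1,
    "timed_out": False,
    "hits": {
        "total": 3,
        "hits": [
            {"_index": "recruit", "_type": "resume", "_id": "1",
             "_source": {"name": "Alice", "age": 12}},
            {"_index": "recruit", "_type": "resume", "_id": "2",
             "_source": {"name": "Bob", "age": 11}},
            {"_index": "recruit", "_type": "resume", "_id": "3",
             "_source": {"name": "Leo", "age": 13}},
        ],
    },
}

rows = hits_to_rows(result_dict)
for row in rows:
    print(row)
```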

It can also talk to [Elasticsearch-SQL](https://github.com/NLPchina/elasticsearch-sql):

```python
>>> client = RestClient('http://localhost:9200', '_sql')
>>> result_dict = client.post(params={'sql': 'select * from table_name limit 3'})
>>> Select.from_dict(result_dict)
Select: 3 rows
```

### Use with Another Python Client

Pandasticsearch can also be used together with another full-featured Python client:

* [elasticsearch-py](https://github.com/elastic/elasticsearch-py) (Official)
* [pyelasticsearch](https://github.com/pyelasticsearch/pyelasticsearch)
* [pyes](https://github.com/aparo/pyes)

```python
>>> from elasticsearch import Elasticsearch
>>> from pandasticsearch import Select
>>> es = Elasticsearch('http://localhost:9200')
>>> result_dict = es.search(index="recruit", body={"query": {"match_all": {}}})
>>> Select.from_dict(result_dict)
Select: 10 rows
```


## Related Articles

* [Spark and Elasticsearch for real-time data analysis](https://spark-summit.org/2015-east/wp-content/uploads/2015/03/SSE15-35-Leau.pdf)


## LICENSE

MIT
