Elasticmagic query filters for attributes
Library to store, filter and build facets for custom attributes
The problem
Each attribute pair can be stored in the index as a nested document. We can use the following mapping for that:
attrs:
  type: nested
  properties:
    attr_id:
      type: integer
    # usually only one of the next fields should be populated
    value_int:
      type: integer
    value_bool:
      type: boolean
    value_float:
      type: float
This makes it possible to filter documents by an attribute id and its value (for example we want to find all the documents with attr_id = 1234 and value = 5678):
query:
  bool:
    filter:
    - nested:
        path: attrs
        query:
          bool:
            must:
            - term:
                attrs.attr_id: 1234
            - term:
                attrs.value_int: 5678
It is also possible to build facets for all attributes at once:
aggs:
  attrs_nested:
    nested:
      path: attrs
    aggs:
      attrs:
        terms:
          field: attrs.attr_id
        aggs:
          values:
            terms:
              field: attrs.value_int
or for a single attribute:
aggs:
  attrs_nested:
    nested:
      path: attrs
    aggs:
      attr_1234:
        filter:
          term:
            attrs.attr_id: 1234
        aggs:
          values:
            terms:
              field: attrs.value_int
But nested documents have some drawbacks. Every nested document is stored in the index as a separate document. For instance, the following document will be stored as 5 Lucene documents (one for the main document and one for each of its four attributes):
name: "I'm a document with nested attributes"
attrs:
- attr_id: 1
  value_int: 42
- attr_id: 2
  value_int: 43
- attr_id: 3
  value_bool: true
- attr_id: 4
  value_float: 99.9
Nested queries are slow by themselves:
In particular, joins should be avoided. nested can make queries several times slower and parent-child relations can make queries hundreds of times slower.
Worse, regular queries also become slower when the index contains nested documents, because all the fields of the main documents become sparse. This in turn degrades the performance of all filters and of access to doc_values.
The solution
The idea is to encode a pair of an attribute id and its corresponding value into a single value. If our attribute ids are 32-bit integers and all value types also fit into 32 bits, we can store each pair as a single 64-bit value.
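The packing idea can be sketched in a few lines of plain Python. This is only an illustration and not the library's own `merge_attr_value_int`; the names `pack_int` and `unpack_int` are hypothetical, and the sketch assumes non-negative values and attribute ids below 2**31, so that the result stays within a signed 64-bit `long`.

```python
from typing import Tuple

VALUE_BITS = 32
VALUE_MASK = (1 << VALUE_BITS) - 1


def pack_int(attr_id: int, value: int) -> int:
    """Pack an attribute id and a 32-bit value into one integer (hypothetical helper)."""
    # Assumption: ids are limited to 31 bits to keep the result a positive signed long.
    if not 0 <= attr_id < (1 << 31):
        raise ValueError("attr_id must fit into 31 bits")
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("value must fit into 32 bits")
    return (attr_id << VALUE_BITS) | value


def unpack_int(packed: int) -> Tuple[int, int]:
    """Split a packed integer back into (attr_id, value)."""
    return packed >> VALUE_BITS, packed & VALUE_MASK


packed = pack_int(1234, 5678)
print(hex(packed))         # 0x4d20000162e
print(unpack_int(packed))  # (1234, 5678)
```

The attribute id occupies the high bits, so all values of one attribute form a contiguous range of packed integers.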
So our mapping can be:
attrs:
  type: object
  properties:
    int:
      type: long
    bool:
      type: long
    float:
      type: long
Document with encoded attributes:
name: "I'm a document with packed attributes"
attrs:
# (1 << 32) | 42
- int: 0x1_0000002a
# (2 << 32) | 43
- int: 0x2_0000002b
# (3 << 1) | 1
- bool: 0x7
# (4 << 32) | {integer representation of 99.9}
# (4 << 32) | struct.unpack('=I', struct.pack('=f', 99.9))[0]
- float: 0x4_42c7cccd
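The boolean and float encodings used in the document above can be reproduced with the standard `struct` module. The helper names below (`pack_bool`, `pack_float`, `unpack_float`) are hypothetical and only mirror the comments in the example; they are not the library's `merge_attr_value_*` functions.

```python
import struct
from typing import Tuple


def pack_bool(attr_id: int, value: bool) -> int:
    """A boolean needs a single bit, so the id is shifted by one."""
    return (attr_id << 1) | int(value)


def pack_float(attr_id: int, value: float) -> int:
    """Store the IEEE 754 single-precision bits of the value in the low 32 bits."""
    (bits,) = struct.unpack("=I", struct.pack("=f", value))
    return (attr_id << 32) | bits


def unpack_float(packed: int) -> Tuple[int, float]:
    """Recover (attr_id, value); the value comes back with float32 precision."""
    (value,) = struct.unpack("=f", struct.pack("=I", packed & 0xFFFFFFFF))
    return packed >> 32, value


print(hex(pack_bool(3, True)))   # 0x7
print(hex(pack_float(4, 99.9)))  # 0x442c7cccd
```

Because the value is narrowed to single precision, decoding returns a number close to, but not exactly equal to, the original Python float.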
Now with a bit of bit magic we can emulate nested queries.
Filtering by attribute id 1234 with value 5678:
query:
  bool:
    filter:
    - term:
        attrs.int: 0x4d2_0000162e
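As a rough sketch, such filters can be built as plain query dictionaries. The function names are hypothetical, and the range-based "any value of this attribute" filter is an assumption about how the packed layout can be exploited, not something taken from the library. It relies on the attribute id sitting in the high 32 bits, so all values of one attribute fall into one contiguous range.

```python
def attr_term_filter(field: str, attr_id: int, value: int) -> dict:
    """Match documents that have exactly this (attr_id, value) pair."""
    return {"term": {field: (attr_id << 32) | value}}


def attr_any_value_filter(field: str, attr_id: int) -> dict:
    """Match documents that have any value for the attribute (assumed approach)."""
    low = attr_id << 32
    high = low | 0xFFFFFFFF
    return {"range": {field: {"gte": low, "lte": high}}}


print(attr_term_filter("attrs.int", 1234, 5678))
print(attr_any_value_filter("attrs.int", 1234))
```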
Building a facet for all attribute values:
aggs:
  attrs_int:
    terms:
      field: attrs.int
      # specify a big enough aggregation size
      # so that all flat attribute values fit
      size: 10000
One more step the client has to perform is to decode the values and group them by attribute id.
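A minimal sketch of that client-side step, assuming the buckets come from a `terms` aggregation response (each bucket has a `key` and a `doc_count`) and that the keys are packed integer attributes. The function name `group_buckets` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def group_buckets(buckets: Iterable[dict]) -> Dict[int, Dict[int, int]]:
    """Turn flat packed buckets into {attr_id: {value: doc_count}}."""
    grouped: Dict[int, Dict[int, int]] = defaultdict(dict)
    for bucket in buckets:
        packed = bucket["key"]
        attr_id, value = packed >> 32, packed & 0xFFFFFFFF
        grouped[attr_id][value] = bucket["doc_count"]
    return dict(grouped)


buckets = [
    {"key": 0x1_0000002A, "doc_count": 1},
    {"key": 0x2_0000002B, "doc_count": 1},
]
print(group_buckets(buckets))  # {1: {42: 1}, 2: {43: 1}}
```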
How to use it in Python
from elasticsearch import Elasticsearch
from elasticmagic import Cluster, Document, Field
from elasticmagic.types import List, Long
from elasticmagic.ext.queryfilter import QueryFilter
from elasticmagic_qf_attrs import AttrBoolFacetFilter
from elasticmagic_qf_attrs import AttrIntFacetFilter
from elasticmagic_qf_attrs import AttrRangeFacetFilter
from elasticmagic_qf_attrs.util import merge_attr_value_bool
from elasticmagic_qf_attrs.util import merge_attr_value_float
from elasticmagic_qf_attrs.util import merge_attr_value_int
# Specify document
class AttrsDocument(Document):
    __doc_type__ = 'attrs'

    ints = Field(List(Long))
    bools = Field(List(Long))
    floats = Field(List(Long))
# Create an index
index_name = 'test-attrs'
client = Elasticsearch()
client.indices.create(index=index_name)
cluster = Cluster(client)
index = cluster.get_index(index_name)
index.put_mapping(AttrsDocument)
# Index example document
index.add([
    AttrsDocument(
        ints=[
            merge_attr_value_int(1, 42),
            merge_attr_value_int(2, 43),
        ],
        bools=[merge_attr_value_bool(3, True)],
        floats=[merge_attr_value_float(4, 99.9)],
    ),
], refresh=True)
# Define a query filter
class AttrsQueryFilter(QueryFilter):
    ints = AttrIntFacetFilter(AttrsDocument.ints, alias='a')
    bools = AttrBoolFacetFilter(AttrsDocument.bools, alias='a')
    ranges = AttrRangeFacetFilter(AttrsDocument.floats, alias='a')
# Now we can build facets
qf = AttrsQueryFilter()
sq = index.search_query()
sq = qf.apply(sq, {})
res = sq.get_result()
assert res.total == 1
qf_res = qf.process_result(res)
# And finally let's print the results
for attr_id, facet in qf_res.ints.facets.items():
    print(f'> {attr_id}:')
    for facet_value in facet.all_values:
        print(f' {facet_value.value}: ({facet_value.count_text})')
for attr_id, facet in qf_res.bools.facets.items():
    print(f'> {attr_id}:')
    for facet_value in facet.all_values:
        print(f' {facet_value.value}: ({facet_value.count_text})')
for attr_id, facet in qf_res.ranges.facets.items():
    print(f'> {attr_id}: ({facet.count})')
# Also we can filter documents:
qf = AttrsQueryFilter()
sq = index.search_query()
sq = qf.apply(
    sq,
    {
        'a1': '42',
        'a3': 'true',
        'a4__lte': '100',
    }
)
res = sq.get_result()
assert res.total == 1

qf = AttrsQueryFilter()
sq = index.search_query()
sq = qf.apply(
    sq,
    {
        'a4__gte': '100',
    }
)
res = sq.get_result()
assert res.total == 0
This script should print:
> 1:
 42: (1)
> 2:
 43: (1)
> 3:
 True: (1)
> 4: (1)