Elasticmagic query filters for attributes

Library to store, filter and build facets for custom attributes

The problem

Each attribute pair can be stored in the index as a nested document. We can use the following mapping for that:

attrs:
  type: nested
  properties:
    attr_id:
      type: integer
    # usually only one of the next fields should be populated
    value_int:
      type: integer
    value_bool:
      type: boolean
    value_float:
      type: float

This makes it possible to filter documents by an attribute id and its value (for example, to find all documents with attr_id = 1234 and value = 5678):

query:
  bool:
    filter:
    - nested:
        path: attrs
        query:
          bool:
            must:
            - term:
                attrs.attr_id: 1234
            - term:
                attrs.value_int: 5678

It is also possible to build facets for all attributes at once:

aggs:
  attrs_nested:
    nested:
      path: attrs
    aggs:
      attrs:
        terms:
          field: attrs.attr_id
        aggs:
          values:
            terms:
              field: attrs.value_int

or for a single attribute:

aggs:
  attrs_nested:
    nested:
      path: attrs
    aggs:
      attr_1234:
        filter:
          term:
            attrs.attr_id: 1234
        aggs:
          values:
            terms:
              field: attrs.value_int

But nested documents have some drawbacks. Every nested document is stored in the index as a separate document. For instance, the following document will be stored as 5 Lucene documents (the main document plus one per attribute):

name: "I'm a document with nested attributes"
attrs:
- attr_id: 1
  value_int: 42
- attr_id: 2
  value_int: 43
- attr_id: 3
  value_bool: true
- attr_id: 4
  value_float: 99.9

Nested queries are slow by themselves:

In particular, joins should be avoided. nested can make queries several times slower and parent-child relations can make queries hundreds of times slower.

What is worse, regular queries are also slower when there are nested documents in the index. This is because all the fields of the main documents become sparse, which in turn degrades the performance of all filters and of accesses to doc_values.

The solution

The idea is to encode an attribute id and its corresponding value into a single value. If our attribute ids are 32-bit integers and all value types also fit into 32 bits, we can store each pair as a single 64-bit value: the attribute id in the high 32 bits and the value in the low 32 bits.

So our mapping can be:

attrs:
  type: object
  properties:
    int:
      type: long
    bool:
      type: long
    float:
      type: long

Document with encoded attributes:

name: "I'm a document with packed attributes"
attrs:
# (1 << 32) | 42
- int: 0x1_0000002a
# (2 << 32) | 43
- int: 0x2_0000002b
# (3 << 1) | 1
- bool: 0x7
# (4 << 32) | {integer representation of 99.9}
# (4 << 32) | struct.unpack('=I', struct.pack('=f', 99.9))[0]
- float: 0x4_42c7cccd
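
The packing shown in the comments above can be sketched in Python as follows. This is only an illustration of the bit layout: the function names `pack_int`, `pack_bool` and `pack_float` are hypothetical and are not the library's own helpers, and integer values are assumed to fit into 32 bits.

```python
import struct

VALUE_BITS = 32          # width reserved for the value
VALUE_MASK = 0xFFFFFFFF  # low 32 bits


def pack_int(attr_id: int, value: int) -> int:
    """Attribute id in the high 32 bits, integer value in the low 32 bits."""
    return (attr_id << VALUE_BITS) | (value & VALUE_MASK)


def pack_bool(attr_id: int, value: bool) -> int:
    """A boolean needs a single bit, so the id is shifted by one only."""
    return (attr_id << 1) | int(value)


def pack_float(attr_id: int, value: float) -> int:
    """Reinterpret the float32 bits as an unsigned 32-bit integer, then pack."""
    bits = struct.unpack('=I', struct.pack('=f', value))[0]
    return (attr_id << VALUE_BITS) | bits


print(hex(pack_int(1, 42)))
print(hex(pack_bool(3, True)))
print(hex(pack_float(4, 99.9)))
```

The three printed values correspond to the `int`, `bool` and `float` entries of the example document.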

Now with a bit of bit magic we can emulate nested queries.

Filtering by attribute id 1234 with value 5678:

query:
  bool:
    filter:
    - term:
        attrs.int: 0x4d2_0000162e
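
As a rough sketch, the body of such a filter can be built from plain Python dicts. The helper names below are hypothetical. The second helper rests on an assumption that is not stated in this description: because the attribute id occupies the high bits, a range over the packed values should match any value of a given attribute.

```python
VALUE_BITS = 32
VALUE_MASK = 0xFFFFFFFF


def attr_int_term_filter(field: str, attr_id: int, value: int) -> dict:
    """Term filter matching one exact (attribute id, value) pair."""
    packed = (attr_id << VALUE_BITS) | (value & VALUE_MASK)
    return {'term': {field: packed}}


def attr_any_value_filter(field: str, attr_id: int) -> dict:
    """Range filter covering every packed value that starts with attr_id.

    Assumption: all values of one attribute occupy a contiguous
    interval of packed numbers, from id << 32 to (id << 32) | 0xFFFFFFFF.
    """
    low = attr_id << VALUE_BITS
    high = low | VALUE_MASK
    return {'range': {field: {'gte': low, 'lte': high}}}


query = {
    'query': {
        'bool': {
            'filter': [attr_int_term_filter('attrs.int', 1234, 5678)],
        },
    },
}
print(query)
```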

Building facet for all attribute values:

aggs:
  attrs_int:
    terms:
      field: attrs.int
      # specify a big enough aggregation size
      # so that all flat attribute values fit
      size: 10000

One more step is left to the client: decoding the bucket keys and grouping the values by attribute id.
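
A minimal sketch of that client-side step, assuming the usual shape of terms aggregation buckets (a list of objects with `key` and `doc_count`) and non-negative integer values. The function name `group_int_buckets` is hypothetical and is not part of the library.

```python
from collections import defaultdict

VALUE_BITS = 32
VALUE_MASK = 0xFFFFFFFF


def group_int_buckets(buckets):
    """Split each packed key into (attr_id, value) and group counts by id."""
    facets = defaultdict(dict)
    for bucket in buckets:
        key = bucket['key']
        attr_id = key >> VALUE_BITS   # high 32 bits
        value = key & VALUE_MASK      # low 32 bits
        facets[attr_id][value] = bucket['doc_count']
    return dict(facets)


# Example buckets as they might appear under aggregations.attrs_int.buckets
buckets = [
    {'key': 0x1_0000002a, 'doc_count': 3},
    {'key': 0x1_0000002b, 'doc_count': 1},
    {'key': 0x2_0000002b, 'doc_count': 2},
]
print(group_int_buckets(buckets))
```

Each attribute id then maps to its own value counts, which is the same shape a nested per-attribute facet would give.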

How to use it in Python

from elasticsearch import Elasticsearch

from elasticmagic import Cluster, Document, Field
from elasticmagic.types import List, Long
from elasticmagic.ext.queryfilter import QueryFilter

from elasticmagic_qf_attrs import AttrBoolFacetFilter
from elasticmagic_qf_attrs import AttrIntFacetFilter
from elasticmagic_qf_attrs import AttrRangeFacetFilter
from elasticmagic_qf_attrs.util import merge_attr_value_bool
from elasticmagic_qf_attrs.util import merge_attr_value_float
from elasticmagic_qf_attrs.util import merge_attr_value_int

# Specify document
class AttrsDocument(Document):
    __doc_type__ = 'attrs'

    ints = Field(List(Long))
    bools = Field(List(Long))
    floats = Field(List(Long))

# Create an index
index_name = 'test-attrs'
client = Elasticsearch()
client.indices.create(index=index_name)
cluster = Cluster(client)
index = cluster.get_index(index_name)
index.put_mapping(AttrsDocument)

# Index example document
index.add([
    AttrsDocument(
        ints=[
            merge_attr_value_int(1, 42),
            merge_attr_value_int(2, 43),
        ],
        bools=[merge_attr_value_bool(3, True)],
        floats=[merge_attr_value_float(4, 99.9)],
    ),
], refresh=True)

# Define a query filter
class AttrsQueryFilter(QueryFilter):
    ints = AttrIntFacetFilter(AttrsDocument.ints, alias='a')
    bools = AttrBoolFacetFilter(AttrsDocument.bools, alias='a')
    ranges = AttrRangeFacetFilter(AttrsDocument.floats, alias='a')

# Now we can build facets
qf = AttrsQueryFilter()
sq = index.search_query()
sq = qf.apply(sq, {})
res = sq.get_result()
assert res.total == 1
qf_res = qf.process_result(res)

# And finally let's print the results
for attr_id, facet in qf_res.ints.facets.items():
    print(f'> {attr_id}:')
    for facet_value in facet.all_values:
        print(f'  {facet_value.value}: ({facet_value.count_text})')

for attr_id, facet in qf_res.bools.facets.items():
    print(f'> {attr_id}:')
    for facet_value in facet.all_values:
        print(f'  {facet_value.value}: ({facet_value.count_text})')

for attr_id, facet in qf_res.ranges.facets.items():
    print(f'> {attr_id}: ({facet.count})')

# Also we can filter documents:
qf = AttrsQueryFilter()
sq = index.search_query()
sq = qf.apply(
    sq,
    {
        'a1': '42',
        'a3': 'true',
        'a4__lte': '100',
    }
)
res = sq.get_result()
assert res.total == 1

qf = AttrsQueryFilter()
sq = index.search_query()
sq = qf.apply(
    sq,
    {
        'a4__gte': '100',
    }
)
res = sq.get_result()
assert res.total == 0

This script should print:

> 1:
  42: (1)
> 2:
  43: (1)
> 3:
  True: (1)
> 4: (1)
