Opinionated wrapper round PySolr
PySolaar
A highly opinionated, Django-like wrapper round PySolr, for when you want to ridiculously de-normalise some complex data at runtime, and then query it with a pretty interface.
Features
Managing your Solr data
- Create document types with a Django-like class-based approach
- Define how a document instance should get its data. It's just a function you define, so how you get the data is up to you: from a database, by mushing together data from different Django classes, by HTTP request...
- Define a function for PySolaar to call in order to generate the documents...
- Declaratively define the structure of your documents, if you want to do any complex embedding or reuse documents.
- Press go!
- PySolaar automatically encodes, embeds, etc. all the data and pushes it to Solr.
- Fields are "namespaced" to document-type (like Python's `__name_mangling`), so no clashes! The `id` field's value is also prefixed with the class name, so it's unique for a specific class.
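As a rough illustration of the namespacing idea, field names and `id` values could be prefixed with the class name as below. This is only a sketch: the `namespace_fields` helper and the `Type__field` / `Type##value` formats are invented here and are not PySolaar's actual scheme.

```python
def namespace_fields(doc_type: str, fields: dict) -> dict:
    """Prefix every field name with the document type, and prefix the id value too.

    The "Type__field" and "Type##value" formats are assumptions made for this sketch.
    """
    namespaced = {}
    for key, value in fields.items():
        if key == "id":
            # The id field keeps its name, but its value becomes unique per type
            namespaced["id"] = f"{doc_type}##{value}"
        else:
            namespaced[f"{doc_type}__{key}"] = value
    return namespaced


person = namespace_fields("Person", {"id": "1", "name": "Claudius"})
pet = namespace_fields("Pets", {"id": "1", "name": "Gordon"})
```

Both documents started with `id="1"` and a `name` field, but after namespacing neither the ids nor the field names collide.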
Querying your Solr data
- A nice Django QuerySet-like approach (wraps a modified version of SolrQ, which you can also use for more complex queries)
- Automatically prefixes all the queries to deal with the name-mangling.
- Declaratively define the document you want returned, including a set of transformations for unpacking data, turning dates back into Python `datetime` objects, or whatever you want.
- Lazy query evaluation
Basics
Creating PySolaar Documents
PySolaar allows documents to be defined using Django-like classes, which represent entity types:
- Create an entity type, by subclassing `PySolaar`.
- Define a `build_document` method:
  - This is "how you get the data to index".
  - This method defines a single document or set of documents that correspond to a single identifier.
  - It should take an identifier as an argument, and return a `self.Document` object or an iterable of `self.Document` objects.
  - Pass your data as keyword arguments to the `self.Document` initialiser or unpack a dict.
- Define a `build_document_set` method that iterates through a series of identifiers and yields the result of calling `self.build_document` for each identifier.
from pysolaar import PySolaar


# Create a thing that inherits from PySolaar
class Person(PySolaar):
    # Write a `build_document` method -- this gets the data corresponding
    # to a particular value of `identifier`
    def build_document(self, identifier):
        # Return an instance of self.Document containing the data
        return self.Document(
            id="person-{}".format(identifier),
            name="Claudius the {}".format(identifier),
            height=100 * identifier,
            # It's odd not to have a moustache!
            moustache="Yes" if identifier % 2 == 0 else "No",
        )

    # Write `build_document_set` that produces an iterator of Person.build_document calls
    def build_document_set(self):
        for identifier in [1, 2, 3, 4, 5]:
            yield self.build_document(identifier)


# Configure PySolaar by setting up the underlying PySolr instance
PySolaar.configure_pysolr("http://your-solr-instance")

# Then run PySolaar.update() to push data to Solr
PySolaar.update()
When `PySolaar.update()` is called, PySolaar goes through all its subclasses' `build_document_set` functions, in order to generate the documents, and then pushes them to Solr. Obviously, there's a reasonable amount of magic.
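Part of that magic can be approximated with Python's built-in `__subclasses__()`. The sketch below is not PySolaar's implementation: `Base` is a hypothetical stand-in for `PySolaar`, documents are plain dicts, and the collected documents are returned rather than pushed to Solr.

```python
class Base:
    @classmethod
    def update(cls):
        """Collect documents from every direct subclass's build_document_set."""
        documents = []
        # __subclasses__() lists the direct subclasses in definition order
        for subclass in cls.__subclasses__():
            documents.extend(subclass().build_document_set())
        return documents  # a real implementation would push these to Solr


class Person(Base):
    def build_document_set(self):
        for identifier in [1, 2]:
            yield {"id": f"person-{identifier}"}


class Pets(Base):
    def build_document_set(self):
        yield {"id": "pet-1"}


docs = Base.update()
```

Defining a new subclass is enough to have its documents included in the next `update()` call; no explicit registration step is needed.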
Querying the data
PySolaar provides a Django-like interface for querying data. Alternatively, pass a SolrQ object to `.filter()`.
from pysolaar import Q
# `Person` is the class defined in the previous example
# Get all the persons
persons = Person.all()
# Filter by anything ...
claudiuses = Person.filter(name="Claudius")
# ... and chain QuerySets as in Django
tall_with_moustache = claudiuses.filter(height__gt=250, moustache="Yes")
# ... or use a Q object
either_tall_or_moustache = claudiuses.filter( Q(height__gt=250) | Q(moustache="Yes") )
# ... and paginate
first_page = tall_with_moustache.paginate(page_size=2, page_number=0)
# Results aren't evaluated until you need them:
tall_with_moustache.count() # -> 1 (with the example data, only identifier 4 is both taller than 250 and has a moustache)
for c in first_page:
    print(c["id"])
# And a few other features: see the Advanced section.
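The lazy evaluation used in the querying example can be pictured with a toy class. This is a sketch of the general pattern only: `LazyQuery`, its in-memory `rows`, and the `evaluations` counter are all invented for illustration and are not part of PySolaar.

```python
class LazyQuery:
    """Toy stand-in for a lazily evaluated query set (not PySolaar's class)."""

    def __init__(self, rows, filters=None):
        self._rows = rows
        self._filters = filters or {}
        self.evaluations = 0  # counts how often the "backend" is actually hit

    def filter(self, **kwargs):
        # Chaining only records the extra conditions and returns a new object
        return LazyQuery(self._rows, {**self._filters, **kwargs})

    def __iter__(self):
        # The query only runs here, when the results are actually needed
        self.evaluations += 1
        for row in self._rows:
            if all(row.get(k) == v for k, v in self._filters.items()):
                yield row


rows = [
    {"name": "Claudius", "moustache": "Yes"},
    {"name": "Claudius", "moustache": "No"},
]
query = LazyQuery(rows).filter(name="Claudius").filter(moustache="Yes")
evaluations_before = query.evaluations  # still 0: nothing has run yet
results = list(query)                   # iteration triggers the evaluation
```

Because each `filter()` call returns a new object, intermediate queries can be reused and extended without affecting one another.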
Restrictions when defining documents
Most of the restrictions here stem from the limitations of Solr and the PySolr library.
A single field can contain:
- A value (string, int, datetime, etc.)
- A `list` of values (probably a `set` as well)
- A `dict`, whose values are either more dicts or values or lists of values. (Dicts are collapsed down to single-value fields using double underscores, i.e. `field={"one": "something", "two": "something else"}` becomes `field__one="something"` and `field__two="something else"`, to an arbitrary depth!)
- NOT a list of dicts. To index a list of associated values (e.g. lists of dicts), instead use Child Documents.
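The double-underscore collapsing of dicts described above can be sketched as a small recursive function. The `flatten` helper is hypothetical and only illustrates the naming rule; PySolaar's own encoding may differ in detail.

```python
def flatten(fields: dict, prefix: str = "") -> dict:
    """Collapse nested dicts into flat keys joined by double underscores."""
    flat = {}
    for key, value in fields.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the accumulated field name down as the prefix
            flat.update(flatten(value, name))
        else:
            # Plain values and lists of values are kept as they are
            flat[name] = value
    return flat


flat = flatten({"field": {"one": "something", "two": {"deep": [1, 2]}}})
# flat == {"field__one": "something", "field__two__deep": [1, 2]}
```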
Advanced features
The `Meta` class (as borrowed from Django)
Each class can define a `Meta` class, which can be used to declaratively define a number of aspects regarding how data is stored in Solr. (The `Meta` class is passed around in the background to apply settings where appropriate.)
Using the `Meta` class, you can:
- Independently from the data-definition method (`build_document`), declare a `store_document_fields` structure, defining which fields should be pushed to Solr and in which format. This allows the `build_document` method to be a 'generic' method for getting whatever data is required and allows easy embedding and reuse.
- Define a `return_document_fields` structure to limit the fields that are returned from Solr (so you can have fields that are there just for querying, but that you don't need returned).
- Or (older version, probably will be deprecated), independently define lists of fields such as `fields_as_child_docs`.
Child documents
The Solr "child field" feature is used to allow one document type to be nested inside another.
To associate a nested document with a particular field, first define a `fields_as_child_docs` list in the `Meta` class and add the field name. Then set the value of the parent field to `Child.items([identifiers])`. See the example below.
from pysolaar import PySolaar


class Person(PySolaar):
    class Meta:
        # Define a Meta class with `fields_as_child_docs`
        # in order to declare a field is a child doc
        fields_as_child_docs = ["pets"]

    def build_document(self, identifier):
        return self.Document(
            id="thing-{}".format(identifier),
            name="Claudius",
            # Embed another document type by calling Class.items([identifiers])
            pets=Pets.items([1, 2, 3]),
        )


class Pets(PySolaar):
    def build_document(self, identifier):
        return self.Document(
            id="pet-{}".format(identifier),
            name="Gordon",
        )
In the background, `PySolaar.items` calls `build_document` with the listed identifiers, and embeds the document as a child (using PySolr's `_doc` keyword).
Nesting searchable documents works up to one level of embedding. After this, documents can be stored as JSON strings and recovered automatically, but not queried. After three levels of embedding a particular type, this embedding will stop (preventing infinite recursion).
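The "searchable at one level, JSON beyond that" behaviour can be sketched roughly as follows. This is an assumption-laden illustration: the `embed` helper works on plain dicts, ignores the three-level recursion cut-off, and is not how PySolaar is actually implemented.

```python
import json


def embed(document: dict, depth: int = 0) -> dict:
    """Keep first-level children as dicts; serialise anything deeper to JSON."""
    stored = {}
    for key, value in document.items():
        if isinstance(value, dict):
            if depth == 0:
                # First level of nesting: keep as a (searchable) child document
                stored[key] = embed(value, depth + 1)
            else:
                # Deeper nesting: store as a JSON string (recoverable, not queryable)
                stored[key] = json.dumps(value)
        else:
            stored[key] = value
    return stored


person = {"name": "Claudius", "pet": {"name": "Gordon", "owner": {"name": "Claudius"}}}
stored = embed(person)
recovered_owner = json.loads(stored["pet"]["owner"])  # back to a Python dict
```

The `pet` child keeps its structure, while the doubly nested `owner` is held as a string until it is unpacked again on the way out.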
store_document_fields
In general, the `build_document` function should define all the fields required for every use-case. The results of calling this function with a particular identifier are cached, so the data can be re-used if embedded as a child document elsewhere (n.b. the cache is cleared after an update!).
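The caching behaviour described here follows a familiar pattern, sketched below. `CachedBuilder` and its `calls` counter are hypothetical names used only to show the idea of building each identifier once and clearing the cache after an update.

```python
class CachedBuilder:
    """Caches the result of a build function per identifier (illustrative only)."""

    def __init__(self, build):
        self._build = build
        self._cache = {}
        self.calls = 0  # how many times the underlying build function ran

    def get(self, identifier):
        if identifier not in self._cache:
            self.calls += 1
            self._cache[identifier] = self._build(identifier)
        return self._cache[identifier]

    def clear(self):
        # Would be called once an update has finished
        self._cache.clear()


builder = CachedBuilder(lambda identifier: {"id": f"pet-{identifier}"})
first = builder.get(1)
second = builder.get(1)  # served from the cache; the build function is not re-run
calls_before_clear = builder.calls
builder.clear()
third = builder.get(1)   # rebuilt after the cache was cleared
```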
The `DocumentFields` class provides a convenient way to describe how the document should be stored in Solr.
Set the `store_document_fields` value in your class `Meta` to a `DocumentFields` instance where you list the fields you want to include (set them to `True`).
Also use this structure to declare child fields using `ChildDocument` (in place of using `fields_as_child_docs`, as above) and to control which fields in the `ChildDocument` are included.
This is useful (as in the example below) as we can have `Person` documents with their pets' names embedded as searchable child documents, but also have more detailed `Pets` documents (in turn with certain `owner` fields).
import datetime

from pysolaar import PySolaar, DocumentFields, ChildDocument


class Person(PySolaar):
    class Meta:
        store_document_fields = DocumentFields(
            name=True,
            school=True,
            work=True,
            has_pets=ChildDocument(  # Here we can start selecting fields from the Pets class
                name=True,
                weight=True,
                # date_of_birth=True ... we don't care about, so omit it
                owner=ChildDocument(
                    name=True,  # Here, we embed a Person instance again, this time just selecting the name
                    # n.b. this `owner` field will be converted to JSON for storage (see above)
                ),
            ),
        )

    def build_document(self, identifier):
        return self.Document(
            id="thing-{}".format(identifier),
            name="Claudius",
            school="St Somethings",
            work="Bus driver",
            has_pets=Pets.items([1, 2, 3]),
        )


class Pets(PySolaar):
    def build_document(self, identifier):
        return self.Document(
            id="pet-{}".format(identifier),
            name="Gordon",
            weight=123,
            date_of_birth=datetime.datetime(1996, 1, 3),
            # Re-embed the Person as the owner, as it might be useful
            # if we ever want a 'Pets' top-level document.
            owner=Person.items([f"owner_of_{identifier}"]),
        )
`store_document_fields` allows the following structures:
- `DocumentFields`: this is the root wrapper for the whole document
- `ChildDocument`: embeds another document (with the specified fields) as a Solr child document
- `JsonChildDocument`: embeds a child document by converting it to a JSON string. It can be returned and unpacked back to Python, but not queried in Solr (except as a hit-and-miss string-matching exercise...)
- `SplattedChildDocument`: embeds a child document as a list of searchable fields (it works recursively through the child document as a dict, accumulating all the values in a list). Useful for creating a searchable version of a child document, where you don't care about any field in particular, just matching something. Probably don't return this from Solr: unlike `JsonChildDocument`, it cannot be reverted back to anything particularly useful.
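The "splatting" idea, walking a child document recursively and accumulating every value into one list, can be sketched like this. The `splat` function is a hypothetical illustration over plain dicts and lists, not PySolaar's own code.

```python
def splat(document) -> list:
    """Recursively collect every leaf value of a nested document into one list."""
    values = []
    if isinstance(document, dict):
        for value in document.values():
            values.extend(splat(value))
    elif isinstance(document, (list, tuple)):
        for item in document:
            values.extend(splat(item))
    else:
        # A plain value: this is what ends up searchable
        values.append(document)
    return values


pet = {"name": "Gordon", "weight": 123, "owner": {"name": "Claudius", "jobs": ["Bus driver"]}}
searchable = splat(pet)
# searchable == ["Gordon", 123, "Claudius", "Bus driver"]
```

The field names are discarded along the way, which is why a splatted document is good for matching but cannot be turned back into its original structure.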