Searching and indexing based on zope.index
A Python indexing and searching system based on zope.index.
See the docs subdirectory for documentation or the online docs.
- Fix bug in query with names are None [ebrehault].
- The ‘numdocs’ return value from ‘Catalog.search’ is now a subclass of integer with an added attribute, ‘total’, which gives the caller access to the total size of the result set before being limited by the ‘limit’ argument.
- Fix docs rendering.
- Add “dev” and “docs” aliases to setup.cfg.
- Harmonized index name / discriminator in catalog setup with content schema and queries used in later examples. Thanks to Douglas Cerna.
- Fixed semantics of ‘applyEq’ for the Keyword index. The ‘applyEq’ (and ‘applyNotEq’) methods now expect scalar values for the query. ‘applyAny’, ‘applyAll’ and their respective negations should be used to query multiple values.
- Fixed bug in query parser where names inside of ‘any’ or ‘all’ calls were not being parsed properly.
- When using CQEs, names may now be embedded inside of a list or tuple, allowing expressions like this: “tags in any([tag1, tag2])” where tag1 and tag2 are names that will be resolved at query run time.
- Deprecated .search-based querying of catalog in favor of newer query language(s) via .query. See docs for more info.
- Added a .docids() method to CatalogIndex.
- Deprecated repoze.catalog.index.path2.CatalogPathIndex2.
- Refuse to pickle instances of zodb.broken.Broken which are returned as the result of an index attempting to call a callable discriminator function. This prevents edge cases where instances with out-of-sync code could too easily write Broken objects to the database, leaving them around for the next hapless unpickler to find. This is particularly a problem because many of our discriminators are functions; the Broken machinery attempts to recover the state of a broken object when it is unpickled, but it cannot recover the state of an attempted function call.
- Fix a bug which allowed the document map to map more than one docid onto the same address, which put the reverse and forward document map bookeeping out of sync with each other.
- Move benchmark directory into root of distribution instead of leaving it in package. It interfered with normal operations of the setuptools testrunner.
- Deactivate “sortbench” console script.
- Remove “Development Status” Trove classifier.
- Bump version to 0.7.2.
- Disuse nose.collector as test_suite (use normal setuptools test runner).
- Remove (unused) TODO.txt.
- Disuse ez_setup.py.
- Make docs render properly in Google Chrome.
- When using DocumentMap.add with a docid argument that references an existing docid with metadata, that metadata is now removed. Previously it was retained.
- Minimally get docs into shape for First PyPI release; no functionality changes from 0.7.0.
- Fixed bug in DocumentMap.add which left orphan mappings for previous addresses when adding an existing docid with a new address.
- Added the ability to sort by text relevance. Use the name of the text index as the sort_index in the query.
- Add metadata-related APIs to repoze.catalog.document.DocumentMap: add_metadata, remove_metadata, get_metadata. “Metadata” is a freeform set of key/value pairs related to a docid. See the API documentation for more information.
- Fixed constructor inheritance issues which kept repoze.catalog from working under Python 2.6. Note that this change involved removing the *args, **kw arguments from index constructors: those values were never used, but had (bogus) tests.
- N-Best ascending fieldindex sort was being chosen incorrectly when there was no limit. Symptom: RuntimeError, 'n-best used without limit'.
- Add reindex_doc method as an alias for index_doc to both CatalogFieldIndex and CatalogKeywordIndex (for performance, index_doc for both indexes has special case code for reindexing).
- Speed up path2 index attribute search by using __getitem__ rather than .get in some places.
- Override textindex reindex_doc method: calling index_doc only instead of calling unindex_doc and then index_doc is much more efficient.
- Attributes returned to attribute checker were not correct.
- Add “attribute discriminator” and “attribute checker” support to path2 index. If an index is created with an attribute discriminator, when any object is indexed, the value of the attribute will be stored in the path index. The path index will know that that attribute belongs to a particular path. Later, when the “attribute checker” feature of the apply or search method is used, a user-supplied attribute checker function will be able to filter the result set returned by the index. This is used by the author primarily to support security-filtered searches of a path index. It is not otherwise documented.
- Add a reindex_doc method to the catalog and to the common shared index base class. The catalog’s reindex_doc calls each index’s reindex_doc method when called. The common shared index base class implementation unindexes the docid and then subsequently indexes the document using the docid. This method can be overridden for specific indexes to do something different on a reindex.
- repoze.catalog.indexes.path2.CatalogPathIndex2 now takes an extra argument to its search method named include_path. If this is true, the docid set returned will include the docid for the path specified by the path query parameter. The apply method of the index allows for the specification of the include_path as a dictionary member in an apply call which specifies the query as a dictionary.
- Give path2.CatalogPathIndex2 index a better reindex_doc method than the default.
- The CatalogKeywordIndex’s apply method mutated the query passed in if it was a dict. To fix, we override the apply method from the zope.index implementation.
- Added a Range class importable as repoze.catalog.Range. The Range class should be used to represent a range query to a CatalogFieldIndex. The old-style of passing a 2-tuple to represent a range is still supported, but will be eventually removed in favor of requring a Range object to represent a Range query. A Range object can be instantiated ala “Range(start, end)”.
- It is now possible to pass a sequence of items to the CatalogFieldIndex apply method. When a sequence of terms that is passed in is not a tuple with two items in it (the previous API representing a range, which is deprecated), it will be considered a query for multiple terms. The docids returned for each term will be unioned together to form the result.
- It is now possible to pass a dictionary to the CatalogFieldIndex apply method. When a dictionary is passed, the member of the dictionary named query is treated as the query. It may be a single term, a sequence of terms, or a Range. An additional dictionary member named operator may also be specified: when this is specified, it must be one of or or and (the default is or). If the query specifies multiple terms, and the operator is or, the results will be unioned; if the query specifies multiple terms and the operator is and, the results will be intersected.
- A newer path index implementation importable as repoze.catalog.path2.CatalogPathIndex2 has been added as another index type. The path2 index type is an improvement inasmuch as it actually uses a graph to represent structure instead of the “levels” scheme pioneered within Zope2 (and used by repoze.catalog.path.CatalogPathIndex). By eye, the “levels” scheme looks like it can return the wrong results for any given path for a sufficiently dense tree.
- Catalog indexes must now supply an apply_intersect method; it receives a query and a set of docids (the result intersection “so far”). It should have the same sort of return value as the apply method. Indexes which inherit from common.CatalogIndex will inherit a default implementation.
- It is now possible to specify index query/merge order within a catalog query. See Index Query/Merge Order in the docs.
- Better detection of when to use fwdscan on ascending sorts in field indexes.
- Better detection of when to use nbest vs. timsort on ascending sorts in field indexes.
- Allow a new catalog search method keyword: sort_type. For ascending sorts, this can be one of nbest, fwscan, or timsort. For descending sorts, only nbest and timsort are supported. This argument allows fine-grained control of what algorithm should be chosen to perform sorting within FieldIndex code.
- Better automatic detection of which sort algorithm to use (when it’s not supplied via sort_type) based on empirical testing.
- Depend on zope.index 3.5.0 rather than any earlier version (repoze.catalog fixes migrated upstream in zope.index 3.5.0).
- Add ‘sortbench’ script to test various field index sort strategies (requires ‘benchmark’ extra to create charts).
- Prevent the potential for a zero division error when attempting to sort an empty set of results.
- Optimize the choice of fieldindex sort strategy.
- Speed up keyword index merges slightly.
- Fix a bug in the return value of the catalog: it would try to return the minimum of the number of docs or the limit event if there was no limit.
- Sean Upton pointed out that the document map code artificially limited the number of documents to half the number that it could actually handle.
- Add path index.
- Speed up keyword index ‘and’ (intersection) queries nominally by sorting intersected sets from smallest-to-largest first.
- Benchmarking suite provided by Chris Rossi.
- Add a “facet” index (repoze.catalog.indexes.facet.CatalogFacetIndex). This index is much like a keyword index, but unlike a keyword index it contains a facet list (a sequence of known colon-separated values) and accepts values that are sequences of colon-separated terms. Each term is split on its colons, forming a sequence of categories, then each concatenation of the categories is indexed. For example, if you indexed a document as ['style:gucci:handbag'], and the facet list contained 'style', 'style:gucci' and 'style:gucci:handbag', the document would be indexed three times: as style, as style:gucci and as style:gucci:handbag. Querying a facet index returns a set of document ids that match the facets passed in. A facet index also has a counts method which provided a set of document ids, returns a dictionary containing “further constraint information” for use in a narrowing UI. This count implementation is not meant for very large-scale sites; it is naive.
- Speed up keyword index ‘or’ (union) queries by using a single IFBTree.multiunion instead of multiple calls to IFBTree.union; this is most helpful for speeding up ‘or’ queries where there are lots of terms in the query sequence.
- Add overview page.
- Add repoze.catalog.document.DocumentMap class, which provides a mechanism to map “addresses” (paths) to document ids.
- Add API documentation for catalog and document map.
- Rename searchResults method to search.
- Removed updateIndex and updateIndexes methods of catalog.
- All index implementations moved into repoze.catalog.indexes.
- All interfaces moved to repoze.catalog.interfaces.
- Provide sort_index capability.
- Initial release.