repoze.pgtextindex is an indexing plugin for repoze.catalog
that provides a text search engine based on the powerful text indexing
capabilities of PostgreSQL 8.4 and above. It is designed to take the
place of any text search index based on zope.index. Installation
typically requires few or no changes to code that already uses
The advantages of repoze.pgtextindex over zope.index.text are:
- Performance. For large datasets, repoze.pgtextindex can be
orders of magnitude faster than zope.index, mainly because
repoze.pgtextindex does not have the overhead of unpickling
objects that zope.index has.
- Lower RAM consumption. Users of zope.index work around the
unpickling overhead by keeping large caches of unpickled objects
in RAM. Even worse, each thread keeps its own copy of the object
cache. PostgreSQL, on the other hand, does not need to maintain
complex structures in RAM. The PostgreSQL process size tends to
be constant and reasonable.
- Maintenance. The text indexing features of PostgreSQL are well
documented and receive a great deal of active maintenance, while
zope.index has not received much developer attention for
repoze.pgtextindex does not cause PostgreSQL to be involved in
every catalog query and update. Only operations that use or change the
text index hit PostgreSQL.
repoze.pgtextindex is used just like any other index in
from repoze.pgtextindex import PGTextIndex
index = PGTextIndex(
The arguments to the constructor are as follows:
- The repoze.catalog discriminator for this index. For more
information on discriminators see the repoze.catalog documentation.
This argument is required.
- The connection string for connecting to PostgreSQL. This argument is required.
- The table to use for the index. The default is ‘pgtextindex’.
- The PostgreSQL text search configuration to use for the index. The
default is ‘english’, the built-in configuration that ships with
PostgreSQL. For more information on text search configuration, see the
PostgreSQL full text search documentation.
- If True, the table and index will be dropped (if they exist) and
(re)created. The default is False.
- The maximum number of characters to index per document. The default is
1048575 (2**20 - 1), which is the maximum allowed by the to_tsvector
function. Reduce this to improve query speed, since the
ts_rank_cd function retrieves and decompresses entire TOAST tuples
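The defaults listed above can be collected into a small helper. This is only a sketch: the keyword names `table`, `ts_config` and `drop_and_create` are assumptions inferred from the descriptions (only `maxlen` is named in this document), and `index_options` and `make_index` are hypothetical helpers, not part of the package.

```python
# Defaults taken from the argument descriptions above.
# NOTE: the keyword names `table`, `ts_config` and `drop_and_create` are
# assumptions; only `maxlen` is named in this document.
DEFAULT_OPTIONS = {
    "table": "pgtextindex",      # table that stores the index
    "ts_config": "english",      # PostgreSQL text search configuration
    "drop_and_create": False,    # True drops and recreates the table
    "maxlen": 2 ** 20 - 1,       # characters indexed per document
}


def index_options(**overrides):
    """Return constructor options, with defaults overridden by the caller."""
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise TypeError("unknown option(s): %s" % ", ".join(sorted(unknown)))
    options = dict(DEFAULT_OPTIONS)
    options.update(overrides)
    return options


def make_index(discriminator, dsn, **overrides):
    """Build a PGTextIndex; needs repoze.pgtextindex and a reachable database."""
    # Imported lazily so index_options() works without the package installed.
    from repoze.pgtextindex import PGTextIndex
    return PGTextIndex(discriminator, dsn, **index_options(**overrides))
```

For example, `make_index('text', 'dbname=catalog', maxlen=100000)` would ask for at most 100,000 characters to be indexed per document; the attribute name `'text'` and the connection string are placeholders.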
- WeightedQueries can now be used as query result caches, making it
possible to search the catalog many times while hitting the text
index only once.
- When a query generates a large number of results, pgtextindex now disables
the expensive text ranking for that query. The max_ranked attribute
controls the threshold for disabling ranking. The default max_ranked
value is 6000.
- Improved speed by using BTrees instead of Buckets and by using
cursor.fetchall() instead of iter(cursor).
- Changed the ‘marker’ column to an array and changed the ‘marker’ attribute of
‘IWeightedQuery’ to accept either a single marker string or a sequence of
marker strings. Since the database schema has changed,
‘PGTextIndex.upgrade()’ will need to be run on any indexes created with an
older version of the code. (LP #1353483)
- Improved query speed by about 10% by duplicating the query parameter
rather than joining with the query.
- Added the maxlen option to allow a configurable document size limit.
- Handle concurrent index updates cleanly.
- Retry on IntegrityError to avoid meaningless errors.
- Added metrics using the perfmetrics package.
- Switched to read committed isolation and removed explicit locking.
The explicit locking was reducing write performance and may have been
interfering with autovacuum. This change raises the probability
of temporary inconsistency, but since this package did not provide
ACID compliance anyway, developers already need to be prepared for
- Truncate text to 1MB per document to stay under the (silly) limit
imposed by PostgreSQL.
- Fixed PostgreSQL ProgrammingError when query string contains a backslash
character. (LP #798725)
- Added ability to mark content with arbitrary markers which can be used as
discriminators at query time. (LP #792334)
- Support searches for words containing an apostrophe. (LP #801265)
- Reworked the scoring method: added a per-document score coefficient.
The score coefficient can boost the score of documents known to be
- Added the IWeightedText interface. The discriminator function can
return an IWeightedText instance to control the weights and
- Added the IWeightedQuery interface. Text index queries can
pass an IWeightedQuery instance to control the weight values.
- Allow persistent objects to be indexed, since the usual objection
(accidental ZODB references) does not apply.
- Do not drop and create the table by default, making PGTextIndex
easier to use outside ZODB.
- Added the ‘get_contextual_summaries’ and ‘get_contextual_summary’
methods to the index.
- Compatibility with repoze.catalog 0.8.0.
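The max_ranked cutoff mentioned in the changelog can be pictured roughly as follows. This is an illustrative sketch, not the package's implementation: `score_results` and `rank` are hypothetical names, and whether the real comparison is strict and what score unranked results receive are assumptions.

```python
DEFAULT_MAX_RANKED = 6000  # default threshold quoted in the changelog


def score_results(docids, rank, max_ranked=DEFAULT_MAX_RANKED):
    """Return (docid, score) pairs, skipping ranking for oversized result sets.

    `rank` is a hypothetical callable standing in for the expensive
    per-document ranking step.
    """
    docids = list(docids)
    if len(docids) > max_ranked:
        # Too many hits: give every document the same score instead of ranking.
        return [(docid, 1.0) for docid in docids]
    return [(docid, rank(docid)) for docid in docids]
```

Raising the threshold ranks more result sets at a higher query cost; lowering it makes more queries fall back to unranked results.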