Text retrieval and analytics engine.

Project description

What is Caterpillar?

Caterpillar is a pure Python text indexing and analytics library. Some features include:

  • pluggable key/value object store for storage (SQLite is currently the only implementation)
  • transaction layer for reading/writing (along with associated locking semantics)
  • supports searching indexes with some built-in scoring algorithm implementations (including TF/IDF)
  • stores additional data structures for analytics above and beyond traditional information retrieval data structures
  • has a plugin architecture for quickly accessing the data structures and performing custom analytics
  • has 100% test coverage
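
To make the TF/IDF scoring mentioned in the feature list concrete, here is a minimal standalone sketch. It does not use Caterpillar and is not its scoring implementation; the function names, the naive whitespace tokeniser and the unsmoothed log(N / df) weighting are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace (a deliberately naive tokeniser)."""
    return text.lower().split()


def tf_idf_scores(query, documents):
    """Score each document against the query with a basic TF/IDF sum.

    tf  = term count in the document / document length
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term] == 0:
                continue  # term absent from this document
            tf = counts[term] / float(len(tokens))
            idf = math.log(n_docs / float(df[term]))
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical toy corpus, for illustration only.
docs = [
    "call me ishmael",
    "the whale the white whale",
    "a ship and a captain",
]
scores = tf_idf_scores("white whale", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

Only the second document contains either query term, so it is the only one with a non-zero score. A production scorer would normally add smoothing and length normalisation, which this sketch leaves out.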

Quick Example

A quick example of using Caterpillar:

import os
import tempfile

from caterpillar.processing.index import IndexWriter, IndexConfig
from caterpillar.processing.schema import TEXT, Schema, NUMERIC
from import SqliteStorage

index_dir = os.path.join(tempfile.mkdtemp(), "examples")
with open('caterpillar/test_resources/moby.txt', 'r') as f:
    data = f.read()
    with IndexWriter(index_dir, IndexConfig(SqliteStorage, Schema(text=TEXT, some_number=NUMERIC))) as writer:
        writer.add_document(text=data, some_number=1)


Installation

pip install caterpillar


Documentation

The documentation can be found here.


Roadmap

We are working on porting our issues from our internal issue tracker to a more visible system. For the time being, here is a general roadmap:

  • Move to (possibly only) Python 3 (see below).
  • Revamp schema and field design.
  • Add a memory storage implementation.
  • Revamp query design.
  • Remove the NLTK dependency (great library, but only used for tokenisation).
  • Switch index structures over to a more efficient data structure (possibly numpy arrays or similar).
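
As a rough illustration of what the planned memory storage item could involve, the sketch below shows a dictionary-backed key/value store with begin/commit/rollback. It is purely hypothetical: the class name and methods are invented here and do not reflect Caterpillar's actual storage interface.

```python
class MemoryStore(object):
    """Hypothetical in-memory key/value store with simple transactions."""

    def __init__(self):
        self._data = {}       # committed state
        self._pending = None  # uncommitted writes; None outside a transaction

    def begin(self):
        if self._pending is not None:
            raise RuntimeError("transaction already open")
        self._pending = {}

    def set(self, key, value):
        if self._pending is None:
            raise RuntimeError("no open transaction")
        self._pending[key] = value

    def get(self, key, default=None):
        # Reads see uncommitted writes from the open transaction first.
        if self._pending is not None and key in self._pending:
            return self._pending[key]
        return self._data.get(key, default)

    def commit(self):
        if self._pending is None:
            raise RuntimeError("no open transaction")
        self._data.update(self._pending)
        self._pending = None

    def rollback(self):
        # Discard uncommitted writes; committed state is untouched.
        self._pending = None


store = MemoryStore()
store.begin()
store.set("a", 1)
store.commit()

store.begin()
store.set("a", 2)
store.rollback()  # the write of 2 is discarded
print(store.get("a"))
```

A store like this would mainly be useful for tests and short-lived indexes, since nothing is persisted to disk.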

The current plan is to move to using GitHub issues with HuBoard, but stay tuned.

Python Version

Currently Python 2.7+ only. Work is underway to support Python 3+. WARNING: Caterpillar might become Python 3+ only in the future. Stay tuned.


Anyone who is willing! In other words, none yet, but we are more than happy to accept contributions.


No code will be merged unless it has 100% test coverage and passes pep8. We code with a line length of 120 characters (see the [pep8] section of tox.ini) and use py.test for testing. Tests live in a test sub-folder in each package. We generally run coverage as follows:

coverage erase; coverage run --source caterpillar -m py.test -v caterpillar; coverage report

Download files

Filename                                   Size      File type   Python version   Upload date
caterpillar-1.0.0.dev17-py2-none-any.whl   79.6 kB   Wheel       py2              Mar 20, 2017
caterpillar-1.0.0.dev17.tar.gz             59.5 kB   Source      None             Mar 20, 2017