
InvertedIndex implementation using hash lists (dictionaries)



Fast and simple InvertedIndex implementation using hash lists (Python dictionaries).

Features

hashedindex provides a simple-to-use inverted index structure that maps each term to the documents it occurs in, together with occurrence counts, and is flexible enough for a wide range of use cases.

Basic Usage:

import hashedindex
index = hashedindex.HashedIndex()

index.add_term_occurrence('hello', 'document1.txt')
index.add_term_occurrence('world', 'document1.txt')

index.get_documents('hello')
Counter({'document1.txt': 1})

index.items()
{'hello': Counter({'document1.txt': 1}),
'world': Counter({'document1.txt': 1})}

example = 'The Quick Brown Fox Jumps Over The Lazy Dog'

for term in example.split():
    index.add_term_occurrence(term, 'document2.txt')

hashedindex is not limited to strings: any hashable object can be indexed.

index.add_term_occurrence('foo', 10)
index.add_term_occurrence(('fire', 'fox'), 90.2)

index.items()
{'foo': Counter({10: 1}), ('fire', 'fox'): Counter({90.2: 1})}
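
Any hashable object works because both terms and documents end up as dictionary keys. A minimal sketch of that idea, using only the standard library, might look like the following. `TinyIndex` and its internals are hypothetical names chosen for illustration and are not hashedindex's actual implementation.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """Illustrative inverted index: term -> Counter of documents.

    Hypothetical sketch, not hashedindex's own code.
    """

    def __init__(self):
        # Each term maps to a Counter tracking how often it occurs per document.
        self._terms = defaultdict(Counter)

    def add_term_occurrence(self, term, document):
        # Terms and documents are used as dict keys, so both must be hashable.
        self._terms[term][document] += 1

    def get_documents(self, term):
        # Raise KeyError for unknown terms instead of silently creating an entry.
        if term not in self._terms:
            raise KeyError(term)
        return self._terms[term]


index = TinyIndex()
index.add_term_occurrence('foo', 10)
index.add_term_occurrence(('fire', 'fox'), 90.2)
index.add_term_occurrence('foo', 10)

print(index.get_documents('foo'))            # Counter({10: 2})
print(index.get_documents(('fire', 'fox')))  # Counter({90.2: 1})
```

In this sketch, passing an unhashable term such as a list raises a TypeError, which is the same constraint any dictionary-backed index has.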

hashedindex was originally designed as a quick and easy way to generate feature matrices for machine learning with numpy, pandas and scikit-learn. For example:

import hashedindex
import numpy as np

index = hashedindex.HashedIndex()

documents = ['spam1.txt', 'ham1.txt', 'spam2.txt']
for doc in documents:
    with open(doc, 'r') as fp:
        for term in fp.read().split():
            index.add_term_occurrence(term, doc)

# You *probably* want to use scipy.sparse.csr_matrix for better performance
# np.asarray converts the list-of-lists feature matrix into a 2D array
X = np.asarray(index.generate_feature_matrix(mode='tfidf'))

# Label each document: 1 for spam, 0 otherwise
y = []
for doc in index.documents():
    y.append(1 if 'spam' in doc else 0)
y = np.asarray(y)

from sklearn.svm import SVC
classifier = SVC(kernel='linear')
classifier.fit(X, y)
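
The `tfidf` mode weights each term count by how rare the term is across documents. The exact formula hashedindex uses is not stated here, so the sketch below assumes one common variant (raw count times `log(N / df)`). The `index` dictionary and the `tfidf` helper are hypothetical stand-ins built with the standard library only, and the counts are made-up illustrative data.

```python
import math
from collections import Counter

# Hypothetical stand-in for an inverted index: term -> Counter({document: count}).
index = {
    'buy':   Counter({'spam1.txt': 3, 'spam2.txt': 1}),
    'hello': Counter({'spam1.txt': 1, 'ham1.txt': 2, 'spam2.txt': 1}),
    'lunch': Counter({'ham1.txt': 1}),
}
documents = ['spam1.txt', 'ham1.txt', 'spam2.txt']
terms = sorted(index)


def tfidf(term, document):
    # Assumed weighting: raw term count times log(N / document frequency).
    tf = index[term][document]  # a Counter returns 0 for missing documents
    df = len(index[term])       # number of documents containing the term
    return tf * math.log(len(documents) / df)


# One row per document, one column per term (in sorted term order).
X = [[tfidf(term, doc) for term in terms] for doc in documents]

for doc, row in zip(documents, X):
    print(doc, [round(value, 3) for value in row])
```

In this toy data `hello` occurs in every document, so its weight is zero in every row, while `buy` and `lunch` keep non-zero weights only in the documents that contain them.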

You can also extend your feature matrix to a more verbose pandas DataFrame:

import pandas as pd
X  = index.generate_feature_matrix(mode='tfidf')
df = pd.DataFrame(X, columns=index.terms(), index=index.documents())

All methods have high test coverage, so regressions are likely to be caught early.

Found a bug? Nice, a bug found is a bug fixed. Open an issue or, better yet, a pull request.

History

0.1.0 (2015-01-11)

  • First release on PyPI.


Source Distribution

hashedindex-0.3.0.tar.gz (17.5 kB)
