InvertedIndex implementation using hash lists (dictionaries)
Fast and simple InvertedIndex implementation using hash lists (python dictionaries).
Free software: BSD license
Documentation: https://hashedindex.readthedocs.org.
Features
hashedindex provides a simple-to-use inverted index structure that is flexible enough for a wide range of use cases.
Basic Usage:
import hashedindex
index = hashedindex.HashedIndex()
index.add_term_occurrence('hello', 'document1.txt')
index.add_term_occurrence('world', 'document1.txt')
index.get_documents('hello')
Counter({'document1.txt': 1})
index.items()
{'hello': Counter({'document1.txt': 1}),
'world': Counter({'document1.txt': 1})}
example = 'The Quick Brown Fox Jumps Over The Lazy Dog'
for term in example.split():
    index.add_term_occurrence(term, 'document2.txt')
hashedindex is not limited to strings: any hashable object can be indexed.
index.add_term_occurrence('foo', 10)
index.add_term_occurrence(('fire', 'fox'), 90.2)
index.items()
{'foo': Counter({10: 1}), ('fire', 'fox'): Counter({90.2: 1})}
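The structure behind these calls can be pictured as a dictionary that maps each term to a Counter of the documents it occurs in. The following is a minimal sketch of that idea using only the standard library. The class name `TinyIndex` is hypothetical, and this is an illustration of the concept rather than hashedindex's actual source.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """Illustrative inverted index: term -> Counter({document: frequency})."""

    def __init__(self):
        self._terms = defaultdict(Counter)

    def add_term_occurrence(self, term, document):
        # Terms and documents are used as dictionary keys,
        # so any hashable object is accepted for either.
        self._terms[term][document] += 1

    def get_documents(self, term):
        # .get avoids creating an empty entry for unknown terms.
        return self._terms.get(term, Counter())

    def items(self):
        return dict(self._terms)


tiny = TinyIndex()
for word in 'the quick fox saw the lazy dog'.split():
    tiny.add_term_occurrence(word, 'document2.txt')
tiny.add_term_occurrence(('fire', 'fox'), 90.2)

print(tiny.get_documents('the'))
print(tiny.get_documents(('fire', 'fox')))
```

Because lookups and updates are plain dictionary operations, adding an occurrence or fetching the documents for a term takes constant time on average.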
hashedindex was originally designed as a quick and easy way to generate feature matrices for machine learning, for use alongside numpy, pandas and scikit-learn. For example:
import hashedindex
import numpy as np
index = hashedindex.HashedIndex()
documents = ['spam1.txt', 'ham1.txt', 'spam2.txt']
for doc in documents:
    with open(doc, 'r') as fp:
        for term in fp.read().split():
            index.add_term_occurrence(term, doc)
# You *probably* want to use scipy.sparse.csr_matrix for better performance
X = np.asarray(index.generate_feature_matrix(mode='tfidf'))
# Label each document: 1 if its name contains 'spam', otherwise 0
y = []
for doc in index.documents():
    y.append(1 if 'spam' in doc else 0)
y = np.asarray(y)
from sklearn.svm import SVC
classifier = SVC(kernel='linear')
classifier.fit(X, y)
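To make the shape of such a feature matrix concrete, here is a small standalone sketch that builds one row per document and one column per term from a term-to-Counter mapping. It assumes a textbook tf-idf weighting (raw term frequency times log(N / document frequency)); the exact formula used by `mode='tfidf'` in hashedindex may differ, and the `postings` data and `tfidf` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical term -> Counter({document: frequency}) mapping.
postings = {
    'buy': Counter({'spam1.txt': 2}),
    'now': Counter({'spam1.txt': 1, 'ham1.txt': 1}),
    'hello': Counter({'ham1.txt': 1}),
}

terms = sorted(postings)
documents = sorted({doc for counts in postings.values() for doc in counts})


def tfidf(term, doc):
    # Assumed weighting: term frequency * log(N / document frequency).
    tf = postings[term][doc]  # Counter returns 0 for absent documents
    idf = math.log(len(documents) / len(postings[term]))
    return tf * idf


# One row per document, one column per term.
matrix = [[tfidf(term, doc) for term in terms] for doc in documents]

for doc, row in zip(documents, matrix):
    print(doc, [round(value, 3) for value in row])
```

Note that a term appearing in every document ('now' here) receives a weight of zero under this weighting, since it does not help to tell documents apart.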
You can also extend your feature matrix to a more verbose pandas DataFrame:
import pandas as pd
X = index.generate_feature_matrix(mode='tfidf')
df = pd.DataFrame(X, columns=index.terms(), index=index.documents())
All methods within the code have high test coverage so you can be sure everything works as expected.
Found a bug? Nice, a bug found is a bug fixed. Open an Issue or better yet, open a pull request.
History
0.1.0 (2015-01-11)
First release on PyPI.