Skip to main content

A Python library for algebraic search.

Project description

Boolean and Fuzzy Boolean Query Framework

NOTE: Very early stage. Intend to implement things like fuzzy boolean searching over JSON fields, etc. It should have minimal dependencies and work without building a database (the raw files will be the representation of the database). All queries are JSON already, so it should be easy to integrate it in web APIs. The below documentation is not exactly right, since I just asked an LLM to generate it and I haven't had enough time to refine it and fix it. The foundation of this work is based on Boolean algebras and homomorphisms between the queries and the results. It will have a robust command line tool for doing all of this from the command line, but it'll be a fully developed library that can be integrated into any project.

A flexible Python framework for constructing, combining, and evaluating Boolean and Fuzzy Boolean queries against a collection of documents. This framework supports both strict binary evaluations and degrees-of-membership evaluations, laying the groundwork for advanced fuzzy set-theoretic query capabilities.

Table of Contents

Features

  • Boolean Query Construction: Create complex Boolean queries using AND, OR, and NOT operations.
  • Fuzzy Boolean Query Construction: Extend Boolean queries with fuzzy logic operations and modifiers like very and somewhat.
  • Operator Overloading: Combine queries intuitively using Python's bitwise operators (&, |, ~).
  • Unified Evaluation Results: Utilize a single ResultQuery class to handle both Boolean and Fuzzy Boolean query evaluations.
  • TF-IDF Integration: Utilize Term Frequency-Inverse Document Frequency (TF-IDF) for sophisticated scoring in fuzzy queries.
  • Extensible Design: Easily extend the framework with additional modifiers or integrate more complex scoring mechanisms.

Installation

Ensure you have Python 3.7 or higher installed.

Clone the repository:

git clone https://github.com/queelius/algebraic_search.git
cd algebraic_search

Usage

Boolean Queries

Create and evaluate strict Boolean queries to determine document matches based on term presence.

from boolean_query import BooleanQuery, ResultQuery

# Initialize BooleanQuery instances
q1 = BooleanQuery("cat dog") # same as BooleanQuery("(and cat dog)")
q2 = BooleanQuery("(or fish bird)")
q3 = ~q2
q4 = q1 & q3  # Represents "(and (and cat dog) (not (or fish bird)))"

# Example documents as sets
documents = [
    {"cat", "dog"},
    {"fish"},
    {"bird"},
    {"cat", "dog", "fish"},
    {"cat", "dog", "bird"},
    {"cat"},
    {"dog"},
    {"fish", "bird"},
    {"cat", "dog", "fish", "bird"},
]

# Evaluate queries against documents
results = q4.eval(documents)
print(q4)
# Output: (and (and cat dog) (not (or fish bird)))
print(results)
# Output: ResultQuery([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

Fuzzy Boolean Queries

Construct and evaluate fuzzy Boolean queries that consider degrees of term relevance using TF-IDF scores.

from fuzzy_boolean_query import FuzzyBooleanQuery, ResultQuery

# Sample documents
documents_text = [
    "cat dog",
    "fish",
    "bird",
    "cat dog fish",
    "cat dog bird",
    "cat",
    "dog",
    "fish bird",
    "cat dog fish bird",
]

# Compute TF-IDF scores
tfidf_scores, vocabulary = FuzzyBooleanQuery.compute_tfidf(documents_text)

# Initialize FuzzyBooleanQuery with TF-IDF data
q1_fuzzy = FuzzyBooleanQuery("cat dog", tfidf_matrix=tfidf_scores, vocabulary=vocabulary)
q2_fuzzy = FuzzyBooleanQuery("(or fish bird)", tfidf_matrix=tfidf_scores, vocabulary=vocabulary)
q3_fuzzy = ~q2_fuzzy
q4_fuzzy = q1_fuzzy & q3_fuzzy  # Represents "(and (and cat dog) (not (or fish bird)))"

# Evaluate queries against documents
results = q4_fuzzy.eval(documents_text)
print(q4_fuzzy) # Output: (and (and cat dog) (not (or fish bird)))
print(results)  # Output: ResultQuery([...]) with float scores

API Documentation

BooleanQuery

A class for constructing and evaluating strict Boolean queries.

Initialization

BooleanQuery(query: Union[str, List] = None)
  • query: A query string (e.g., "cat dog", "(or fish bird)") or a list of tokens. Defaults to an empty query.

Methods

  • tokenize(query: str) -> List: Parses the query string into a nested list structure.
  • eval(docs: Union[List[Set[str]], Set[str]]) -> ResultQuery: Evaluates the query against a list of documents.
  • Operator Overloads:
    • &: Logical AND
    • |: Logical OR
    • ~: Logical NOT

Example

q = BooleanQuery("(and cat dog)")
result = q.eval([{"cat", "dog"}, {"cat"}, {"dog"}, {"cat", "dog"}])
print(result)  # ResultQuery([1.0, 0.0, 0.0, 1.0])

FuzzyBooleanQuery

A class for constructing and evaluating fuzzy Boolean queries with degrees of membership.

Initialization

FuzzyBooleanQuery(query: Union[str, List])
  • query: A query string or list of tokens.

Methods

  • tokenize(query: str) -> List: Parses the query string into a nested list structure, recognizing fuzzy modifiers.
  • eval(docs: List[str], ranker: Callable) -> ResultQuery: Evaluates the fuzzy query against a list of documents using some ranker, like a normalized TF-IDF score.
  • Operator Overloads:
    • &: Logical AND (fuzzy)
    • |: Logical OR (fuzzy)
    • ~: Logical NOT (fuzzy)

Example

q_fuzzy = FuzzyBooleanQuery("cat dog")
result_fuzzy = q_fuzzy.eval(docs, ranker=lambda term, doc: 1.0 if term in doc else 0.0)
print(result_fuzzy)  # ResultQuery([...])

ResultQuery

Represents the results of evaluating both BooleanQuery and FuzzyBooleanQuery.

Attributes

  • scores: A list of floats (0.0 or 1.0 for Boolean queries, 0.0 to 1.0 for Fuzzy Boolean queries) indicating the degree of membership for each document.

Methods

  • Operator Overloads:
    • &: Element-wise minimum
    • |: Element-wise maximum
    • ~: Element-wise complement

Example

r1 = ResultQuery([1.0, 0.0, 1.0])
r2 = ResultQuery([0.5, 1.0, 0.0])  # For BooleanQuery, these should be 0.0 or 1.0
combined = r1 & r2  # ResultQuery([0.5, 0.0, 0.0])

Note: In BooleanQuery, scores should be strictly 0.0 or 1.0. For FuzzyBooleanQuery, scores range between 0.0 and 1.0.

Formal Theory

Boolean Algebra for Queries and Results

Query Algebra (Q)

  • Elements: Q = P(T*) where T* is the set of all finite strings composed of ASCII characters. P(T*) is the power set of T*.
  • Operations:
    • AND (&): Intersection of two subsets.
    • OR (|): Union of two subsets.
    • NOT (~): Complement of a subset relative to T*.
  • Constants:
    • Empty Set ({}): Matches no documents.
    • Universal Set (T*): Matches all documents.

Result Algebra (R)

  • Elements: R = [r_1, r_2, ..., r_n] where each r_i ∈ {0.0, 1.0} for Boolean queries or r_i ∈ [0.0, 1.0] for Fuzzy Boolean queries.
  • Operations:
    • AND (&): Element-wise minimum.
    • OR (|): Element-wise maximum.
    • NOT (~): Element-wise complement (1.0 - r_i).

Homomorphism

The evaluation functions BooleanQuery.eval and FuzzyBooleanQuery.eval serve as homomorphisms φ: Q → R and φ_f: Q_f → R_f, preserving the algebraic structure:

  • φ(Q1 & Q2) = φ(Q1) & φ(Q2)
  • φ(Q1 | Q2) = φ(Q1) | φ(Q2)
  • φ(~Q1) = ~φ(Q1)
  • Similarly for fuzzy queries.

Fuzzy Boolean Algebra for Queries and Results

Fuzzy Query Algebra (Q_f)

  • Elements: Q_f = P(T*) similar to Boolean queries.
  • Operations:
    • AND (&): Fuzzy intersection using minimum.
    • OR (|): Fuzzy union using maximum.
    • NOT (~): Fuzzy complement (1.0 - x).
    • Modifiers: Such as very, somewhat, etc., to adjust degrees of membership.

Fuzzy Result Algebra (R_f)

  • Elements: R_f = [r_1, r_2, ..., r_n] where each r_i ∈ [0.0, 1.0].
  • Operations:
    • AND (&): Element-wise minimum.
    • OR (|): Element-wise maximum.
    • NOT (~): Element-wise complement (1.0 - r_i).

Homomorphism

The evaluation function eval in both BooleanQuery and FuzzyBooleanQuery acts as a homomorphism φ: Q → R and φ_f: Q_f → R_f, preserving:

  • Preservation of Operations:
    • φ(Q1 & Q2) = φ(Q1) & φ(Q2)
    • φ(Q1 | Q2) = φ(Q1) | φ(Q2)
    • φ(~Q1) = ~φ(Q1)
  • Preservation of Modifiers:
    • φ_f(very Q) = very φ_f(Q)
    • φ_f(somewhat Q) = somewhat φ_f(Q)

This ensures that the logical and fuzzy structures of queries are faithfully represented in the evaluation results.

Future Enhancements

  • Advanced Fuzzy Operators: Incorporate more sophisticated fuzzy logic operators and modifiers.
  • Custom Scoring Mechanisms: Implement alternative scoring strategies beyond TF-IDF, such as BM25.
  • Caching and Optimization: Optimize performance for large document collections through caching and efficient data structures.
  • Extended Query Language: Support more complex query constructs and natural language processing features.
  • Integration with Databases and Search Engines: Facilitate seamless integration with existing data storage and retrieval systems.

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the Repository: Click the "Fork" button on the repository page.
  2. Clone Your Fork:
    git clone https://github.com/queelius/algebraic_search.git
    cd algebraic_search
    
  3. Create a New Branch:
    git checkout -b feature/YourFeatureName
    
  4. Make Your Changes: Implement your feature or bug fix.
  5. Commit Your Changes:
    git commit -m "Add your detailed commit message"
    
  6. Push to Your Fork:
    git push origin feature/YourFeatureName
    
  7. Create a Pull Request: Navigate to the original repository and click "Compare & pull request."

Please ensure that your code adheres to the existing style and includes relevant tests.

License

This project is licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

algebraic_search-0.1.1.tar.gz (12.8 kB view details)

Uploaded Source

Built Distribution

algebraic_search-0.1.1-py3-none-any.whl (12.0 kB view details)

Uploaded Python 3

File details

Details for the file algebraic_search-0.1.1.tar.gz.

File metadata

  • Download URL: algebraic_search-0.1.1.tar.gz
  • Upload date:
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.11.5

File hashes

Hashes for algebraic_search-0.1.1.tar.gz
Algorithm Hash digest
SHA256 e3f3d2fce4f825df62e7d3b69b9f136d39562a37603d393f118fb07632cb09d2
MD5 f0f3e578c0bb7ed19a56b32360ab6a69
BLAKE2b-256 b5115399f353cc0430befe4c253d7dbefdb96c054a6ac5eb1ac06ccc1f44a597

See more details on using hashes here.

File details

Details for the file algebraic_search-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for algebraic_search-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a48ebf3283df33b1fe8e8ba701451968bfd62e1967b16e138206ce72e5b09403
MD5 3e68feeeb5ffb93e4d419dfa85704b2a
BLAKE2b-256 062369220f0892293a7e35573d0aa9ba755613042df6ba4d1a270559d1aafac1

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page