A Python library for algebraic search.
Project description
Boolean and Fuzzy Boolean Query Framework
NOTE: Very early stage. Intend to implement things like fuzzy boolean searching over JSON fields, etc. It should have minimal dependencies and work without building a database (the raw files will be the representation of the database). All queries are JSON already, so it should be easy to integrate it in web APIs. The below documentation is not exactly right, since I just asked an LLM to generate it and I haven't had enough time to refine it and fix it. The foundation of this work is based on Boolean algebras and homomorphisms between the queries and the results. It will have a robust command line tool for doing all of this from the command line, but it'll be a fully developed library that can be integrated into any project.
A flexible Python framework for constructing, combining, and evaluating Boolean and Fuzzy Boolean queries against a collection of documents. This framework supports both strict binary evaluations and degrees-of-membership evaluations, laying the groundwork for advanced fuzzy set-theoretic query capabilities.
Table of Contents
- Features
- Installation
- Usage
- API Documentation
- Formal Theory
- Future Enhancements
- Contributing
- License
Features
- Boolean Query Construction: Create complex Boolean queries using
AND
,OR
, andNOT
operations. - Fuzzy Boolean Query Construction: Extend Boolean queries with fuzzy logic operations and modifiers like
very
andsomewhat
. - Operator Overloading: Combine queries intuitively using Python's bitwise operators (
&
,|
,~
). - Unified Evaluation Results: Utilize a single
ResultQuery
class to handle both Boolean and Fuzzy Boolean query evaluations. - TF-IDF Integration: Utilize Term Frequency-Inverse Document Frequency (TF-IDF) for sophisticated scoring in fuzzy queries.
- Extensible Design: Easily extend the framework with additional modifiers or integrate more complex scoring mechanisms.
Installation
Ensure you have Python 3.7 or higher installed.
Clone the repository:
git clone https://github.com/queelius/algebraic_search.git
cd algebraic_search
Usage
Boolean Queries
Create and evaluate strict Boolean queries to determine document matches based on term presence.
from boolean_query import BooleanQuery, ResultQuery
# Initialize BooleanQuery instances
q1 = BooleanQuery("cat dog") # same as BooleanQuery("(and cat dog)")
q2 = BooleanQuery("(or fish bird)")
q3 = ~q2
q4 = q1 & q3 # Represents "(and (and cat dog) (not (or fish bird)))"
# Example documents as sets
documents = [
{"cat", "dog"},
{"fish"},
{"bird"},
{"cat", "dog", "fish"},
{"cat", "dog", "bird"},
{"cat"},
{"dog"},
{"fish", "bird"},
{"cat", "dog", "fish", "bird"},
]
# Evaluate queries against documents
results = q4.eval(documents)
print(q4)
# Output: (and (and cat dog) (not (or fish bird)))
print(results)
# Output: ResultQuery([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
Fuzzy Boolean Queries
Construct and evaluate fuzzy Boolean queries that consider degrees of term relevance using TF-IDF scores.
from fuzzy_boolean_query import FuzzyBooleanQuery, ResultQuery
# Sample documents
documents_text = [
"cat dog",
"fish",
"bird",
"cat dog fish",
"cat dog bird",
"cat",
"dog",
"fish bird",
"cat dog fish bird",
]
# Compute TF-IDF scores
tfidf_scores, vocabulary = FuzzyBooleanQuery.compute_tfidf(documents_text)
# Initialize FuzzyBooleanQuery with TF-IDF data
q1_fuzzy = FuzzyBooleanQuery("cat dog", tfidf_matrix=tfidf_scores, vocabulary=vocabulary)
q2_fuzzy = FuzzyBooleanQuery("(or fish bird)", tfidf_matrix=tfidf_scores, vocabulary=vocabulary)
q3_fuzzy = ~q2_fuzzy
q4_fuzzy = q1_fuzzy & q3_fuzzy # Represents "(and (and cat dog) (not (or fish bird)))"
# Evaluate queries against documents
results = q4_fuzzy.eval(documents_text)
print(q4_fuzzy) # Output: (and (and cat dog) (not (or fish bird)))
print(results) # Output: ResultQuery([...]) with float scores
API Documentation
BooleanQuery
A class for constructing and evaluating strict Boolean queries.
Initialization
BooleanQuery(query: Union[str, List] = None)
query
: A query string (e.g.,"cat dog"
,"(or fish bird)"
) or a list of tokens. Defaults to an empty query.
Methods
tokenize(query: str) -> List
: Parses the query string into a nested list structure.eval(docs: Union[List[Set[str]], Set[str]]) -> ResultQuery
: Evaluates the query against a list of documents.- Operator Overloads:
&
: Logical AND|
: Logical OR~
: Logical NOT
Example
q = BooleanQuery("(and cat dog)")
result = q.eval([{"cat", "dog"}, {"cat"}, {"dog"}, {"cat", "dog"}])
print(result) # ResultQuery([1.0, 0.0, 0.0, 1.0])
FuzzyBooleanQuery
A class for constructing and evaluating fuzzy Boolean queries with degrees of membership.
Initialization
FuzzyBooleanQuery(query: Union[str, List])
query
: A query string or list of tokens.
Methods
tokenize(query: str) -> List
: Parses the query string into a nested list structure, recognizing fuzzy modifiers.eval(docs: List[str], ranker: Callable) -> ResultQuery
: Evaluates the fuzzy query against a list of documents using some ranker, like a normalized TF-IDF score.- Operator Overloads:
&
: Logical AND (fuzzy)|
: Logical OR (fuzzy)~
: Logical NOT (fuzzy)
Example
q_fuzzy = FuzzyBooleanQuery("cat dog")
result_fuzzy = q_fuzzy.eval(docs, ranker=lambda term, doc: 1.0 if term in doc else 0.0)
print(result_fuzzy) # ResultQuery([...])
ResultQuery
Represents the results of evaluating both BooleanQuery
and FuzzyBooleanQuery
.
Attributes
scores
: A list of floats (0.0
or1.0
for Boolean queries,0.0
to1.0
for Fuzzy Boolean queries) indicating the degree of membership for each document.
Methods
- Operator Overloads:
&
: Element-wise minimum|
: Element-wise maximum~
: Element-wise complement
Example
r1 = ResultQuery([1.0, 0.0, 1.0])
r2 = ResultQuery([0.5, 1.0, 0.0]) # For BooleanQuery, these should be 0.0 or 1.0
combined = r1 & r2 # ResultQuery([0.5, 0.0, 0.0])
Note: In BooleanQuery
, scores should be strictly 0.0
or 1.0
. For FuzzyBooleanQuery
, scores range between 0.0
and 1.0
.
Formal Theory
Boolean Algebra for Queries and Results
Query Algebra (Q
)
- Elements:
Q = P(T*)
whereT*
is the set of all finite strings composed of ASCII characters.P(T*)
is the power set ofT*
. - Operations:
- AND (
&
): Intersection of two subsets. - OR (
|
): Union of two subsets. - NOT (
~
): Complement of a subset relative toT*
.
- AND (
- Constants:
- Empty Set (
{}
): Matches no documents. - Universal Set (
T*
): Matches all documents.
- Empty Set (
Result Algebra (R
)
- Elements:
R = [r_1, r_2, ..., r_n]
where eachr_i
∈ {0.0, 1.0} for Boolean queries orr_i
∈ [0.0, 1.0] for Fuzzy Boolean queries. - Operations:
- AND (
&
): Element-wise minimum. - OR (
|
): Element-wise maximum. - NOT (
~
): Element-wise complement (1.0 - r_i
).
- AND (
Homomorphism
The evaluation functions BooleanQuery.eval
and FuzzyBooleanQuery.eval
serve as homomorphisms φ: Q → R
and φ_f: Q_f → R_f
, preserving the algebraic structure:
φ(Q1 & Q2) = φ(Q1) & φ(Q2)
φ(Q1 | Q2) = φ(Q1) | φ(Q2)
φ(~Q1) = ~φ(Q1)
- Similarly for fuzzy queries.
Fuzzy Boolean Algebra for Queries and Results
Fuzzy Query Algebra (Q_f
)
- Elements:
Q_f = P(T*)
similar to Boolean queries. - Operations:
- AND (
&
): Fuzzy intersection using minimum. - OR (
|
): Fuzzy union using maximum. - NOT (
~
): Fuzzy complement (1.0 - x
). - Modifiers: Such as
very
,somewhat
, etc., to adjust degrees of membership.
- AND (
Fuzzy Result Algebra (R_f
)
- Elements:
R_f = [r_1, r_2, ..., r_n]
where eachr_i
∈ [0.0, 1.0]. - Operations:
- AND (
&
): Element-wise minimum. - OR (
|
): Element-wise maximum. - NOT (
~
): Element-wise complement (1.0 - r_i
).
- AND (
Homomorphism
The evaluation function eval
in both BooleanQuery
and FuzzyBooleanQuery
acts as a homomorphism φ: Q → R
and φ_f: Q_f → R_f
, preserving:
- Preservation of Operations:
φ(Q1 & Q2) = φ(Q1) & φ(Q2)
φ(Q1 | Q2) = φ(Q1) | φ(Q2)
φ(~Q1) = ~φ(Q1)
- Preservation of Modifiers:
φ_f(very Q) = very φ_f(Q)
φ_f(somewhat Q) = somewhat φ_f(Q)
This ensures that the logical and fuzzy structures of queries are faithfully represented in the evaluation results.
Future Enhancements
- Advanced Fuzzy Operators: Incorporate more sophisticated fuzzy logic operators and modifiers.
- Custom Scoring Mechanisms: Implement alternative scoring strategies beyond TF-IDF, such as BM25.
- Caching and Optimization: Optimize performance for large document collections through caching and efficient data structures.
- Extended Query Language: Support more complex query constructs and natural language processing features.
- Integration with Databases and Search Engines: Facilitate seamless integration with existing data storage and retrieval systems.
Contributing
Contributions are welcome! Please follow these steps:
- Fork the Repository: Click the "Fork" button on the repository page.
- Clone Your Fork:
git clone https://github.com/queelius/algebraic_search.git cd algebraic_search
- Create a New Branch:
git checkout -b feature/YourFeatureName
- Make Your Changes: Implement your feature or bug fix.
- Commit Your Changes:
git commit -m "Add your detailed commit message"
- Push to Your Fork:
git push origin feature/YourFeatureName
- Create a Pull Request: Navigate to the original repository and click "Compare & pull request."
Please ensure that your code adheres to the existing style and includes relevant tests.
License
This project is licensed under the MIT License.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file algebraic_search-0.1.1.tar.gz
.
File metadata
- Download URL: algebraic_search-0.1.1.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.11.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | e3f3d2fce4f825df62e7d3b69b9f136d39562a37603d393f118fb07632cb09d2 |
|
MD5 | f0f3e578c0bb7ed19a56b32360ab6a69 |
|
BLAKE2b-256 | b5115399f353cc0430befe4c253d7dbefdb96c054a6ac5eb1ac06ccc1f44a597 |
File details
Details for the file algebraic_search-0.1.1-py3-none-any.whl
.
File metadata
- Download URL: algebraic_search-0.1.1-py3-none-any.whl
- Upload date:
- Size: 12.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.11.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a48ebf3283df33b1fe8e8ba701451968bfd62e1967b16e138206ce72e5b09403 |
|
MD5 | 3e68feeeb5ffb93e4d419dfa85704b2a |
|
BLAKE2b-256 | 062369220f0892293a7e35573d0aa9ba755613042df6ba4d1a270559d1aafac1 |