Searchable pandas text extension arrays for prototyping search
SearchArray
SearchArray turns Pandas string columns into a term index. It allows efficient BM25 scoring of phrases and individual tokens.
Think Lucene, but as a Pandas column.
In[3]: df['title_indexed'] = PostingsArray.index(df['title'])
np.sort(df['title_indexed'].array.bm25('Cat'))
Out[3]: array([ 0. , 0. , 0. , ..., 15.84568033,
15.84568033, 15.84568033])
Installation
pip install searcharray
Motivation
Why do we treat Lucene-based and other lexical search systems like special snowflakes in the data stack? Many ML practitioners reach for a vector search solution, then realize they need to sprinkle in some degree of traditional lexical matching for the best solution. Indeed, in search, hybrid vector+lexical solutions have been shown to be the most performant.
Let's break down the esoteric mystique of these systems, and tame them, so they just behave like other parts of the data stack.
SearchArray creates a Pandas-centric way of creating and using a search index as just part of a Pandas array. In a sense, it builds a search engine in Pandas - to allow anyone to prototype ideas, without external systems.
You can see a full end-to-end search relevance experiment in this colab notebook
For example, take a dataframe that has a bunch of text, like movie titles and overviews:
In[1]: df = pd.DataFrame({'title': titles, 'overview': overviews}, index=ids)
Out[1]:
title overview
374430 Black Mirror: White Christmas This feature-length special consists of three ...
19404 The Brave-Hearted Will Take the Bride Raj is a rich, carefree, happy-go-lucky second...
278 The Shawshank Redemption Framed in the 1940s for the double murder of h...
372058 Your Name. High schoolers Mitsuha and Taki are complete s...
238 The Godfather Spanning the years 1945 to 1955, a chronicle o...
... ... ...
65513 They Came Back The lives of the residents of a small French t...
65515 The Eleventh Hour An ex-Navy SEAL, Michael Adams, (Matthew Reese...
65521 Pyaar Ka Punchnama Outspoken and overly critical Nishant Agarwal ...
32767 Romero Romero is a compelling and deeply moving look ...
Index the text:
In[2]: df['title_indexed'] = PostingsArray.index(df['title'])
df
Out[2]:
title overview title_indexed
374430 Black Mirror: White Christmas This feature-length special consists of three ... PostingsRow({'Black': 1, 'Mirror:': 1, 'White'...
19404 The Brave-Hearted Will Take the Bride Raj is a rich, carefree, happy-go-lucky second... PostingsRow({'The': 1, 'Brave-Hearted': 1, 'Wi...
278 The Shawshank Redemption Framed in the 1940s for the double murder of h... PostingsRow({'The': 1, 'Shawshank': 1, 'Redemp...
372058 Your Name. High schoolers Mitsuha and Taki are complete s... PostingsRow({'Your': 1, 'Name.': 1}, {'Your': ...
238 The Godfather Spanning the years 1945 to 1955, a chronicle o... PostingsRow({'The': 1, 'Godfather': 1}, {'The'...
... ... ... ...
65513 They Came Back The lives of the residents of a small French t... PostingsRow({'Back': 1, 'They': 1, 'Came': 1},...
65515 The Eleventh Hour An ex-Navy SEAL, Michael Adams, (Matthew Reese... PostingsRow({'The': 1, 'Hour': 1, 'Eleventh': ...
65521 Pyaar Ka Punchnama Outspoken and overly critical Nishant Agarwal ... PostingsRow({'Ka': 1, 'Pyaar': 1, 'Punchnama':...
32767 Romero Romero is a compelling and deeply moving look ... PostingsRow({'Romero': 1}, {'Romero': [0]})
65534 Poison Paul Braconnier and his wife Blandine only hav... PostingsRow({'Poison': 1}, {'Poison': [0]})
Then search, getting the top N results for "Cat":
In[3]: np.sort(df['title_indexed'].array.bm25('Cat'))
Out[3]: array([ 0. , 0. , 0. , ..., 15.84568033,
15.84568033, 15.84568033])
In[4]: top_n_cat = df['title_indexed'].array.bm25('Cat').argsort()
top_n_cat
Out[4]:
array([0, 18561, 18560, ..., 15038, 19012, 4392])
And since it's just Pandas, we can, of course, just retrieve the top matches:
In[5]: df.iloc[top_n_cat[-10:]]
Out[5]:
title overview title_indexed
24106 The Black Cat American honeymooners in Hungary are trapped i... PostingsRow({'Black': 1, 'The': 1, 'Cat': 1}, ...
12593 Fritz the Cat A hypocritical swinging college student cat ra... PostingsRow({'Cat': 1, 'the': 1, 'Fritz': 1}, ...
39853 The Cat Concerto Tom enters from stage left in white tie and ta... PostingsRow({'The': 1, 'Cat': 1, 'Concerto': 1...
75491 The Rabbi's Cat Based on the best-selling graphic novel by Joa... PostingsRow({'The': 1, 'Cat': 1, "Rabbi's": 1}...
57353 Cat Run When a sexy, high-end escort holds the key evi... PostingsRow({'Cat': 1, 'Run': 1}, {'Cat': [0],...
25508 Cat People Sketch artist Irena Dubrovna (Simon) and Ameri... PostingsRow({'Cat': 1, 'People': 1}, {'Cat': [...
11694 Cat Ballou A woman seeking revenge for her murdered fathe... PostingsRow({'Cat': 1, 'Ballou': 1}, {'Cat': [...
25078 Cat Soup The surreal black comedy follows Nyatta, an an... PostingsRow({'Cat': 1, 'Soup': 1}, {'Cat': [0]...
35888 Cat Chaser A Miami hotel owner finds danger when be becom... PostingsRow({'Cat': 1, 'Chaser': 1}, {'Cat': [...
6217 Cat People After years of separation, Irina (Nastassja Ki... PostingsRow({'Cat': 1, 'People': 1}, {'Cat': [...
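The bm25 scores above follow the standard Okapi BM25 formula (with a Lucene-style IDF). As a rough, pure-Python illustration of how a single term is scored (this is a sketch of the formula, not SearchArray's actual implementation; the function name and defaults are assumptions):

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Okapi BM25 for a single term in a single document.

    tf: term frequency in the document
    doc_len: document length in tokens
    avg_doc_len: average document length across the corpus
    doc_freq: number of documents containing the term
    num_docs: total number of documents
    """
    # Lucene-style IDF: always non-negative, rewards rare terms
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term-frequency saturation with length normalization
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A term appearing twice in a short doc outscores one hit in a long doc
short = bm25_score(tf=2, doc_len=5, avg_doc_len=10, doc_freq=3, num_docs=1000)
long_ = bm25_score(tf=1, doc_len=50, avg_doc_len=10, doc_freq=3, num_docs=1000)
```

This is why the score arrays above contain many exact zeros (documents that never mention the term) alongside a tail of identical positive scores for short titles with one occurrence.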
More use cases can be seen in the colab notebook
Goals
The overall goals are to recreate a lot of the lexical features (term / phrase search) of a search engine like Solr or Elasticsearch, but in a Pandas dataframe.
Memory efficient and fast text index
We want the index to be as memory efficient and fast at searching as possible. We want using it to have a minimal overhead.
We want you to be able to work with a reasonable dataset (1M-10M docs) relatively efficiently.
Experimentation, reranking, functionality over scalability
Instead of building for 'big data', our goal is to build for small data. That is, we focus on the capabilities and expressiveness of Pandas over limiting functionality in favor of scalability.
To this end, the applications of searcharray will tend to be focused on experimentation and top N candidate reranking. For experimentation, we want any ideas expressed in Pandas to have a somewhat clear path / "contract" in how they'd be implemented in a classical lexical search engine. For reranking, we want to load some top N results from a base system and be able to modify them.
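The reranking use case can be sketched in plain Python. Here `base_results` and `lexical_scores` are hypothetical stand-ins for a base system's top N and a SearchArray BM25 score per doc; the blending function and `alpha` weight are illustrative assumptions, not part of SearchArray's API:

```python
def rerank(base_results, lexical_scores, alpha=0.7):
    """Blend base-retriever scores with lexical (e.g. BM25) scores.

    base_results: list of (doc_id, base_score) from an upstream system
    lexical_scores: dict mapping doc_id -> lexical score
    Returns doc ids ordered by blended score, best first.
    """
    blended = [
        (doc_id, alpha * base + (1 - alpha) * lexical_scores.get(doc_id, 0.0))
        for doc_id, base in base_results
    ]
    blended.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in blended]

top_n = [("doc1", 0.9), ("doc2", 0.8), ("doc3", 0.1)]
bm25 = {"doc2": 12.0, "doc3": 3.0}
reranked = rerank(top_n, bm25, alpha=0.5)
```

The same blend is a one-liner on a dataframe of top-N candidates, which keeps the experiment expressible as ordinary Pandas column arithmetic.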
Make lexical search not a special snowflake in the ML stack
We know hybrid search techniques dominate in search systems. Yet lexical search is often cast as a giant, weird, big-data search engine that looks odd to most data scientists when joined with a vector database. We want lexical search to be more approachable to data scientists and ML engineers building these systems.
Non-goals
You need to bring your own tokenization
Currently tokenization (i.e., text analysis) is out of scope. There are plenty of Python libraries that do this really well, even exceeding what Lucene can do.
In SearchArray, a tokenizer is a function that takes a string and emits a series of tokens. For example, dumb, default whitespace tokenization:
def ws_tokenizer(string):
return string.split()
And you can pass any tokenizer that matches this signature to index:
def ws_lowercase_tokenizer(string):
return string.lower().split()
df['title_indexed'] = PostingsArray.index(df['title'], tokenizer=ws_lowercase_tokenizer)
Create your own using stemming libraries, or whatever Python functionality you want.
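For instance, a tokenizer that lowercases, strips punctuation, and applies a toy suffix-stripping "stemmer" might look like the sketch below (the names `simple_stem` and `stemming_tokenizer` are illustrative; a real project would likely reach for a proper stemming library such as Snowball):

```python
import string

def simple_stem(token):
    # Toy stemmer: strip a few common English suffixes
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def stemming_tokenizer(text):
    # Lowercase, drop punctuation, split on whitespace, then stem
    table = str.maketrans("", "", string.punctuation)
    return [simple_stem(tok) for tok in text.lower().translate(table).split()]

tokens = stemming_tokenizer("The cats were running!")
```

Any callable with this string-in, tokens-out signature can be passed as `tokenizer=` when indexing.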
Use Pandas instead of function queries
Solr has its own unique function query syntax (https://solr.apache.org/guide/7_7/function-queries.html). Elasticsearch has Painless.
Instead of recreating these, simply use Pandas on existing Pandas columns. Then later, if you need to implement this in Solr or Elasticsearch, attempt to recreate the functionality there. Arguably, what's possible in Solr / ES is a subset of what you can do in Pandas.
# Calculate the number of hours into the past
df['hrs_into_past'] = (now - df['timestamp']).dt.total_seconds() / 3600
Then multiply by BM25 if you want:
df['score'] = df['title_indexed'].bm25('Cat') * df['hrs_into_past']
TODOs / Future Work / Known issues
- Always more efficient
- Support tokenizers with overlapping positions (i.e., synonyms, etc.)
- Add support for loading global term stats (i.e., doc freq) from external sources for more accurate representation
- Add minimum should match to each function
- Dumb vector search? Guessing other tools do this at small scale well enough.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file searcharray-0.0.12.tar.gz
File metadata
- Download URL: searcharray-0.0.12.tar.gz
- Upload date:
- Size: 26.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 07f3d5653be6225e32f8d8f58f8a94e19cb87eeb3287b25689651fde6e407517
MD5 | f26d386125e5e6807e22fc209dbcc645
BLAKE2b-256 | ebad8f15080bebeaffaa4a69aec7cf2c118fcc6693db9a86c255f7a866dc1cfa
File details
Details for the file searcharray-0.0.12-py3-none-any.whl
File metadata
- Download URL: searcharray-0.0.12-py3-none-any.whl
- Upload date:
- Size: 15.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 716af84a1ac1a7b513ec9f091583d7f192d6d30f7ef7685dbc2ab03e99ac6dea
MD5 | 4ee881708384cabcff4ac3ba325c76df
BLAKE2b-256 | 1a439e6290d92974088fb567b64f11ce1ebfe9edf6dc01e19b64552b5dcd0912