An Elasticsearch wrapper for querying ES indices

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Useful functions wrapping around Elasticsearch

Connect to the server with a read-only account

Get access to the indices

Dolma index: https://forms.gle/gQN4nP4HHYGwXAis9
Other indices: https://forms.gle/yMz7uTFhd1dKNYTk7

from wimbd.es import es_init
es = es_init()

Find out which indices exist (with other information about the index)

from wimbd.es import get_indices

# This returns all indices, along with their total document counts.
print(get_indices())

# This also returns elasticsearch mapping information.
print(get_indices(return_mapping=True))

Note that get_indices does not work with the access key we provide, because that key restricts access to specific indices. The names of the relevant indices are listed below.

At the moment, this will return the following indices:

{'re_pile': {'docs.count': '211036967'},
 're_laion2b_multi': {'docs.count': '2248498161'},
 'openwebtext': {'docs.count': '8013769'},
 're_laion2b-en-1': {'docs.count': '1161075864'},
 're_laion2b-en-2': {'docs.count': '1161076588'},
 'c4': {'docs.count': '1074273501'},
 're_laion2b_nolang': {'docs.count': '1271703630'},
 're_oscar': {'docs.count': '431992659'}}
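
The `docs.count` values in the listing above are strings, so they need converting before any comparison or arithmetic. A minimal sketch, using the values copied from the listing; the variable names are illustrative and not part of wimbd:

```python
# Listing copied from above; note that every `docs.count` value is a string.
indices = {
    "re_pile": {"docs.count": "211036967"},
    "re_laion2b_multi": {"docs.count": "2248498161"},
    "openwebtext": {"docs.count": "8013769"},
    "re_laion2b-en-1": {"docs.count": "1161075864"},
    "re_laion2b-en-2": {"docs.count": "1161076588"},
    "c4": {"docs.count": "1074273501"},
    "re_laion2b_nolang": {"docs.count": "1271703630"},
    "re_oscar": {"docs.count": "431992659"},
}

# Convert the string counts to integers so they sort numerically.
doc_counts = {name: int(info["docs.count"]) for name, info in indices.items()}

# Order index names from largest to smallest.
by_size = sorted(doc_counts, key=doc_counts.get, reverse=True)

for name in by_size:
    print(f"{name:20s} {doc_counts[name]:>15,d}")
```

Sorting the raw strings instead would compare them lexicographically, which gives a misleading order when the counts have different numbers of digits.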

Different Indices

We have three different indices that we can make publicly available. Each contains different corpora:

- The Pile, OpenWebText, C4 and Oscar (re_pile, openwebtext, c4, and re_oscar)
- RedPajamav1 (redpajama-split)
- Dolma (docs_v1.5_2023-11-02)

Indices Mapping

{
    'mappings': {
        'dynamic': 'false',
        'properties': {
            'date': {
                'type': 'date'
            },
            'subset': {
                'type': 'keyword', 
                'ignore_above': 256
            },
            'text': {
                'type': 'text'
            },
            'url': {
                'type': 'text'
            }
        }
    }
}
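
The mapping hints at how each field can be queried: `text` and `url` are analyzed full-text fields, `subset` is a keyword field that matches exactly, and `date` supports range filters. The sketch below builds a raw Elasticsearch query body as a plain dictionary. The helper name `build_filtered_phrase_query` and the example subset value are hypothetical, and this is not necessarily how wimbd constructs its queries internally.

```python
import json


def build_filtered_phrase_query(phrase, subset=None, date_from=None):
    """Build an Elasticsearch query body (hypothetical helper, not part of wimbd)."""
    bool_query = {
        # `text` is an analyzed text field, so a phrase match applies.
        "must": [{"match_phrase": {"text": phrase}}],
        "filter": [],
    }
    if subset is not None:
        # `subset` is a keyword field: exact, unanalyzed match.
        bool_query["filter"].append({"term": {"subset": subset}})
    if date_from is not None:
        # `date` is a date field, so range filters apply.
        bool_query["filter"].append({"range": {"date": {"gte": date_from}}})
    return {"query": {"bool": bool_query}}


# "example-subset" is a placeholder value, not a real subset name.
body = build_filtered_phrase_query(
    "terms of use", subset="example-subset", date_from="2020-01-01"
)
print(json.dumps(body, indent=2))
```

Because `dynamic` is set to `false`, fields outside this mapping are not indexed, so queries against any other field name would find nothing.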

Search over one index

Search for one or more terms, or sequences of terms (phrases). When you search for a sequence of terms, their exact order is matched.

from wimbd.es import count_documents_containing_phrases

# Count the number of documents containing the term "legal".
count_documents_containing_phrases("test-index", "legal")  # single term

# Count the number of documents containing the term "legal" OR the term "license".
count_documents_containing_phrases("test-index", ["legal", "license"])  # list of terms

# Count the number of documents containing the phrase "terms of use" OR "legally binding".
count_documents_containing_phrases("test-index", ["terms of use", "legally binding"])  # list of word sequences

# Count the number of documents containing both `winter` AND `spring` in the text.
count_documents_containing_phrases("test-index", ["winter", "spring"], all_phrases=True)
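
To make the OR/AND semantics concrete, here is a toy in-memory version of the same counting logic. It is only a sketch: it splits on whitespace and lowercases, whereas Elasticsearch applies its own analyzer, and the names `contains_phrase` and `count_docs` are illustrative rather than part of wimbd.

```python
def contains_phrase(text, phrase):
    """Return True if the tokens of `phrase` appear consecutively, in order, in `text`."""
    tokens = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


def count_docs(docs, phrases, all_phrases=False):
    """Count documents matching ANY phrase (default) or ALL phrases."""
    if isinstance(phrases, str):
        phrases = [phrases]
    combine = all if all_phrases else any
    return sum(combine(contains_phrase(doc, p) for p in phrases) for doc in docs)


docs = [
    "these terms of use are legally binding",
    "use of terms is not a phrase match",
    "winter turns into spring",
    "a long winter",
]

print(count_docs(docs, "terms of use"))                            # exact word order required
print(count_docs(docs, ["winter", "spring"]))                      # winter OR spring
print(count_docs(docs, ["winter", "spring"], all_phrases=True))    # winter AND spring
```

The second document contains all three words of "terms of use" but in a different order, so it does not count as a phrase match.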

If you want to actually inspect the documents, you can use get_documents_containing_phrases with the same queries as above instead.

from wimbd.es import get_documents_containing_phrases

# Get documents containing the term "legal".
get_documents_containing_phrases("test-index", "legal")  # single term

# Specify the number of documents to return using `num_documents`. Default is 10.
# Get documents containing the term "legal" OR the term "license".
get_documents_containing_phrases("test-index", ["legal", "license"], num_documents=50)  # list of terms

# Get documents containing the phrase "terms of use" OR "legally binding".
get_documents_containing_phrases("test-index", ["terms of use", "legally binding"])  # list of word sequences

# Get documents containing both `winter` AND `spring` in the text.
get_documents_containing_phrases("test-index", ["winter", "spring"], all_phrases=True)

Get total number of a term's occurrences (as opposed to document counts)

from wimbd.es import count_total_occurrences_of_unigrams

count_total_occurrences_of_unigrams("test-index", ["legal", "license"])
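
The difference between a document count and a total occurrence count is easy to see on a tiny in-memory corpus. This is a toy sketch with whitespace tokenization, not the way wimbd or Elasticsearch computes these numbers; the function names are illustrative.

```python
from collections import Counter

docs = [
    "legal notice legal terms apply",
    "license and legal text",
    "no match here",
]


def document_count(docs, term):
    """Number of documents in which `term` appears at least once."""
    return sum(term in doc.lower().split() for doc in docs)


def total_occurrences(docs, term):
    """Number of times `term` appears across all documents."""
    return sum(Counter(doc.lower().split())[term] for doc in docs)


for term in ["legal", "license"]:
    print(term, document_count(docs, term), total_occurrences(docs, term))
```

Here "legal" appears in two documents but three times overall, because the first document contains it twice.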

Search over multiple indices

Because LAION has more documents than can fit into a single Elasticsearch index, it is split across multiple indices. Fortunately, you can query more than one index at a time by using a wildcard in the index name.

from wimbd.es import count_documents_containing_phrases

count_documents_containing_phrases("re_laion2b-en-*", "the woman")
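
To see which indices a pattern such as `re_laion2b-en-*` would cover, you can check it against the index names listed earlier. The sketch below uses Python's glob matching as a stand-in, on the assumption that it behaves like Elasticsearch's index wildcard for a simple trailing `*`; the document counts are copied from the listing above.

```python
from fnmatch import fnmatchcase

# Index names and document counts copied from the listing above.
doc_counts = {
    "re_pile": 211036967,
    "re_laion2b_multi": 2248498161,
    "openwebtext": 8013769,
    "re_laion2b-en-1": 1161075864,
    "re_laion2b-en-2": 1161076588,
    "c4": 1074273501,
    "re_laion2b_nolang": 1271703630,
    "re_oscar": 431992659,
}

pattern = "re_laion2b-en-*"

# Case-sensitive glob match of each index name against the pattern.
matched = [name for name in doc_counts if fnmatchcase(name, pattern)]
combined_docs = sum(doc_counts[name] for name in matched)

print(matched)
print(combined_docs)
```

Note that `re_laion2b_multi` and `re_laion2b_nolang` are not matched, since they use an underscore where the pattern has `-en-`.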

Release history

- 0.1.1 (this version): Aug 31, 2024
- 0.1.0: Aug 29, 2024

Download files

Source Distribution

wimbd-0.1.1.tar.gz (14.8 kB)

Uploaded Aug 31, 2024 Source

Built Distribution

wimbd-0.1.1-py3-none-any.whl (12.4 kB)

Uploaded Aug 31, 2024 Python 3

File details

Details for the file wimbd-0.1.1.tar.gz.

File metadata

Download URL: wimbd-0.1.1.tar.gz
Upload date: Aug 31, 2024
Size: 14.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.9.13

File hashes

Hashes for wimbd-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`d97abf7437745e2e563d712cd1af10e3a4c434a368f3d74f952e00dc650fff68`
MD5	`75b2915f0a461bad9c3061e6f247cea8`
BLAKE2b-256	`c81244e0f1aae9bfcaeb6b33bd5d04f8713d488abea722b39516c394a6d652f0`

File details

Details for the file wimbd-0.1.1-py3-none-any.whl.

File metadata

Download URL: wimbd-0.1.1-py3-none-any.whl
Upload date: Aug 31, 2024
Size: 12.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.9.13

File hashes

Hashes for wimbd-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`da6dd6ecb8a6580238e8e9dbb13ccde97d519148bd742c2c7af10b33ebb0dbed`
MD5	`c861f807fa75bc68e3fe9b4cf2a6045d`
BLAKE2b-256	`390327b5fd50b953b2978699ff100048c61802d73cfb55c4cdd9371640e41a2b`
