cpr-sdk

Internal library for persistent access to text data.

Warning: This library is heavily under construction and doesn't work with any of our open data yet. We're working on making it usable for anyone.

Documents and Datasets

The base document model of this library is BaseDocument, which contains only the metadata fields that are used in the parser.

Loading from Huggingface Hub (recommended)

The Dataset class is automatically configured with the Huggingface repos we use. You can optionally provide a document limit, a dataset version, and override the repo that the data is loaded from.

If the repository is private you must provide a user access token, either in your environment as HUGGINGFACE_TOKEN, or as an argument to from_huggingface.

from cpr_sdk.models import Dataset, GSTDocument

dataset = Dataset(GSTDocument).from_huggingface(
    version="d8363af072d7e0f87ec281dd5084fb3d3f4583a9", # commit hash, optional
    limit=1000,
    token="my-huggingface-token", # required for private repos if not in env
)
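The token lookup described above can be sketched with the standard library alone. This is an illustration, not the library's implementation: resolve_token is a hypothetical helper, and the precedence (explicit argument first, then the HUGGINGFACE_TOKEN environment variable) is an assumption.

```python
import os
from typing import Optional


def resolve_token(token: Optional[str] = None) -> Optional[str]:
    """Return the explicit token if given, else fall back to the environment.

    Hypothetical helper: the argument-over-environment precedence is assumed.
    """
    if token:
        return token
    # HUGGINGFACE_TOKEN is the variable named in the text above.
    # A result of None would mean no token is available (public repos only).
    return os.environ.get("HUGGINGFACE_TOKEN")
```

With a helper like this, calling code could pass the result as the token argument without caring where the token came from.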

The passage_level_and_flat flag is used when loading the passage-level flat dataset.

from cpr_sdk.models import BaseDocument, Dataset

dataset = Dataset(
    document_model=BaseDocument
).from_huggingface(
    dataset_name="ClimatePolicyRadar/passage-level-flat-dataset",
    passage_level_and_flat=True,
)

Loading from local storage or s3

from cpr_sdk.models import BaseDocument

# document_id is also the filename stem
document = BaseDocument.load_from_local(folder_path="path/to/data/", document_id="document_1234")

document = BaseDocument.load_from_remote(dataset_key="s3://cpr-data", document_id="document_1234")

To manage metadata, documents need to be loaded into a Dataset object.

from cpr_sdk.models import BaseDocument, CPRDocument, Dataset

dataset = Dataset().load_from_local("path/to/data", limit=1000)
assert all([isinstance(document, BaseDocument) for document in dataset])

dataset_with_metadata = dataset.add_metadata(
    target_model=CPRDocument,
    metadata_csv="path/to/metadata.csv",
)

assert all([isinstance(document, CPRDocument) for document in dataset_with_metadata])

Datasets have a number of methods for filtering and accessing documents.

len(dataset)
>>> 1000

dataset[0]
>>> CPRDocument(...)

# Filtering
dataset.filter("document_id", "1234")
>>> Dataset()

dataset.filter_by_language("en")
>>> Dataset()

# Filtering using a function
dataset.filter("document_id", lambda x: x in ["1234", "5678"])
>>> Dataset()
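To make the two filtering styles concrete, here is a minimal sketch of how filtering by a literal value or by a function might behave. ToyDocument and filter_documents are hypothetical stand-ins written for this example; they are not part of cpr_sdk and say nothing about how Dataset.filter is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Union


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document with two attributes."""

    document_id: str
    language: str


def filter_documents(
    documents: List[ToyDocument],
    attribute: str,
    value: Union[Any, Callable[[Any], bool]],
) -> List[ToyDocument]:
    """Keep documents whose attribute equals `value`, or satisfies it if callable."""
    if callable(value):
        return [d for d in documents if value(getattr(d, attribute))]
    return [d for d in documents if getattr(d, attribute) == value]


docs = [
    ToyDocument("1234", "en"),
    ToyDocument("5678", "fr"),
    ToyDocument("9012", "en"),
]

# Literal value: exact match on the attribute.
by_id = filter_documents(docs, "document_id", "1234")

# Function: keep documents for which the function returns True.
by_function = filter_documents(docs, "document_id", lambda x: x in ["1234", "5678"])
```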

Search

This library can also be used to run searches against CPR documents and passages in Vespa.

from cpr_sdk.search_adaptors import VespaSearchAdapter
from cpr_sdk.models.search import SearchParameters

adaptor = VespaSearchAdapter(instance_url="YOUR_INSTANCE_URL")

request = SearchParameters(query_string="forest fires")

response = adaptor.search(request)

The above example returns a SearchResponse object, which lists some basic information about the request along with the results. The results are arranged as a list of Families, each of which contains relevant Documents and/or Passages.

Sorting

By default, results are sorted by relevance, but they can also be sorted by date or name, e.g.:

request = SearchParameters(
    query_string="forest fires",
    sort_by="date",
    sort_order="descending",
)

Filters

Matching documents can also be filtered by keyword field and by publication date:

request = SearchParameters(
    query_string="forest fires",
    filters={
        "language": ["English", "French"],
        "category": ["Executive"],
    },
    year_range=(2010, 2020)
)
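As a rough mental model of the request above, the sketch below applies keyword filters and a year range to plain dictionaries. The semantics are assumptions made for illustration: any listed value matches within a field, every field must match, and the year bounds are inclusive. The matches helper and the record layout are hypothetical and unrelated to how Vespa evaluates the query.

```python
from typing import Any, Dict, List, Optional, Tuple


def matches(
    record: Dict[str, Any],
    filters: Dict[str, List[str]],
    year_range: Optional[Tuple[int, int]] = None,
) -> bool:
    """Return True if the record passes every keyword filter and the year range."""
    for field_name, allowed_values in filters.items():
        # Assumed: a field matches if it holds any one of the listed values.
        if record.get(field_name) not in allowed_values:
            return False
    if year_range is not None:
        start, end = year_range
        # Assumed: both bounds are inclusive.
        if not (start <= record["year"] <= end):
            return False
    return True


records = [
    {"id": "doc-1", "language": "English", "category": "Executive", "year": 2015},
    {"id": "doc-2", "language": "French", "category": "Legislative", "year": 2012},
    {"id": "doc-3", "language": "English", "category": "Executive", "year": 2022},
]

selected = [
    r["id"]
    for r in records
    if matches(
        r,
        filters={"language": ["English", "French"], "category": ["Executive"]},
        year_range=(2010, 2020),
    )
]
```

Here only doc-1 is selected: doc-2 fails the category filter and doc-3 falls outside the year range.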

Search within families or documents

Search can be restricted to a subset of families or documents by passing their IDs:

request = SearchParameters(
    query_string="forest fires",
    family_ids=["CCLW.family.10121.0", "CCLW.family.4980.0"],
)

request = SearchParameters(
    query_string="forest fires",
    document_ids=["CCLW.executive.10121.4637", "CCLW.legislative.4980.1745"],
)

Types of query

The default search approach uses a nearest neighbour search ranking.

It's also possible to search for exact matches instead:

request = SearchParameters(
    query_string="forest fires",
    exact_match=True,
)

Or to ignore the query string and search the whole database instead:

request = SearchParameters(
    year_range=(2020, 2024),
    sort_by="date",
    sort_order="descending",
)

Continuing results

The response objects include continuation tokens, which can be used to get more results.

For the next selection of families:

response = adaptor.search(SearchParameters(query_string="forest fires"))

follow_up_request = SearchParameters(
    query_string="forest fires",
    continuation_tokens=[response.continuation_token],
)
follow_up_response = adaptor.search(follow_up_request)

It is also possible to get more hits within a family by using the continuation token on the family object, rather than the one at the response's root.

Note that this_continuation_token marks the current continuation of the families, so getting more passages for a family after getting more families would look like this:

follow_up_response = adaptor.search(follow_up_request)

# Token marking the current page of families
this_token = follow_up_response.this_continuation_token
# Token for fetching more passages within the first family
passage_token = follow_up_response.families[0].continuation_token

follow_up_request = SearchParameters(
    query_string="forest fires",
    continuation_tokens=[this_token, passage_token],
)
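The family-level continuation pattern generalises to a loop. The sketch below uses a stubbed search function so that it runs without a Vespa instance. ToyResponse, fake_search and iterate_families are hypothetical names, and the assumption that the token is None on the last page is ours, not something stated above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class ToyResponse:
    """Hypothetical stand-in for a search response."""

    families: List[str]
    continuation_token: Optional[str] = None


def iterate_families(
    search: Callable[[Optional[str]], ToyResponse],
    max_pages: int = 10,
) -> Iterator[str]:
    """Yield families page by page, following continuation tokens."""
    token: Optional[str] = None
    for _ in range(max_pages):  # hard cap so a bad token cannot loop forever
        response = search(token)
        yield from response.families
        token = response.continuation_token
        if token is None:  # assumed signal that there are no more pages
            break


# Stubbed backend: maps the incoming token to the page it should return.
PAGES: Dict[Optional[str], ToyResponse] = {
    None: ToyResponse(["family-1", "family-2"], continuation_token="AAA"),
    "AAA": ToyResponse(["family-3"], continuation_token=None),
}


def fake_search(token: Optional[str]) -> ToyResponse:
    return PAGES[token]


all_families = list(iterate_families(fake_search))
```

In real use, the search callable would build a SearchParameters with continuation_tokens set and pass it to the adaptor, as in the examples above.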

Get a specific document

Users can also fetch single documents directly from Vespa by document ID:

adaptor.get_by_id(document_id="id:YOUR_NAMESPACE:YOUR_SCHEMA_NAME::SOME_DOCUMENT_ID")

All of the above search functionality assumes that a valid set of Vespa credentials is available in ~/.vespa, or in a directory supplied directly to the VespaSearchAdapter constructor. See the docs for more information on how Vespa expects credentials.

Test setup

Some tests rely on a locally running instance of Vespa.

This requires the Vespa CLI to be installed.

Setup can then be run with:

poetry install --all-extras --with dev
poetry shell
make vespa_dev_setup
make test

Alternatively, to only run non-vespa tests:

make test_not_vespa

For clean up:

make vespa_dev_down

Filtering for concept counts

Through SearchParameters and a build clause in the YqlBuilder class, cpr_sdk supports complex queries on the aggregated concept counts that are held in the family index.

These counts are the total number of matches for a concept in a family document. For example, concept Q123 (say, forestry) may have 100 matches because it is mentioned in the text 100 times.

So what queries can we perform?

An extensive set of tests has been written for the concept count filters. These demonstrate the full capabilities of the filtering functionality: tests/test_search_adaptors.py:test_vespa_search_adaptor__concept_counts

This shows that we can:

- Filter for documents with a match for a concept.
- Filter for documents that don't have a match for a concept.
- Filter for documents with a match for a concept, with a specific count (e.g. > 10 matches).
- Filter for documents with a count of any concept (e.g. > 10 matches).
- Stack filters via an AND operator, e.g. 100 matches for Q123 AND 10 matches for Q456.
- Order results in ascending or descending order, such that documents with the most/least matches appear first in search.
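The capabilities listed above can be illustrated on toy data, independently of Vespa. In this sketch, CONCEPT_COUNTS and has_concept are hypothetical names, and the counts are invented for the example. It shows only the intended semantics, not the YQL that YqlBuilder generates.

```python
from typing import Dict

# Toy stand-in for the family index: family ID -> {concept ID: match count}.
CONCEPT_COUNTS: Dict[str, Dict[str, int]] = {
    "family-a": {"Q123": 100, "Q456": 12},
    "family-b": {"Q123": 3},
    "family-c": {"Q456": 40},
}


def has_concept(counts: Dict[str, int], concept_id: str, minimum: int = 1) -> bool:
    """True if the concept was matched at least `minimum` times."""
    return counts.get(concept_id, 0) >= minimum


# Families with any match for Q123, and families with none.
with_q123 = [f for f, c in CONCEPT_COUNTS.items() if has_concept(c, "Q123")]
without_q123 = [f for f, c in CONCEPT_COUNTS.items() if not has_concept(c, "Q123")]

# Count threshold on one concept: more than 10 matches for Q123.
many_q123 = [f for f, c in CONCEPT_COUNTS.items() if has_concept(c, "Q123", 11)]

# Stacked filters joined by AND.
stacked = [
    f
    for f, c in CONCEPT_COUNTS.items()
    if has_concept(c, "Q123", 100) and has_concept(c, "Q456", 10)
]

# Order families by total matches, most first.
ranked = sorted(
    CONCEPT_COUNTS, key=lambda f: sum(CONCEPT_COUNTS[f].values()), reverse=True
)
```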

Release Flow:

1. Make updates to the package.
2. Bump the package version in the cpr_sdk/version.py module.
3. Make a PR.
   - In CI/CD we will check that the version is greater than the latest release.
4. Merge.
5. Tag a release manually in GitHub with a version that matches the latest on main that you just merged.
   - In CI/CD we will check that the latest release matches the version defined in code.
6. Check that the new release appears on PyPI.
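The "version is greater than the latest release" check in the flow above can be sketched as a numeric comparison of dotted versions. This is an illustration under stated assumptions, not the project's CI script: parse_version and is_newer are hypothetical helpers, and only plain MAJOR.MINOR.PATCH strings (optionally prefixed with "v") are handled.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.14.0' or 'v1.14.0' into (1, 14, 0) for numeric comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_newer(candidate: str, latest_release: str) -> bool:
    """True if `candidate` is strictly greater than `latest_release`."""
    return parse_version(candidate) > parse_version(latest_release)
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.9.8" sorts after "1.10.0", whereas as tuples (1, 9, 8) is correctly less than (1, 10, 0).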
