
cdp-data


Data Utilities and Processing Generalized for All CDP Instances


Keywords over time in Seattle, Portland, and Oakland

Installation

Stable Release: pip install cdp-data
Development Head: pip install git+https://github.com/CouncilDataProject/cdp-data.git

Documentation

For full package documentation please visit councildataproject.github.io/cdp-data.

Quickstart

Pulling Datasets

Install basics: pip install cdp-data

Transcripts and Session Data

from cdp_data import CDPInstances, datasets

ds = datasets.get_session_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    start_datetime="2021-01-01",
    store_transcript=True,
)
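
The returned dataset is a pandas DataFrame, so ordinary pandas operations apply. As a minimal sketch of working with it (the column names below, such as session_datetime, are illustrative assumptions, not the package's actual schema):

```python
import pandas as pd

# Illustrative only: a small frame shaped like a session dataset.
# The real columns come from the CDP database models; "session_datetime"
# and "session_index" here are assumptions for the sketch.
sessions = pd.DataFrame(
    {
        "session_datetime": pd.to_datetime(
            ["2021-01-04", "2021-02-01", "2021-02-15"]
        ),
        "session_index": [0, 0, 1],
    }
)

# Count sessions per calendar month.
per_month = sessions.groupby(
    sessions["session_datetime"].dt.to_period("M")
).size()
print(per_month)
```
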
Transcript Schema and Usage

It may be useful to look at our transcript model documentation.

Transcripts can be read into memory and processed as an object:

from cdp_backend.pipeline.transcript_model import Transcript

# Read the file as a Transcript object
with open("transcript.json", "r") as open_f:
    transcript = Transcript.from_json(open_f.read())

# Navigate the object
for sentence in transcript.sentences:
    if "clerk" in sentence.text.lower():
        print(f"{sentence.index}, {sentence.start_time}: '{sentence.text}'")

If you do not want to do this processing in Python or prefer to work with a DataFrame, you can convert transcripts to DataFrames like so:

from cdp_data import datasets

# assume that transcript is the same transcript as the prior code snippet
sentences = datasets.convert_transcript_to_dataframe(transcript)

You can also do this conversion (and storage of the converted transcript) for all transcripts in a session dataset during dataset construction with the store_transcript_as_csv parameter.

from cdp_data import CDPInstances, datasets

ds = datasets.get_session_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    start_datetime="2021-01-01",
    store_transcript=True,
    store_transcript_as_csv=True,
)

This will store the transcript for each session as both JSON and CSV.
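
Once stored as CSV, a transcript can be read back with plain pandas. The sketch below assumes a file with columns mirroring the sentence fields used above (index, start_time, text); the actual column set is defined by the transcript model:

```python
import io

import pandas as pd

# Illustrative only: a stand-in for a stored transcript CSV. The real
# column set comes from the transcript model; these three are assumed.
csv_text = """index,start_time,text
0,0.0,Call to order.
1,4.2,The clerk will call the roll.
2,9.8,Thank you.
"""

sentences = pd.read_csv(io.StringIO(csv_text))

# Same filter as the object-based loop above, now vectorized.
clerk_rows = sentences[sentences["text"].str.lower().str.contains("clerk")]
print(clerk_rows[["index", "start_time"]])
```
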

Voting Data

from cdp_data import CDPInstances, datasets

ds = datasets.get_vote_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    start_datetime="2021-01-01",
)
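
The vote dataset is likewise a pandas DataFrame. As a hedged sketch of tallying decisions per matter, using assumed column names (matter, decision) rather than the package's actual schema:

```python
import pandas as pd

# Illustrative only: "matter" and "decision" are assumed column names;
# the actual vote dataset schema is defined by the cdp-backend models.
votes = pd.DataFrame(
    {
        "matter": ["CB 119000"] * 3 + ["CB 119001"] * 2,
        "decision": ["Approve", "Approve", "Reject", "Approve", "Approve"],
    }
)

# Tally decisions per matter.
tally = votes.groupby(["matter", "decision"]).size().unstack(fill_value=0)
print(tally)
```
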

Data Definitions and Schema

Please refer to our database schema and our database model definitions for more information on how CDP-generated and archived data is structured.

Saving Datasets

Because we rely heavily on our database models for database interaction, we default in many cases to returning the full fireo.models.Model object as column values.

These objects cannot be written to disk directly, so we provide a helper that replaces all model objects with their database IDs before storage.

This can be done directly if you already have a dataset you have been working with:

from cdp_data import datasets

# data should be a pandas dataframe
datasets.save_dataset(data, "data.csv")

Or this can be done preemptively during dataset construction:

from cdp_data import CDPInstances, datasets

# both get_session_dataset and get_vote_dataset
# have a `replace_py_objects` parameter
sessions = datasets.get_session_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    replace_py_objects=True,
)

votes = datasets.get_vote_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    replace_py_objects=True,
)
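
Conceptually, the replacement can be sketched as swapping each model object in a column for its document ID before writing. The FakeModel class below is a stand-in for a fireo model, not the real implementation:

```python
import pandas as pd

# Illustrative stand-in for a fireo model: real models carry a document
# id attribute. Everything here is an assumption made for the sketch.
class FakeModel:
    def __init__(self, id_: str) -> None:
        self.id = id_

df = pd.DataFrame(
    {
        "session": ["a", "b"],
        "event_ref": [FakeModel("event-123"), FakeModel("event-456")],
    }
)

# Replace model objects with their ids so the frame can be serialized.
df["event_ref"] = df["event_ref"].map(
    lambda m: m.id if isinstance(m, FakeModel) else m
)
csv_text = df.to_csv(index=False)
print(csv_text)
```
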

Plotting and Analysis

Install plotting support: pip install cdp-data[plot]

Ngram Usage over Time

from cdp_data import CDPInstances, keywords, plots

ngram_usage = keywords.compute_ngram_usage_history(
    CDPInstances.Seattle,
    start_datetime="2022-03-01",
    end_datetime="2022-10-01",
)
grid = plots.plot_ngram_usage_histories(
    ["police", "housing", "transportation"],
    ngram_usage,
    lmplot_kws=dict(  # extra plotting params
        col="ngram",
        hue="ngram",
        scatter_kws={"alpha": 0.2},
        aspect=1.6,
    ),
)
grid.savefig("seattle-keywords-over-time.png")

Seattle keyword usage over time

Development

See CONTRIBUTING.md for information related to developing the code.

MIT license
