Data Utilities and Processing Generalized for All CDP Instances
cdp-data
Installation
Stable Release: pip install cdp-data
Development Head: pip install git+https://github.com/CouncilDataProject/cdp-data.git
Documentation
For full package documentation please visit councildataproject.github.io/cdp-data.
Quickstart
Pulling Datasets
Install basics: pip install cdp-data
Transcripts and Session Data
from cdp_data import CDPInstances, datasets

ds = datasets.get_session_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    start_datetime="2021-01-01",
    store_transcript=True,
)
Transcript Schema and Usage
It may be useful to look at our transcript model documentation.
Transcripts can be read into memory and processed as an object:
from cdp_backend.pipeline.transcript_model import Transcript

# Read the file as a Transcript object
with open("transcript.json", "r") as open_f:
    transcript = Transcript.from_json(open_f.read())

# Navigate the object and print every sentence that mentions "clerk"
for sentence in transcript.sentences:
    if "clerk" in sentence.text.lower():
        print(f"{sentence.index}, {sentence.start_time}: '{sentence.text}'")
If you do not want to do this processing in Python or prefer to work with a DataFrame, you can convert transcripts to DataFrames like so:
from cdp_data import datasets
# assume that transcript is the same transcript as the prior code snippet
sentences = datasets.convert_transcript_to_dataframe(transcript)
You can also do this conversion (and store the converted transcript) for all transcripts in a session dataset during dataset construction with the store_transcript_as_csv parameter.
from cdp_data import CDPInstances, datasets

ds = datasets.get_session_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    start_datetime="2021-01-01",
    store_transcript=True,
    store_transcript_as_csv=True,
)
This will store the transcript for each session as both JSON and CSV.
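The keyword search shown earlier in this section can also be tried without any CDP packages installed. The following is a minimal sketch using only the standard library. It assumes sentence records that carry just the three fields used above (index, start_time, text); the SentenceRow class, the find_sentences helper, and the sample sentences are hypothetical, and real transcripts and converted DataFrames may carry other fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceRow:
    # Hypothetical stand-in that mirrors only the sentence fields used above
    index: int
    start_time: float
    text: str


def find_sentences(rows: List[SentenceRow], keyword: str) -> List[SentenceRow]:
    """Return the rows whose text contains keyword, ignoring case."""
    needle = keyword.lower()
    return [row for row in rows if needle in row.text.lower()]


# Made-up sample data, not taken from any real transcript
rows = [
    SentenceRow(0, 0.0, "Good morning, everyone."),
    SentenceRow(1, 4.2, "Will the Clerk please call the roll?"),
    SentenceRow(2, 9.8, "Thank you."),
]

matches = find_sentences(rows, "clerk")
for row in matches:
    print(f"{row.index}, {row.start_time}: '{row.text}'")
```

Lower-casing both the keyword and the sentence text keeps the match case-insensitive, as in the Transcript example.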
Voting Data
from cdp_data import CDPInstances, datasets

ds = datasets.get_vote_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    start_datetime="2021-01-01",
)
Data Definitions and Schema
Please refer to our database schema and our database model definitions for more information on how CDP generated and archived data is structured.
Saving Datasets
Because we heavily rely on our database models for database interaction, in many cases, we default to returning the full fireo.models.Model object as column values.
These objects cannot be immediately stored to disk so we provide a helper to replace all model objects with their database IDs for storage.
This can be done directly if you already have a dataset you have been working with:
from cdp_data import datasets

# data should be a pandas dataframe
datasets.save_dataset(data, "data.csv")
Or this can be done preemptively during dataset construction:
from cdp_data import CDPInstances, datasets

# both get_session_dataset and get_vote_dataset
# have a `replace_py_objects` parameter
sessions = datasets.get_session_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    replace_py_objects=True,
)
votes = datasets.get_vote_dataset(
    infrastructure_slug=CDPInstances.Seattle,
    replace_py_objects=True,
)
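To illustrate what replacing model objects with their database IDs means in practice, here is a minimal standard-library sketch. It is not the library's implementation: the FakeModel class, the replace_objects_with_ids helper, the column names, and the ID values are all hypothetical, and the sketch assumes only that each model object exposes an `id` attribute.

```python
import csv
import io
from typing import Any, Dict, List


class FakeModel:
    """Hypothetical stand-in for a database model object with an `id`."""

    def __init__(self, id: str) -> None:
        self.id = id


def replace_objects_with_ids(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of rows where any value exposing `.id` becomes that id."""
    return [
        {key: getattr(value, "id", value) for key, value in row.items()}
        for row in rows
    ]


# Made-up rows: one column holds objects, the other holds plain integers
rows = [
    {"session": FakeModel("abc123"), "session_index": 0},
    {"session": FakeModel("def456"), "session_index": 1},
]
flat_rows = replace_objects_with_ids(rows)

# Once the objects are replaced by plain strings, the rows serialize cleanly
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["session", "session_index"])
writer.writeheader()
writer.writerows(flat_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Values without an `id` attribute, such as the integers here, pass through unchanged, so only the object-valued columns are rewritten.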
Plotting and Analysis
Install plotting support: pip install cdp-data[plot]
Ngram Usage over Time
from cdp_data import CDPInstances, keywords, plots

ngram_usage = keywords.compute_ngram_usage_history(
    CDPInstances.Seattle,
    start_datetime="2022-03-01",
    end_datetime="2022-10-01",
)
grid = plots.plot_ngram_usage_histories(
    ["police", "housing", "transportation"],
    ngram_usage,
    lmplot_kws=dict(  # extra plotting params
        col="ngram",
        hue="ngram",
        scatter_kws={"alpha": 0.2},
        aspect=1.6,
    ),
)
grid.savefig("seattle-keywords-over-time.png")
Development
See CONTRIBUTING.md for information related to developing the code.
MIT license