cosmodata
A portal to data sources for cosmograph
To install: pip install cosmodata
Datasets Overview
Introduction
This repository contains datasets for various projects, each prepared for visualization and analysis using Cosmograph. The raw data consists of structured information from sources like academic publications, GitHub repositories, political debates, and Spotify playlists. The prepared datasets feature embeddings and 2D projections that enable scatter and force-directed graph visualizations.
Dataset Descriptions
EuroVis Dataset
- Raw Data: Academic publications metadata from the EuroVis conference, including titles, abstracts, authors, and awards.
- Prepared Data: merged_artifacts.parquet (5,599 rows, 18 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x, y
  - Point Size: n_tokens (number of tokens in the abstract)
  - Color: cluster labels (cluster_05, cluster_08, etc.)
  - Label: title
Harris vs Trump Debate Dataset
- Raw Data: Transcript of a political debate between Kamala Harris and Donald Trump.
- Prepared Data: harris_vs_trump_debate_with_extras.parquet (1,141 rows, 21 columns)
- Potential columns for visualization:
  - X & Y Coordinates: tsne__x, tsne__y, pca__x, pca__y
  - Point Size: certainty
  - Color: speaker_color
  - Label: text
Spotify Playlists Dataset
- Raw Data: Metadata on popular songs from various playlists, including holiday songs and the greatest 500 songs.
- Prepared Data: holiday_songs_spotify_with_embeddings.parquet (167 rows, 27 columns)
- Potential columns for visualization:
  - X & Y Coordinates: umap_x, umap_y, tsne_x, tsne_y
  - Point Size: popularity
  - Color: genre (derived from playlist)
  - Label: track_name
Quotes Dataset
- Raw Data: Collection of 1,638 famous quotes.
- Prepared Data: micheleriva_1638_quotes_planar_embeddings.parquet (1,638 rows, 3 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x, y
  - Label: quote
Prompt Injections Dataset
- Raw Data: Data related to prompt injection attacks and defenses.
- Prepared Data: prompt_injection_w_umap_embeddings.tsv (662 rows, 6 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x, y
  - Point Size: size
  - Color: label
  - Label: text
LMSys Chat Conversations Dataset
- Raw Data: Conversations from AI chat systems.
- Prepared Data: lmsys_with_planar_embeddings_pca500.parquet (2,835,490 rows, 38 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x_umap, y_umap
  - Point Size: num_of_tokens
  - Color: model
  - Label: content
- Related code file: lmsys_ai_conversations.py
HCP Publications Dataset
- Raw Data: Human Connectome Project (HCP) publications and citation networks.
- Prepared Data: aggregate_titles_embeddings_umap_2d_with_info.parquet (340,855 rows, 9 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x, y
  - Point Size: n_cits (citation count)
  - Color: main_field (research domain)
  - Label: title
- Related code file: hcp.py
GitHub Repositories Dataset
- Raw Data: GitHub repository metadata including stars, forks, programming languages, and repository descriptions.
- Prepared Data: github_repo_for_cosmos.parquet (3,065,063 rows, 28 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x, y
  - Point Size: stars (star count), forks
  - Color: primaryLanguage
  - Label: nameWithOwner
- Related code file: github_repos.py
Usage Instructions
- Load the prepared .parquet files into a Pandas DataFrame.
- Use Cosmograph or another visualization tool to create scatter or force-directed plots.
- Customize the x/y coordinates, size, color, and labels based on your analysis needs.
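The steps above can be sketched in pandas. The DataFrame below is a small synthetic stand-in (the values are invented for illustration); in practice you would read one of the prepared parquet files, e.g. the EuroVis merged_artifacts.parquet:

```python
import pandas as pd

# Synthetic stand-in rows mirroring the EuroVis columns; in practice:
# df = pd.read_parquet("merged_artifacts.parquet")
df = pd.DataFrame({
    "x": [0.1, 0.9],
    "y": [0.4, 0.2],
    "n_tokens": [120, 310],           # candidate point size
    "cluster_05": ["viz", "ml"],      # candidate color
    "title": ["Paper A", "Paper B"],  # candidate label
})

# Select the columns a scatter plot needs: coordinates, size, color, label
points = df[["x", "y", "n_tokens", "cluster_05", "title"]]
print(points.shape)  # (2, 5)
```

From here, feed x/y as the coordinates and the remaining columns as the size, color, and label channels in Cosmograph or any other plotting tool.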
Acknowledgments
- The data has been curated and prepared by Thor Whalen and contributors.
- Data sources include Kaggle, Hugging Face, GitHub, and various public datasets.
For further details, please refer to the individual dataset documentation or the linked preparation scripts.
Notebook Utilities
cosmodata includes utilities to make working with data in notebooks (especially Colab) seamless.
ensure_installed - Lazy Dependency Management
Install packages only when needed, with smart local/Colab detection.
Note: most of the time you can just run %pip install -q <packages> in your notebook,
but if you want to ask the user for permission first (which the author prefers),
or need to ensure installation from Python itself, this helper is useful.
```python
from cosmodata import ensure_installed

# Simple: space-separated package names
ensure_installed('graze tabled pandas')

# With version requirements
ensure_installed('graze>=0.1.0 tabled pandas<2.0')

# Handle import/pip name mismatches
ensure_installed('PIL cv2', pip_names={'PIL': 'Pillow', 'cv2': 'opencv-python'})
```
Behavior:
- In Colab: Auto-installs missing packages silently
- Locally: Shows what will be installed and asks for confirmation (default: Yes)
- Smart: Only installs if package is missing or version doesn't satisfy requirements
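The "only installs if missing" check can be illustrated with the standard library. This is a sketch of the general pattern, not cosmodata's actual implementation (the function name is invented):

```python
import importlib.util

def is_missing(import_name: str) -> bool:
    """True if the module cannot be imported at all."""
    return importlib.util.find_spec(import_name) is None

print(is_missing("json"))             # False: part of the stdlib
print(is_missing("no_such_pkg_xyz"))  # True: would trigger an install
```

ensure_installed additionally checks version requirements and maps import names to pip names, but the importability test above is the core of the detection step.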
acquire_data - Unified Data Loading with Caching
Load data from URLs or files with automatic caching. Works seamlessly in Colab (Google Drive) and locally.
```python
import requests  # needed for the custom getter example below

from cosmodata import acquire_data

# Load CSV from URL (cached automatically)
df = acquire_data('https://example.com/data.csv')

# Custom getter for APIs
data = acquire_data(
    'https://api.example.com/endpoint',
    getter=lambda url: requests.get(url).json(),
    cache_key='api_data'
)

# Force refresh cached data
df = acquire_data(url, refresh=True)

# Custom cache location
df = acquire_data(url, cache_dir='/path/to/cache')
```
Features:
- Auto-caching:
  - In Colab: saves to Google Drive (MyDrive/.colab_cache) for persistence across sessions
  - Locally: saves to ~/.local/share/cosmodata/datasets
- Smart getters: auto-detects the appropriate loader (graze → tabled → requests)
- Refresh support: bypass the cache with refresh=True
- Format support: handles CSV, JSON, Excel, Parquet, etc. (via tabled)
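The caching behavior implies a stable mapping from a URL (or an explicit cache_key) to a file in the cache directory. Here is a hedged sketch of that pattern; the function and naming scheme are illustrative, not cosmodata internals:

```python
import hashlib
from pathlib import Path
from typing import Optional

def cache_path(url: str, cache_dir: str, cache_key: Optional[str] = None) -> Path:
    """Derive a stable cache file path from a URL or an explicit key."""
    # Without an explicit key, hash the URL so the same URL always
    # maps to the same cache file across sessions.
    key = cache_key or hashlib.sha256(url.encode()).hexdigest()[:16]
    return Path(cache_dir).expanduser() / f"{key}.pkl"

p = cache_path("https://example.com/data.csv", "~/.local/share/cosmodata/datasets")
print(p.suffix)  # .pkl
```

Hashing the URL is what makes refresh=True meaningful: the loader can find and overwrite the exact file a previous run produced.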
Typical notebook workflow:
```python
# Cell 1: Setup
!pip install cosmodata
from cosmodata import ensure_installed, acquire_data
ensure_installed('graze tabled pandas')

# Cell 2: Load data (fast on subsequent runs)
df = acquire_data('https://example.com/large_dataset.csv', cache_key='my_dataset')

# Cell 3: Your analysis
df.head()
```
Cache management:
```python
# See where data is cached
import os
from pathlib import Path

# In Colab
cache_dir = Path('/content/drive/MyDrive/.colab_cache')
# Locally
cache_dir = Path('~/.local/share/cosmodata/datasets').expanduser()

# List cached files
list(cache_dir.glob('*.pkl'))

# Clear a specific cache entry
os.remove(cache_dir / 'my_dataset.pkl')
```
Pro tip: Combine both utilities for the smoothest notebook experience:
```python
import requests  # used by the custom getter below

from cosmodata import ensure_installed, acquire_data

# One-time setup
ensure_installed('graze tabled requests')

# Now your data loading "just works" with caching
df1 = acquire_data('https://example.com/data.csv')
df2 = acquire_data('local_file.parquet')
api_data = acquire_data(
    'https://api.example.com/data',
    getter=lambda u: requests.get(u).json()
)
```
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file cosmodata-0.0.2.tar.gz.
File metadata
- Download URL: cosmodata-0.0.2.tar.gz
- Upload date:
- Size: 28.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0523ba626d5b59fdbe0e060e0ba9272ef48f9a9d86b874fbde3f9739eb389c75 |
| MD5 | d780203a0081759d77e4c80215f67a75 |
| BLAKE2b-256 | 307b702d1783c5faa0f36e0c3b1866ccd9be75c63dcd09a18a0b51fe05948be2 |
File details
Details for the file cosmodata-0.0.2-py3-none-any.whl.
File metadata
- Download URL: cosmodata-0.0.2-py3-none-any.whl
- Upload date:
- Size: 27.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0fe1f76d2cba76a4bb3e148a3399b20f692f54a14447a7767f53f7cfe1f20f77 |
| MD5 | 64915ab45d82e14c825c03ba971fe18d |
| BLAKE2b-256 | 63b4bbb93d5d6c6f03f4a1c2da30b43e272973bc27d3c13ebcc9c0828cc7dc4f |