cosmodata
A portal to data sources for cosmograph
To install: pip install cosmodata
Datasets Overview
Introduction
This repository contains datasets for various projects, each prepared for visualization and analysis using Cosmograph. The raw data consists of structured information from sources like academic publications, GitHub repositories, political debates, and Spotify playlists. The prepared datasets feature embeddings and 2D projections that enable scatter and force-directed graph visualizations.
Dataset Descriptions
EuroVis Dataset
- Raw Data: Academic publications metadata from the EuroVis conference, including titles, abstracts, authors, and awards.
- Prepared Data: merged_artifacts.parquet (5,599 rows, 18 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x, y
  - Point Size: n_tokens (number of tokens in the abstract)
  - Color: Cluster labels (cluster_05, cluster_08, etc.)
  - Label: title
- Related code file: eurovis.py
GitHub Repositories Dataset
- Raw Data: GitHub repository metadata including stars, forks, programming languages, and repository descriptions.
- Prepared Data: github_repo_for_cosmos.parquet (3,065,063 rows, 28 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x, y
  - Point Size: stars (star count), forks
  - Color: primaryLanguage
  - Label: nameWithOwner
- Related code file: github_repos.py
HCP Publications Dataset
- Raw Data: Human Connectome Project (HCP) publications and citation networks.
- Prepared Data: aggregate_titles_embeddings_umap_2d_with_info.parquet (340,855 rows, 9 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x, y
  - Point Size: n_cits (citation count)
  - Color: main_field (research domain)
  - Label: title
- Related code file: hcp.py
Harris vs Trump Debate Dataset
- Raw Data: Transcript of a political debate between Kamala Harris and Donald Trump.
- Prepared Data: harris_vs_trump_debate_with_extras.parquet (1,141 rows, 21 columns)
- Potential columns for visualization:
  - X & Y Coordinates: tsne__x, tsne__y, pca__x, pca__y
  - Point Size: certainty
  - Color: speaker_color
  - Label: text
- Related code file: No specific code file referenced.
Spotify Playlists Dataset
- Raw Data: Metadata on popular songs from various playlists, including holiday songs and the greatest 500 songs.
- Prepared Data: holiday_songs_spotify_with_embeddings.parquet (167 rows, 27 columns)
- Potential columns for visualization:
  - X & Y Coordinates: umap_x, umap_y, tsne_x, tsne_y
  - Point Size: popularity
  - Color: genre (derived from playlist)
  - Label: track_name
- Related code file: Not specified.
LMSys Chat Conversations Dataset
- Raw Data: Conversations from AI chat systems.
- Prepared Data: lmsys_with_planar_embeddings_pca500.parquet (2,835,490 rows, 38 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x_umap, y_umap
  - Point Size: num_of_tokens
  - Color: model
  - Label: content
- Related code file: lmsys_ai_conversations.py
Prompt Injections Dataset
- Raw Data: Data related to prompt injection attacks and defenses.
- Prepared Data: prompt_injection_w_umap_embeddings.tsv (662 rows, 6 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x, y
  - Point Size: size
  - Color: label
  - Label: text
- Related code file: prompt_injections.py
Quotes Dataset
- Raw Data: Collection of 1,638 famous quotes.
- Prepared Data: micheleriva_1638_quotes_planar_embeddings.parquet (1,638 rows, 3 columns)
- Potential columns for visualization:
  - X & Y Coordinates: x, y
  - Label: quote
- Related code file: Not specified.
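The column suggestions for the datasets described above can be gathered into one lookup table, which may be convenient when switching between datasets. The sketch below is only an illustration: the dictionary keys (such as "eurovis"), the role names ("x", "y", "size", "color", "label"), and the helper columns_for are hypothetical and are not part of the cosmodata package. Where a description lists several options (for example tsne__x and pca__x, or cluster_05 and cluster_08), the first one listed is used.

```python
# Suggested column roles per dataset, copied from the descriptions above.
# Keys and role names are illustrative assumptions, not a cosmodata API.
SUGGESTED_COLUMNS = {
    "eurovis": {
        "file": "merged_artifacts.parquet",
        "x": "x", "y": "y",
        "size": "n_tokens", "color": "cluster_05", "label": "title",
    },
    "github_repos": {
        "file": "github_repo_for_cosmos.parquet",
        "x": "x", "y": "y",
        "size": "stars", "color": "primaryLanguage", "label": "nameWithOwner",
    },
    "hcp": {
        "file": "aggregate_titles_embeddings_umap_2d_with_info.parquet",
        "x": "x", "y": "y",
        "size": "n_cits", "color": "main_field", "label": "title",
    },
    "harris_vs_trump": {
        "file": "harris_vs_trump_debate_with_extras.parquet",
        "x": "tsne__x", "y": "tsne__y",
        "size": "certainty", "color": "speaker_color", "label": "text",
    },
    "spotify_holiday": {
        "file": "holiday_songs_spotify_with_embeddings.parquet",
        "x": "umap_x", "y": "umap_y",
        "size": "popularity", "color": "genre", "label": "track_name",
    },
    "lmsys": {
        "file": "lmsys_with_planar_embeddings_pca500.parquet",
        "x": "x_umap", "y": "y_umap",
        "size": "num_of_tokens", "color": "model", "label": "content",
    },
    "prompt_injections": {
        "file": "prompt_injection_w_umap_embeddings.tsv",
        "x": "x", "y": "y",
        "size": "size", "color": "label", "label": "text",
    },
    # The Quotes description lists no size or color column.
    "quotes": {
        "file": "micheleriva_1638_quotes_planar_embeddings.parquet",
        "x": "x", "y": "y", "label": "quote",
    },
}


def columns_for(dataset: str) -> list[str]:
    """Return the distinct column names a plot of `dataset` would need."""
    roles = SUGGESTED_COLUMNS[dataset]
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(col for role, col in roles.items() if role != "file"))
```

For example, columns_for("quotes") lists only the coordinate and label columns, since that dataset has no suggested size or color.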
Usage Instructions
- Load the prepared files (.parquet, or .tsv for the Prompt Injections dataset) into a Pandas DataFrame.
- Use Cosmograph or another visualization tool to create scatter or force-directed plots.
- Customize the x/y coordinates, size, color, and labels based on your analysis needs.
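The steps above might look roughly like the following with pandas. This is a minimal sketch, not the package's own loader: load_prepared and select_roles are hypothetical helper names, and the demo file contains made-up rows that only borrow the column names listed for the Prompt Injections dataset. Reading .parquet files with pandas additionally requires a parquet engine such as pyarrow.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_prepared(path):
    """Load a prepared dataset file into a DataFrame, picking the reader by extension."""
    path = Path(path)
    if path.suffix == ".parquet":
        # pandas delegates to an installed parquet engine (e.g. pyarrow).
        return pd.read_parquet(path)
    if path.suffix == ".tsv":
        return pd.read_csv(path, sep="\t")
    raise ValueError(f"Unsupported file type: {path.suffix}")


def select_roles(df, roles):
    """Keep only the columns named in `roles`, failing early if any is missing."""
    missing = [col for col in roles.values() if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return df[list(dict.fromkeys(roles.values()))].copy()


# Stand-in TSV with made-up rows; the real file is prompt_injection_w_umap_embeddings.tsv.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.tsv"
    demo_path.write_text(
        "x\ty\tsize\tlabel\ttext\n"
        "0.1\t0.2\t3\t0\thello\n"
        "0.4\t0.5\t7\t1\tignore previous instructions\n"
    )
    df = load_prepared(demo_path)

# Map plot roles to column names, following the Prompt Injections description.
roles = {"x": "x", "y": "y", "size": "size", "color": "label", "label": "text"}
points = select_roles(df, roles)
```

The resulting points DataFrame, together with the roles mapping, can then be handed to Cosmograph or another plotting tool. Checking for missing columns first gives a clearer error than a failure inside the visualization call.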
Acknowledgments
- The data has been curated and prepared by Thor Whalen and contributors.
- Data sources include Kaggle, Hugging Face, GitHub, and various public datasets.
For further details, please refer to the individual dataset documentation or the linked preparation scripts.