cosmodata

A portal to data sources for cosmograph

To install: pip install cosmodata

Datasets Overview

Introduction

This repository contains datasets for various projects, each prepared for visualization and analysis using Cosmograph. The raw data consists of structured information from sources like academic publications, GitHub repositories, political debates, and Spotify playlists. The prepared datasets feature embeddings and 2D projections that enable scatter and force-directed graph visualizations.

Dataset Descriptions

EuroVis Dataset

  • Raw Data: Academic publications metadata from the EuroVis conference, including titles, abstracts, authors, and awards.
  • Prepared Data: merged_artifacts.parquet (5,599 rows, 18 columns)
    • Potential columns for visualization:
      • X & Y Coordinates: x, y
      • Point Size: n_tokens (number of tokens in the abstract)
      • Color: Cluster labels (cluster_05, cluster_08, etc.)
      • Label: title
    • Related code file: eurovis.py
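
The description above names cluster_05 and cluster_08 and implies there are more cluster columns. As a minimal sketch, assuming every cluster column follows the same cluster_NN naming pattern (the full column list is not given here), the candidates for the color channel could be discovered like this. The helper name cluster_columns is hypothetical and not part of cosmodata.

```python
import re


def cluster_columns(columns):
    """Return columns named like cluster_05, sorted by their numeric suffix."""
    pattern = re.compile(r"^cluster_(\d+)$")
    found = []
    for name in columns:
        match = pattern.match(str(name))
        if match:
            found.append((int(match.group(1)), str(name)))
    return [name for _, name in sorted(found)]


# Illustrative column list: only these six names appear in the description
# above; the real file has 18 columns.
example_columns = ["x", "y", "n_tokens", "title", "cluster_08", "cluster_05"]
print(cluster_columns(example_columns))  # ['cluster_05', 'cluster_08']
```

With a loaded DataFrame, the same helper could be called on df.columns to list the available color options.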

GitHub Repositories Dataset

  • Raw Data: GitHub repository metadata including stars, forks, programming languages, and repository descriptions.
  • Prepared Data: github_repo_for_cosmos.parquet (3,065,063 rows, 28 columns)
    • Potential columns for visualization:
      • X & Y Coordinates: x, y
      • Point Size: stars (star count), forks
      • Color: primaryLanguage
      • Label: nameWithOwner
    • Related code file: github_repos.py

HCP Publications Dataset

  • Raw Data: Human Connectome Project (HCP) publications and citation networks.
  • Prepared Data: aggregate_titles_embeddings_umap_2d_with_info.parquet (340,855 rows, 9 columns)
    • Potential columns for visualization:
      • X & Y Coordinates: x, y
      • Point Size: n_cits (citation count)
      • Color: main_field (research domain)
      • Label: title
    • Related code file: hcp.py

Harris vs Trump Debate Dataset

  • Raw Data: Transcript of a political debate between Kamala Harris and Donald Trump.
  • Prepared Data: harris_vs_trump_debate_with_extras.parquet (1,141 rows, 21 columns)
    • Potential columns for visualization:
      • X & Y Coordinates: tsne__x, tsne__y, pca__x, pca__y
      • Point Size: certainty
      • Color: speaker_color
      • Label: text
    • Related code file: Not specified.
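
This dataset offers more than one 2D projection (t-SNE and PCA), so a plot has to pick one x/y pair. The sketch below groups coordinate columns into pairs by their shared prefix. It assumes coordinate columns end in an underscore-separated x or y, which matches the names listed above (double underscore) and those of the Spotify Playlists dataset below (single underscore). The helper name projection_pairs is hypothetical.

```python
import re


def projection_pairs(columns):
    """Map each projection name to its (x_column, y_column) pair.

    One or more underscores may precede the axis letter, so both
    tsne__x / tsne__y and umap_x / umap_y are recognised.
    """
    axis_pattern = re.compile(r"^(?P<name>.+?)_+(?P<axis>[xy])$")
    by_name = {}
    for column in columns:
        match = axis_pattern.match(str(column))
        if match:
            by_name.setdefault(match["name"], {})[match["axis"]] = str(column)
    # Keep only projections that have both an x and a y column.
    return {
        name: (axes["x"], axes["y"])
        for name, axes in by_name.items()
        if "x" in axes and "y" in axes
    }


debate_columns = [
    "tsne__x", "tsne__y", "pca__x", "pca__y",
    "certainty", "speaker_color", "text",
]
print(projection_pairs(debate_columns))
# {'tsne': ('tsne__x', 'tsne__y'), 'pca': ('pca__x', 'pca__y')}
```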

Spotify Playlists Dataset

  • Raw Data: Metadata on popular songs from various playlists, including holiday songs and the greatest 500 songs.
  • Prepared Data: holiday_songs_spotify_with_embeddings.parquet (167 rows, 27 columns)
    • Potential columns for visualization:
      • X & Y Coordinates: umap_x, umap_y, tsne_x, tsne_y
      • Point Size: popularity
      • Color: genre (derived from playlist)
      • Label: track_name
    • Related code file: Not specified.

LMSys Chat Conversations Dataset

Prompt Injections Dataset

Quotes Dataset

Usage Instructions

  1. Load a prepared .parquet file into a pandas DataFrame.
  2. Use Cosmograph or another visualization tool to create scatter or force-directed plots.
  3. Customize the x/y coordinates, size, color, and labels based on your analysis needs.
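
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the package's own API: load_prepared, missing_roles, and plot_frame are hypothetical helpers, and the file is assumed to have been downloaded to a local path already. The column names come from the dataset descriptions above, and where several candidates are listed the first one is used. pandas.read_parquet needs a parquet engine such as pyarrow to be installed.

```python
# Column roles for three of the prepared files, copied from the dataset
# descriptions above.
COLUMN_ROLES = {
    "github_repo_for_cosmos.parquet": {
        "x": "x", "y": "y", "size": "stars",
        "color": "primaryLanguage", "label": "nameWithOwner",
    },
    "aggregate_titles_embeddings_umap_2d_with_info.parquet": {
        "x": "x", "y": "y", "size": "n_cits",
        "color": "main_field", "label": "title",
    },
    "harris_vs_trump_debate_with_extras.parquet": {
        "x": "tsne__x", "y": "tsne__y", "size": "certainty",
        "color": "speaker_color", "label": "text",
    },
}


def load_prepared(path):
    """Step 1: read a prepared parquet file into a pandas DataFrame."""
    import pandas as pd  # imported lazily so the helpers below work without it
    return pd.read_parquet(path)


def missing_roles(columns, roles):
    """Return the roles whose column is absent from `columns`."""
    available = set(columns)
    return {role: name for role, name in roles.items() if name not in available}


def plot_frame(df, roles):
    """Step 3: keep only the mapped columns, renamed to their plot roles."""
    problems = missing_roles(df.columns, roles)
    if problems:
        raise KeyError(f"columns not found: {problems}")
    rename = {column: role for role, column in roles.items()}
    return df[list(roles.values())].rename(columns=rename)


# Usage (assumes the file sits in the working directory):
#   name = "harris_vs_trump_debate_with_extras.parquet"
#   points = plot_frame(load_prepared(name), COLUMN_ROLES[name])
# Step 2 then hands `points` to a plotting tool. With the cosmograph Python
# package this might look like the following; the keyword names are an
# assumption and should be checked against the Cosmograph documentation:
#   cosmo(points, point_x_by="x", point_y_by="y", point_size_by="size")
```

Checking the mapping before plotting gives a readable error when a file's columns differ from what the descriptions list.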

Acknowledgments

  • The data has been curated and prepared by Thor Whalen and contributors.
  • Data sources include Kaggle, Hugging Face, GitHub, and various public datasets.

For further details, please refer to the individual dataset documentation or the linked preparation scripts.
