
Softverse: Auto-compute Citations to Software From Replication Files


We analyze replication files posted to the Harvard Dataverse by 34 social science journals, including the APSR, AJPS, JoP, BJPolS, Political Analysis, World Politics, and Political Behavior, and tally the libraries those files use. These tallies can serve as citation metrics for software.

For background, see: https://gojiberries.io/2023/07/02/hard-problems-about-research-software/

Installation

Prerequisites

  • Python 3.11 or higher
  • uv package manager

Install uv (if not already installed)

curl -LsSf https://astral.sh/uv/install.sh | sh

Install Softverse

uv pip install softverse

Development Setup

git clone https://github.com/recite/softverse.git
cd softverse
uv sync --all-extras

Usage

Quick Start

Run the complete data collection and analysis pipeline:

uv run run-pipeline

Individual Components

1. Collect Datasets from Dataverse

# Collect datasets using configuration file
uv run collect-datasets --config config/settings.yaml --output-dir outputs/data/datasets/

# Force refresh to re-download all datasets
uv run collect-datasets --force-refresh --output-dir outputs/data/datasets/

# Use custom CSV input file with dataverse information
uv run collect-datasets --input-csv data/dataverse_socialscience.csv --output-dir outputs/data/datasets/

2. Collect Scripts and Code Files

# Collect scripts from all sources (Dataverse, Zenodo, ICPSR)
uv run collect-scripts --source all --base-output-dir outputs/scripts/

# Collect only from Dataverse
uv run collect-scripts --source dataverse --datasets-dir outputs/data/datasets/ --base-output-dir outputs/scripts/

# Collect from Zenodo with specific communities
uv run collect-scripts --source zenodo --zenodo-communities harvard-dataverse --max-zenodo-records 1000

# Collect from ICPSR with query
uv run collect-scripts --source icpsr --icpsr-query "political science" --max-icpsr-studies 500

3. Analyze Software Imports

# Analyze imports from collected scripts
uv run analyze-imports --scripts-dir outputs/scripts/ --output-dir outputs/analysis/

# Specify script patterns to analyze
uv run analyze-imports --scripts-dir outputs/scripts/ --output-dir outputs/analysis/ --config config/settings.yaml

4. Collect from OSF (Open Science Framework)

# Note: OSF collector is available via Python API (see below)
from softverse.collectors import OSFCollector

5. Collect from ResearchBox

# Note: ResearchBox collector is available via Python API (see below)
from softverse.collectors import ResearchBoxCollector

Configuration

API Keys and Authentication

API keys can be configured in two ways:

  1. Environment Variables (Recommended for security):
export DATAVERSE_API_KEY="your-dataverse-api-key"
export OSF_API_TOKEN="your-osf-personal-access-token"
export ZENODO_ACCESS_TOKEN="your-zenodo-access-token"
export ICPSR_USERNAME="your-icpsr-username"
export ICPSR_PASSWORD="your-icpsr-password"
  2. Configuration File (config/settings.yaml):
dataverse:
  base_url: "https://dataverse.harvard.edu"
  api_key: "your-api-key-here"  # Optional, for authenticated requests
  input_csv: "data/dataverse_socialscience.csv"

osf:
  api_token: "your-osf-token-here"  # Optional, for authenticated requests
  rate_limit_delay: 0.5

zenodo:
  access_token: "your-zenodo-token"  # Optional, for authenticated requests
  communities: ["harvard-dataverse"]

icpsr:
  username: "your-username"
  password: "your-password"

researchbox:
  base_url: "https://researchbox.org"
  concurrent_downloads: 5

# Output directories
output:
  datasets_dir: "outputs/data/datasets"
  scripts_dir: "outputs/scripts"
  analysis_dir: "outputs/analysis"
  logs_dir: "outputs/logs"

Note: Environment variables take precedence over configuration file values for sensitive data like API keys.
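The precedence rule can be sketched in a few lines of Python. This is illustrative only: the helper name and the config layout are assumptions, not softverse's actual internals.

```python
import os

# Minimal sketch of the precedence rule: an environment variable, if set,
# wins over the corresponding value from config/settings.yaml.
def resolve_setting(env_var: str, config: dict, *keys, default=None):
    """Return os.environ[env_var] if set, else the nested config value."""
    value = os.environ.get(env_var)
    if value:
        return value
    node = config
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

# The YAML value is used only when the environment variable is unset.
config = {"dataverse": {"api_key": "key-from-yaml"}}
api_key = resolve_setting("DATAVERSE_API_KEY", config, "dataverse", "api_key")
```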

Python API

from pathlib import Path
from softverse.collectors import (
    DataverseCollector,
    OSFCollector,
    ResearchBoxCollector,
    ZenodoCollector
)
from softverse.analyzers import ImportAnalyzer

# Initialize collectors
dataverse = DataverseCollector()
osf = OSFCollector()
researchbox = ResearchBoxCollector()

# Collect datasets from Harvard Dataverse
datasets = dataverse.collect_from_dataverse_csv(
    csv_path="data/dataverse_socialscience.csv",
    output_dir=Path("outputs/datasets")
)

# Search and collect from OSF
osf_results = osf.search_nodes("reproducibility")
osf.collect_nodes(
    node_ids=[node["id"] for node in osf_results[:10]],
    output_dir=Path("outputs/osf"),
    download_files=True
)

# Collect from ResearchBox
researchbox.collect_range(
    start_id=1,
    end_id=100,
    output_dir=Path("outputs/researchbox"),
    extract=True
)

# Analyze R package imports
analyzer = ImportAnalyzer()
results = analyzer.analyze_directory(
    directory=Path("outputs/scripts"),
    output_dir=Path("outputs/analysis")
)

# Get summary statistics
summary = analyzer.generate_summary_statistics(results)
print(f"Total scripts analyzed: {summary['total_files']}")
print(f"Top R packages: {summary['top_r_packages'][:10]}")
print(f"Top Python packages: {summary['top_python_packages'][:10]}")

Scripts

  1. Datasets by Dataverse produces a list of datasets per dataverse (.gz)

  2. List And Download All (R) Scripts Per Dataset takes the files from step #1, produces a list of files per dataset (.gz), and downloads those scripts (dump here)

  3. Regex the files to tally imports takes the output from step #2 and produces imports per file and imports per package; if a repository imports the same package in multiple files, we count it only once. A snippet of that last file can be seen below.

p.s. Deprecated R Files here
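The regex-tally step can be sketched as below. The patterns and the one-subdirectory-per-repository layout are assumptions for illustration, not softverse's actual implementation; the once-per-repository rule matches the description above.

```python
import re
from collections import Counter
from pathlib import Path

# Match R imports of the form library(pkg), library("pkg"), require(pkg).
R_IMPORT = re.compile(r'\b(?:library|require)\s*\(\s*["\']?([A-Za-z][\w.]*)')

def tally_r_imports(scripts_root: Path) -> Counter:
    """Count R package imports, at most once per repository.

    Assumes scripts_root holds one subdirectory per repository,
    each containing the repository's .R files.
    """
    counts = Counter()
    for repo_dir in sorted(p for p in scripts_root.iterdir() if p.is_dir()):
        pkgs = set()  # a set, so each package counts once per repository
        for script in repo_dir.rglob("*.R"):
            pkgs.update(R_IMPORT.findall(script.read_text(errors="ignore")))
        counts.update(pkgs)
    return counts
```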

Top R Package Imports

package count
ggplot2 1322
foreign 1009
stargazer 901
dplyr 789
tidyverse 720
xtable 608
plyr 485
lmtest 451
MASS 442
gridExtra 420
sandwich 394
haven 356
car 342
readstata13 339
reshape2 324
stringr 318
texreg 273
data.table 263
scales 257
tidyr 253
grid 247
lme4 241
Hmisc 236
lubridate 223
readxl 218
broom 195
lfe 190
RColorBrewer 188
ggpubr 188
estimatr 174

Authors

Gaurav Sood and Daniel Weitzel

Download files


Source Distribution

softverse-0.1.0.tar.gz (61.2 kB)

Built Distribution


softverse-0.1.0-py3-none-any.whl (78.3 kB)

File details

Details for the file softverse-0.1.0.tar.gz.

File metadata

  • Download URL: softverse-0.1.0.tar.gz
  • Upload date:
  • Size: 61.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for softverse-0.1.0.tar.gz
Algorithm Hash digest
SHA256 aab90a9a2902b6100fb8625804628815a40502060d4c0b07ef292d53a89b9763
MD5 1254e93e3fb42974269fdf266127949b
BLAKE2b-256 5d74ec5f930e0c8aeb2b56487dae63faea1172cb4ff759e49d41d7da18d328a8


File details

Details for the file softverse-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: softverse-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 78.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for softverse-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f76ce8230252634ba7a90d209b8a2a65df956679323b7bef6f68716ed4013be1
MD5 591a87b0046d65e8c41b52b644194b64
BLAKE2b-256 9d05f76d5f690b1af1d7643243175ecdecc32c8bf0fe0c44be87d42d425aaecb

