LLM-powered category analysis for academic paper abstracts

cat-ademic

LLM-powered category analysis for academic paper abstracts via OpenAlex.

The Problem

If you study a research field, you know the challenge: hundreds of papers need to be characterized before you can map what a journal publishes, how methods are evolving, or where the gaps are. Manual reading doesn't scale. Keyword search misses nuance.

The Solution

cat-ademic fetches paper abstracts directly from OpenAlex and uses LLMs to classify, extract, and explore categories across them. It handles:

Category Assignment (classify): Classify papers into your predefined categories (multi-label supported)
Category Extraction (extract): Automatically discover and extract categories from abstracts when you don't have a predefined scheme
Category Exploration (explore): Analyze category stability and saturation through repeated raw extraction

No manual downloading. Point it at a journal ISSN (or OpenAlex topic), set a date range, and get back a structured CSV.

Table of Contents

Installation
Quick Start
Best Practices for Classification
Configuration
Supported Models
API Reference
Related Projects
Academic Research
Contributing & Support
License

Installation

pip install cat-ademic

Quick Start

import catademic as cat
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.environ["OPENAI_API_KEY"]

# Classify 250 recent papers from Social Science Computer Review
results = cat.classify(
    categories=[
        "Introduces new computational tool or method",
        "Applies LLM/AI to social science data",
        "Evaluates or benchmarks a method",
        "Improves survey or data collection",
        "Theory-driven / conceptual",
        "Other",
    ],
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Academic papers from Social Science Computer Review",
    api_key=api_key,
    filename="out/sscr_classified.csv",
)

print(results.head(10))

# Discover emergent categories without a predefined scheme
raw_categories = cat.explore(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Academic papers from Social Science Computer Review",
    api_key=api_key,
    filename="out/sscr_categories_raw.csv",
)

print(f"Total raw category strings extracted: {len(raw_categories)}")

Best Practices for Classification

These recommendations are based on empirical testing across multiple datasets and models, ranging from 7B-parameter models to frontier-class ones.

What works

Detailed category descriptions: The single biggest lever for accuracy. Instead of short labels like "Methods paper", use verbose descriptions like "Introduces a new computational tool or method, including software packages, algorithms, or pipelines." This consistently improves accuracy across all models.
Include an "Other" category: Adding a catch-all category prevents the model from forcing ambiguous papers into ill-fitting categories.
Low temperature (creativity=0): For classification tasks, deterministic output is generally preferable.

What doesn't help (or hurts)

Chain of Thought (chain_of_thought): Does not reliably improve classification accuracy and adds cost.
Chain of Verification (chain_of_verification): Uses ~4x the API calls. Tends to retract correct classifications during the verification step. Not recommended.
Step-back prompting (step_back_prompt): Inconsistent results across datasets. Not recommended as a default.

Summary

The most effective approach is: write detailed category descriptions, include an "Other" category, and use a capable model at low temperature.
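
The recommendations above can be sketched as a small helper that assembles the categories list before calling classify(). The helper name build_categories, the short labels, and the 40-character threshold are illustrative assumptions and not part of cat-ademic; the descriptions are adapted from the examples in this README.

```python
import warnings

# Hypothetical helper (not part of cat-ademic): turns short labels plus
# verbose descriptions into the list passed as `categories`.
MIN_DESCRIPTION_CHARS = 40  # arbitrary threshold, chosen for illustration


def build_categories(descriptions, catch_all="Other"):
    """Return the detailed descriptions with a catch-all category last."""
    categories = []
    for label, text in descriptions.items():
        if len(text) < MIN_DESCRIPTION_CHARS:
            warnings.warn(f"Description for {label!r} is short; consider adding detail.")
        categories.append(text)
    # Add the catch-all only if the caller has not supplied it already.
    if catch_all not in categories:
        categories.append(catch_all)
    return categories


categories = build_categories({
    "new_tool": (
        "Introduces a new computational tool or method, including "
        "software packages, algorithms, or pipelines."
    ),
    "applies_ai": (
        "Applies large language models or other AI methods to "
        "social science data."
    ),
})
print(categories)
```

The resulting list can then be passed as categories=categories, together with creativity=0, in a classify() call like the one in Quick Start.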

Configuration

Get Your API Key

Get an API key from your preferred provider:

OpenAI: platform.openai.com
Anthropic: console.anthropic.com
Google: aistudio.google.com
Huggingface: huggingface.co/settings/tokens
xAI: console.x.ai
Mistral: console.mistral.ai
Perplexity: perplexity.ai/settings/api

OpenAlex requires no API key or authentication. Passing polite_email is still recommended: it places your requests in the OpenAlex polite pool, which has higher rate limits.

Supported Models

OpenAI: GPT-4o, GPT-4, GPT-5, etc.
Anthropic: Claude Sonnet 4, Claude 3.5 Sonnet, Claude Haiku, etc.
Google: Gemini 2.5 Flash, Gemini 2.5 Pro, etc.
Huggingface: Qwen, Llama 4, DeepSeek, and thousands of community models
xAI: Grok models
Mistral: Mistral Large, etc.
Perplexity: Sonar Large, Sonar Small, etc.

Note: For best results, start with OpenAI or Anthropic.

API Reference

`classify()`

Classify academic paper abstracts into predefined categories. Abstracts are fetched automatically from OpenAlex when you provide a journal_issn or topic_id.

Parameters:

categories (list): List of category descriptions for classification
journal_issn (str): Journal ISSN to pull abstracts from via OpenAlex
journal_name (str, optional): Journal name for filtering by name instead of ISSN
topic_id (str, optional): OpenAlex topic ID to pull papers by research topic
topic_name (str, optional): OpenAlex topic name to pull papers by research topic
paper_limit (int, default=50): Number of papers to fetch
date_from (str, optional): Start date filter as "YYYY-MM-DD"
date_to (str, optional): End date filter as "YYYY-MM-DD"
polite_email (str, optional): Email for OpenAlex polite pool (higher rate limits)
api_key (str): API key for the LLM service
description (str): Description of the corpus context
user_model (str, default="gpt-4o"): Model to use
model_source (str, default="auto"): Provider — "auto", "openai", "anthropic", "google", "mistral", "perplexity", "huggingface", "xai"
creativity (float, optional): Temperature setting (0.0–1.0)
chain_of_thought (bool, default=False): Enable step-by-step reasoning
filename (str, optional): Output CSV filename
save_directory (str, optional): Directory to save results
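
Because every fetched paper leads to LLM calls, it can be worth checking journal_issn, date_from, and date_to locally before running classify(). The sketch below is a pre-flight check under stated assumptions: the function names are hypothetical, and whether cat-ademic performs similar validation internally is not documented here. The ISSN test uses the standard modulus-11 check digit.

```python
import re
from datetime import date

ISSN_PATTERN = re.compile(r"\d{4}-\d{3}[\dX]")
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_issn(issn):
    """Check the NNNN-NNNC shape and the ISSN modulus-11 check digit."""
    if not ISSN_PATTERN.fullmatch(issn):
        return False
    digits = issn.replace("-", "")
    # Weights 8, 7, ..., 2 apply to the first seven digits.
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


def is_valid_date(text):
    """Check that text is a real calendar date written as YYYY-MM-DD."""
    if not DATE_PATTERN.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


print(is_valid_issn("0894-4393"))  # ISSN used throughout this README
print(is_valid_date("2023-01-01"))
```

A failed check costs nothing, whereas a mistyped ISSN would only surface after the OpenAlex request has been made.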

Returns:

pandas.DataFrame: Results with one binary column per category (category_1, category_2, …), plus title, doi, publication_date, cited_by_count

Example:

import os

import catademic as cat

api_key = os.environ["OPENAI_API_KEY"]  # same key lookup as in Quick Start

results = cat.classify(
    categories=[
        "Introduces new computational tool or method",
        "Applies LLM/AI to social science data",
        "Theory-driven / conceptual",
        "Other",
    ],
    journal_issn="0894-4393",
    paper_limit=100,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    filename="sscr_classified.csv",
)

# Add readable column names
results["new_tool"] = results["category_1"]
results["applies_ai"] = results["category_2"]
results.to_csv("sscr_classified.csv", index=False)
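
The manual renaming above can be generalized with a mapping built from the category order. This sketch assumes that category_1, category_2, and so on follow the order of the categories list, as the example above suggests; the short names are chosen purely for illustration.

```python
# Short column names, listed in the same order as the `categories`
# argument passed to classify(). These names are illustrative only.
short_names = ["new_tool", "applies_ai", "theory", "other"]

# Output columns are numbered from 1 (category_1, category_2, ...).
column_map = {
    f"category_{i}": name for i, name in enumerate(short_names, start=1)
}
print(column_map)
```

With pandas, the mapping can be applied in one step via results.rename(columns=column_map).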

`extract()`

Automatically discover and extract categories from paper abstracts when you don't have a predefined scheme. Returns a clean, deduplicated, semantically merged set of categories.

Parameters:

journal_issn (str): Journal ISSN to pull abstracts from via OpenAlex
journal_name (str, optional): Journal name for filtering
topic_id (str, optional): OpenAlex topic ID
topic_name (str, optional): OpenAlex topic name
paper_limit (int, default=50): Number of papers to fetch
date_from (str, optional): Start date filter as "YYYY-MM-DD"
date_to (str, optional): End date filter as "YYYY-MM-DD"
polite_email (str, optional): Email for OpenAlex polite pool
api_key (str): API key for the LLM service
description (str): Description of the corpus
max_categories (int, default=12): Maximum number of categories to return
categories_per_chunk (int, default=10): Categories to extract per chunk
divisions (int, default=12): Number of chunks to divide data into
iterations (int, default=8): Number of extraction passes over the data
user_model (str, default="gpt-4o"): Model to use
specificity (str, default="broad"): "broad" or "specific" category granularity
research_question (str, optional): Research context to guide extraction
focus (str, optional): Focus instruction (e.g., "methodological contributions")
filename (str, optional): Output CSV filename
model_source (str, default="auto"): Provider

Returns:

dict with keys:
- counts_df: DataFrame of categories with counts
- top_categories: List of top category names
- raw_top_text: Raw model output

Example:

import os

import catademic as cat

api_key = os.environ["OPENAI_API_KEY"]  # same key lookup as in Quick Start

results = cat.extract(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    max_categories=10,
    focus="methodological contributions",
)

print(results["top_categories"])
# ['Computational text analysis', 'Survey methodology', 'Network analysis', ...]

`explore()`

Raw category extraction for frequency and saturation analysis. Unlike extract(), which normalizes and merges categories into a clean final set, explore() returns every category string from every chunk across every iteration — with duplicates intact.

This is useful for analyzing which categories are robust (consistently discovered across runs) versus noise (appearing only once or twice). Increasing iterations lets you build saturation curves showing when category discovery converges.

Parameters:

journal_issn (str): Journal ISSN to pull abstracts from via OpenAlex
journal_name (str, optional): Journal name for filtering
topic_id (str, optional): OpenAlex topic ID
topic_name (str, optional): OpenAlex topic name
paper_limit (int, default=50): Number of papers to fetch
date_from (str, optional): Start date filter as "YYYY-MM-DD"
date_to (str, optional): End date filter as "YYYY-MM-DD"
polite_email (str, optional): Email for OpenAlex polite pool
api_key (str): API key for the LLM service
description (str): Description of the corpus
categories_per_chunk (int, default=10): Categories to extract per chunk
divisions (int, default=12): Number of chunks to divide data into
iterations (int, default=8): Number of passes over the data
user_model (str, default="gpt-4o"): Model to use
specificity (str, default="broad"): "broad" or "specific" category granularity
research_question (str, optional): Research context to guide extraction
focus (str, optional): Focus instruction for extraction
random_state (int, optional): Random seed for reproducibility
filename (str, optional): Output CSV filename (one category per row)
model_source (str, default="auto"): Provider

Returns:

list[str]: Every category extracted from every chunk across every iteration. Length ≈ iterations × divisions × categories_per_chunk.
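
As a worked instance of the length formula above, the following sketch computes the approximate number of strings to expect for the default settings and for iterations=20. The function name expected_raw_count is hypothetical; the actual count may differ when the model returns fewer categories for a chunk.

```python
def expected_raw_count(iterations=8, divisions=12, categories_per_chunk=10):
    """Approximate number of category strings returned by explore()."""
    return iterations * divisions * categories_per_chunk


print(expected_raw_count())               # defaults: 8 * 12 * 10 = 960
print(expected_raw_count(iterations=20))  # 20 * 12 * 10 = 2400
```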

Example:

import os
from collections import Counter

import catademic as cat

api_key = os.environ["OPENAI_API_KEY"]  # same key lookup as in Quick Start

raw_categories = cat.explore(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    iterations=20,
    filename="sscr_categories_raw.csv",
)

counts = Counter(raw_categories)
for category, freq in counts.most_common(15):
    print(f"{freq:3d}x  {category}")
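
The saturation and robustness analysis described for explore() can be sketched in plain Python. This is one possible approach, not cat-ademic's own method: it assumes the returned list is ordered by iteration and that each iteration contributes about the same number of strings, neither of which is stated in this README. The helper names and the min_count threshold are hypothetical, and the sample list is synthetic.

```python
from collections import Counter


def normalize(category):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(category.lower().split())


def saturation_curve(raw_categories, iterations):
    """Cumulative number of distinct categories after each iteration.

    Assumes the list is ordered by iteration with equal-sized blocks;
    a leftover partial block produces one extra point.
    """
    block = max(1, len(raw_categories) // iterations)
    seen, curve = set(), []
    for start in range(0, len(raw_categories), block):
        seen.update(normalize(c) for c in raw_categories[start:start + block])
        curve.append(len(seen))
    return curve


def robust_categories(raw_categories, min_count=3):
    """Categories seen at least min_count times; the rest count as noise."""
    counts = Counter(normalize(c) for c in raw_categories)
    return {name: n for name, n in counts.items() if n >= min_count}


# Synthetic stand-in for explore() output: 3 iterations of 4 strings each.
sample = [
    "Survey methodology", "Network analysis", "Text analysis", "Web scraping",
    "survey methodology", "Network analysis", "Text analysis", "Agent-based models",
    "Survey methodology", "Network analysis", "Text analysis", "Web scraping",
]
print(saturation_curve(sample, iterations=3))
print(robust_categories(sample))
```

A curve that stops rising suggests additional iterations are no longer surfacing new categories. Exact string matching after normalization is deliberately simple; near-synonyms would still be counted separately.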

Related Projects

cat-llm: The survey response version of this tool — classifies and extracts categories from open-ended survey responses, images, and PDFs.
llm-web-research: LLM-powered web research with a Funnel of Verification methodology.

Academic Research

If you use this package for research, please cite:

Soria, C. (2025). cat-ademic (0.1.0). GitHub. https://github.com/chrissoria/cat-ademic

Contributing & Support

Report bugs or request features: Open a GitHub Issue
Research collaboration: Email ChrisSoria@Berkeley.edu

License

cat-ademic is distributed under the terms of the GNU GPL-3.0 license.

Release history

0.2.1

May 17, 2026

0.2.0

May 12, 2026

0.1.2

Mar 19, 2026

0.1.1

Mar 5, 2026

This version

0.1.0

Mar 5, 2026

Download files

Source Distribution

cat_ademic-0.1.0.tar.gz (435.2 kB)

Uploaded Mar 5, 2026 Source

Built Distribution

cat_ademic-0.1.0-py3-none-any.whl (458.3 kB)

Uploaded Mar 5, 2026 Python 3

File details

Details for the file cat_ademic-0.1.0.tar.gz.

File metadata

Download URL: cat_ademic-0.1.0.tar.gz
Upload date: Mar 5, 2026
Size: 435.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.14

File hashes

Hashes for cat_ademic-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b8135d35bcd292906b427c2d8486b370146fd555146b4731b7998936492c7ed6`
MD5	`1ca46f46ebf5deff99601309a10a9bb1`
BLAKE2b-256	`ca7aae6668e3b3b449a97815f031ee5148230a2742076fd2e916df46d615bad2`

File details

Details for the file cat_ademic-0.1.0-py3-none-any.whl.

File metadata

Download URL: cat_ademic-0.1.0-py3-none-any.whl
Upload date: Mar 5, 2026
Size: 458.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.14

File hashes

Hashes for cat_ademic-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`75343c68ea096446877fd315f4441af6605fa881b8d6b23b204ca073ad6c1073`
MD5	`38f32cfaf5eb9798340b270b39c29d9e`
BLAKE2b-256	`f3630be736cb34fad2674eeaa8eb22435340b81b17e7146bd35388bf854cee07`
