LLM-powered category analysis for academic paper abstracts
cat-ademic
LLM-powered category analysis for academic paper abstracts via OpenAlex.
The Problem
If you study a research field, you know the challenge: hundreds of papers need to be characterized before you can map what a journal publishes, how methods are evolving, or where the gaps are. Manual reading doesn't scale. Keyword search misses nuance.
The Solution
cat-ademic fetches paper abstracts directly from OpenAlex and uses LLMs to classify, extract, and explore categories across them. It handles:
- Category Assignment (classify): Classify papers into your predefined categories (multi-label supported)
- Category Extraction (extract): Automatically discover and extract categories from abstracts when you don't have a predefined scheme
- Category Exploration (explore): Analyze category stability and saturation through repeated raw extraction
No manual downloading. Point it at a journal ISSN (or OpenAlex topic), set a date range, and get back a structured CSV.
Table of Contents
- Installation
- Quick Start
- Best Practices for Classification
- Configuration
- Supported Models
- API Reference
- Related Projects
- Academic Research
- Contributing & Support
- License
Installation
pip install cat-ademic
Quick Start
import catademic as cat
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.environ["OPENAI_API_KEY"]

# Classify 250 recent papers from Social Science Computer Review
results = cat.classify(
    categories=[
        "Introduces new computational tool or method",
        "Applies LLM/AI to social science data",
        "Evaluates or benchmarks a method",
        "Improves survey or data collection",
        "Theory-driven / conceptual",
        "Other",
    ],
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Academic papers from Social Science Computer Review",
    api_key=api_key,
    filename="out/sscr_classified.csv",
)
print(results.head(10))

# Discover emergent categories without a predefined scheme
raw_categories = cat.explore(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Academic papers from Social Science Computer Review",
    api_key=api_key,
    filename="out/sscr_categories_raw.csv",
)
print(f"Total raw category strings extracted: {len(raw_categories)}")
Best Practices for Classification
These recommendations are based on empirical testing across multiple datasets and models (7B to frontier-class).
What works
- Detailed category descriptions: The single biggest lever for accuracy. Instead of short labels like "Methods paper", use verbose descriptions like "Introduces a new computational tool or method, including software packages, algorithms, or pipelines." This consistently improves accuracy across all models.
- Include an "Other" category: Adding a catch-all category prevents the model from forcing ambiguous papers into ill-fitting categories.
- Low temperature (creativity=0): For classification tasks, deterministic output is generally preferable.
What doesn't help (or hurts)
- Chain of Thought (chain_of_thought): Does not reliably improve classification accuracy and adds cost.
- Chain of Verification (chain_of_verification): Uses ~4x the API calls. Tends to retract correct classifications during the verification step. Not recommended.
- Step-back prompting (step_back_prompt): Inconsistent results across datasets. Not recommended as a default.
Summary
The most effective approach is: write detailed category descriptions, include an "Other" category, and use a capable model at low temperature.
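As a rough illustration of that summary, the sketch below assembles arguments for classify() with verbose category descriptions, a catch-all "Other" category, and creativity=0. The category wording and the run_classification helper are illustrative assumptions, not part of the package; the parameter names are taken from the API Reference section.

```python
# Verbose descriptions instead of short labels; the wording is only an example.
CATEGORIES = [
    "Introduces a new computational tool or method, including software "
    "packages, algorithms, or pipelines.",
    "Evaluates or benchmarks an existing method against alternatives or "
    "ground-truth data.",
    "Other",  # catch-all so ambiguous papers are not forced into a poor fit
]

CLASSIFY_KWARGS = {
    "categories": CATEGORIES,
    "journal_issn": "0894-4393",
    "paper_limit": 50,
    "date_from": "2023-01-01",
    "description": "Academic papers from Social Science Computer Review",
    "creativity": 0,  # low temperature for more deterministic labels
}


def run_classification(api_key):
    """Hypothetical wrapper: calls cat.classify() with the settings above."""
    import catademic as cat  # imported lazily; requires `pip install cat-ademic`

    return cat.classify(api_key=api_key, **CLASSIFY_KWARGS)
```

Calling run_classification(api_key) makes real OpenAlex and LLM requests, so the settings are kept in a plain dictionary that can be inspected or adjusted first.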
Configuration
Get Your API Key
Get an API key from your preferred provider:
- OpenAI: platform.openai.com
- Anthropic: console.anthropic.com
- Google: aistudio.google.com
- Huggingface: huggingface.co/settings/tokens
- xAI: console.x.ai
- Mistral: console.mistral.ai
- Perplexity: perplexity.ai/settings/api
OpenAlex is unauthenticated and requires no key. Providing a polite_email in your requests is recommended for higher rate limits.
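For readers curious what a polite OpenAlex request looks like, here is a minimal sketch that builds a works query with a mailto parameter and rebuilds an abstract from the inverted-index format in which OpenAlex serves abstracts. It illustrates the public OpenAlex API only; how cat-ademic performs its fetches internally is not documented here, the mapping from polite_email to mailto is an assumption, and build_works_url and rebuild_abstract are hypothetical helper names.

```python
from urllib.parse import urlencode


def build_works_url(issn, date_from=None, date_to=None, per_page=50, polite_email=None):
    """Build an OpenAlex /works URL filtered by journal ISSN and date range."""
    filters = [f"primary_location.source.issn:{issn}"]
    if date_from:
        filters.append(f"from_publication_date:{date_from}")
    if date_to:
        filters.append(f"to_publication_date:{date_to}")
    params = {"filter": ",".join(filters), "per-page": per_page}
    if polite_email:
        params["mailto"] = polite_email  # opts in to the polite pool
    return "https://api.openalex.org/works?" + urlencode(params)


def rebuild_abstract(inverted_index):
    """Turn {"word": [positions, ...]} back into plain text."""
    if not inverted_index:
        return ""
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


url = build_works_url("0894-4393", date_from="2023-01-01", polite_email="you@example.org")
print(url)
print(rebuild_abstract({"models": [1, 4], "Language": [0], "classify": [2], "topic": [3]}))
```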
Supported Models
- OpenAI: GPT-4o, GPT-4, GPT-5, etc.
- Anthropic: Claude Sonnet 4, Claude 3.5 Sonnet, Claude Haiku, etc.
- Google: Gemini 2.5 Flash, Gemini 2.5 Pro, etc.
- Huggingface: Qwen, Llama 4, DeepSeek, and thousands of community models
- xAI: Grok models
- Mistral: Mistral Large, etc.
- Perplexity: Sonar Large, Sonar Small, etc.
Note: For best results, start with OpenAI or Anthropic.
API Reference
classify()
Classify academic paper abstracts into predefined categories. Abstracts are fetched automatically from OpenAlex when you provide a journal_issn or topic_id.
Parameters:
- categories (list): List of category descriptions for classification
- journal_issn (str): Journal ISSN to pull abstracts from via OpenAlex
- journal_name (str, optional): Journal name for filtering by name instead of ISSN
- topic_id (str, optional): OpenAlex topic ID to pull papers by research topic
- topic_name (str, optional): OpenAlex topic name to pull papers by research topic
- paper_limit (int, default=50): Number of papers to fetch
- date_from (str, optional): Start date filter as "YYYY-MM-DD"
- date_to (str, optional): End date filter as "YYYY-MM-DD"
- polite_email (str, optional): Email for OpenAlex polite pool (higher rate limits)
- api_key (str): API key for the LLM service
- description (str): Description of the corpus context
- user_model (str, default="gpt-4o"): Model to use
- model_source (str, default="auto"): Provider: "auto", "openai", "anthropic", "google", "mistral", "perplexity", "huggingface", "xai"
- creativity (float, optional): Temperature setting (0.0–1.0)
- chain_of_thought (bool, default=False): Enable step-by-step reasoning
- filename (str, optional): Output CSV filename
- save_directory (str, optional): Directory to save results
Returns:
- pandas.DataFrame: Results with one binary column per category (category_1, category_2, …), plus title, doi, publication_date, cited_by_count
Example:
import os

import catademic as cat

# Assumes the key is available in the environment, as in Quick Start
api_key = os.environ["OPENAI_API_KEY"]

results = cat.classify(
    categories=[
        "Introduces new computational tool or method",
        "Applies LLM/AI to social science data",
        "Theory-driven / conceptual",
        "Other",
    ],
    journal_issn="0894-4393",
    paper_limit=100,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    filename="sscr_classified.csv",
)

# Add readable column names
results["new_tool"] = results["category_1"]
results["applies_ai"] = results["category_2"]
results.to_csv("sscr_classified.csv", index=False)
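Once the CSV is written, the binary category_N columns can be summarized without extra dependencies. The following is a minimal sketch using only the standard library. The sample rows are made up for illustration, category_shares is a hypothetical helper, and the assumption that cells hold 0/1 values follows from the "binary column" description above rather than from inspecting real output.

```python
import csv
import io

# Made-up sample standing in for a file such as sscr_classified.csv.
SAMPLE_CSV = """title,category_1,category_2,category_3
Paper A,1,0,0
Paper B,1,1,0
Paper C,0,0,1
Paper D,0,1,0
"""

# Readable names for the positional columns, in the order the categories were passed.
LABELS = {
    "category_1": "new_tool",
    "category_2": "applies_ai",
    "category_3": "other",
}


def category_shares(csv_file, labels):
    """Return {label: share of papers flagged 1} for each mapped column."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        return {}
    shares = {}
    for column, label in labels.items():
        # float() first so that values written as "1.0" are also handled
        flagged = sum(int(float(row[column] or 0)) for row in rows)
        shares[label] = flagged / len(rows)
    return shares


shares = category_shares(io.StringIO(SAMPLE_CSV), LABELS)
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

With a real file, pass open("sscr_classified.csv", newline="") instead of the StringIO object. Because classification is multi-label, the shares need not sum to 1.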
extract()
Automatically discover and extract categories from paper abstracts when you don't have a predefined scheme. Returns a clean, deduplicated, semantically merged set of categories.
Parameters:
- journal_issn (str): Journal ISSN to pull abstracts from via OpenAlex
- journal_name (str, optional): Journal name for filtering
- topic_id (str, optional): OpenAlex topic ID
- topic_name (str, optional): OpenAlex topic name
- paper_limit (int, default=50): Number of papers to fetch
- date_from (str, optional): Start date filter as "YYYY-MM-DD"
- date_to (str, optional): End date filter as "YYYY-MM-DD"
- polite_email (str, optional): Email for OpenAlex polite pool
- api_key (str): API key for the LLM service
- description (str): Description of the corpus
- max_categories (int, default=12): Maximum number of categories to return
- categories_per_chunk (int, default=10): Categories to extract per chunk
- divisions (int, default=12): Number of chunks to divide data into
- iterations (int, default=8): Number of extraction passes over the data
- user_model (str, default="gpt-4o"): Model to use
- specificity (str, default="broad"): "broad" or "specific" category granularity
- research_question (str, optional): Research context to guide extraction
- focus (str, optional): Focus instruction (e.g., "methodological contributions")
- filename (str, optional): Output CSV filename
- model_source (str, default="auto"): Provider
Returns:
- dict with keys:
  - counts_df: DataFrame of categories with counts
  - top_categories: List of top category names
  - raw_top_text: Raw model output
Example:
import os

import catademic as cat

# Assumes the key is available in the environment, as in Quick Start
api_key = os.environ["OPENAI_API_KEY"]

results = cat.extract(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    max_categories=10,
    focus="methodological contributions",
)
print(results["top_categories"])
# ['Computational text analysis', 'Survey methodology', 'Network analysis', ...]
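A common follow-up is to feed the discovered categories into classify(). The sketch below shows one way to chain the two calls. extract_then_classify is a hypothetical helper, not part of the package, and the module is passed in as an argument so the flow can be exercised with a stand-in object. Appending "Other" follows the best-practice advice earlier in this document and is not something extract() is documented to do.

```python
def extract_then_classify(cat, api_key, max_categories=10, **corpus):
    """Discover categories with cat.extract(), then label papers with cat.classify().

    `cat` is the imported catademic module (or any object exposing the same
    two functions). `corpus` holds the shared arguments, such as journal_issn,
    paper_limit, date_from, and description.
    """
    extracted = cat.extract(api_key=api_key, max_categories=max_categories, **corpus)
    categories = list(extracted["top_categories"])
    if "Other" not in categories:
        categories.append("Other")  # catch-all for papers that fit nothing else
    return cat.classify(categories=categories, api_key=api_key, **corpus)


# Usage (makes real API calls, so it is left commented out):
# import catademic as cat
# results = extract_then_classify(
#     cat,
#     api_key,
#     journal_issn="0894-4393",
#     paper_limit=100,
#     date_from="2023-01-01",
#     description="Social Science Computer Review papers",
# )
```

Short extracted labels may classify less accurately than verbose descriptions, so it can be worth rewriting the discovered names into fuller sentences before the second call.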
explore()
Raw category extraction for frequency and saturation analysis. Unlike extract(), which normalizes and merges categories into a clean final set, explore() returns every category string from every chunk across every iteration — with duplicates intact.
This is useful for analyzing which categories are robust (consistently discovered across runs) versus noise (appearing only once or twice). Increasing iterations lets you build saturation curves showing when category discovery converges.
Parameters:
- journal_issn (str): Journal ISSN to pull abstracts from via OpenAlex
- journal_name (str, optional): Journal name for filtering
- topic_id (str, optional): OpenAlex topic ID
- topic_name (str, optional): OpenAlex topic name
- paper_limit (int, default=50): Number of papers to fetch
- date_from (str, optional): Start date filter as "YYYY-MM-DD"
- date_to (str, optional): End date filter as "YYYY-MM-DD"
- polite_email (str, optional): Email for OpenAlex polite pool
- api_key (str): API key for the LLM service
- description (str): Description of the corpus
- categories_per_chunk (int, default=10): Categories to extract per chunk
- divisions (int, default=12): Number of chunks to divide data into
- iterations (int, default=8): Number of passes over the data
- user_model (str, default="gpt-4o"): Model to use
- specificity (str, default="broad"): "broad" or "specific" category granularity
- research_question (str, optional): Research context to guide extraction
- focus (str, optional): Focus instruction for extraction
- random_state (int, optional): Random seed for reproducibility
- filename (str, optional): Output CSV filename (one category per row)
- model_source (str, default="auto"): Provider
Returns:
- list[str]: Every category extracted from every chunk across every iteration. Length ≈ iterations × divisions × categories_per_chunk (roughly 8 × 12 × 10 = 960 strings with the defaults).
Example:
import os
from collections import Counter

import catademic as cat

# Assumes the key is available in the environment, as in Quick Start
api_key = os.environ["OPENAI_API_KEY"]

raw_categories = cat.explore(
    journal_issn="0894-4393",
    paper_limit=250,
    date_from="2023-01-01",
    description="Social Science Computer Review papers",
    api_key=api_key,
    iterations=20,
    filename="sscr_categories_raw.csv",
)

# Count how often each raw category string was produced
counts = Counter(raw_categories)
for category, freq in counts.most_common(15):
    print(f"{freq:3d}x {category}")
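The saturation curve mentioned above can be approximated from the raw list. This is a minimal sketch under two assumptions that the documentation does not confirm: that explore() returns strings in iteration order, and that each iteration contributes about the same number of strings. saturation_curve and robust_categories are hypothetical helpers, and the lowercase/strip normalization and the min_count threshold are illustrative choices.

```python
from collections import Counter


def normalize(category):
    """Collapse trivial variations (case, surrounding whitespace)."""
    return category.strip().lower()


def saturation_curve(raw_categories, iterations):
    """Cumulative count of distinct categories after each iteration.

    Assumes the list is ordered by iteration and splits evenly.
    """
    if iterations <= 0 or not raw_categories:
        return []
    per_iteration = -(-len(raw_categories) // iterations)  # ceiling division
    seen = set()
    curve = []
    for start in range(0, len(raw_categories), per_iteration):
        seen.update(normalize(c) for c in raw_categories[start:start + per_iteration])
        curve.append(len(seen))
    return curve


def robust_categories(raw_categories, min_count=3):
    """Categories seen at least `min_count` times; the rest is likely noise."""
    counts = Counter(normalize(c) for c in raw_categories)
    return {name: n for name, n in counts.items() if n >= min_count}


# Toy data standing in for explore() output: 3 iterations of 3 strings each.
toy = [
    "Survey methodology", "Network analysis", "Text analysis",
    "survey methodology", "Text analysis", "Agent-based models",
    "Survey methodology", "Text analysis", "Network analysis",
]
print(saturation_curve(toy, iterations=3))
print(robust_categories(toy, min_count=3))
```

A curve that flattens suggests that further iterations are surfacing few new categories. Exact string matching after light normalization will still treat near-synonyms as distinct, so the curve is best read as an upper bound on the number of genuinely different categories.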
Related Projects
- cat-llm: The survey response version of this tool — classifies and extracts categories from open-ended survey responses, images, and PDFs.
- llm-web-research: LLM-powered web research with a Funnel of Verification methodology.
Academic Research
If you use this package for research, please cite:
Soria, C. (2025). cat-ademic (0.1.0). GitHub. https://github.com/chrissoria/cat-ademic
Contributing & Support
- Report bugs or request features: Open a GitHub Issue
- Research collaboration: Email ChrisSoria@Berkeley.edu
License
cat-ademic is distributed under the terms of the GNU GPL-3.0 license.