
Automated tool for assessing predatory journals using multiple backend sources


Aletheia-Probe: Automated Integrity Checks for Academic Journals & Conferences

CI/CD Pipeline | License: MIT | Python 3.10+

🚧 Beta Release (v0.1.0) - This tool is currently in beta testing. We welcome feedback and bug reports from early adopters. See BETA-TESTING.md for how to help test and provide feedback.

Aletheia-Probe is a comprehensive command-line tool for evaluating the legitimacy of academic journals and conferences. By aggregating data from authoritative sources and applying advanced pattern analysis, it helps researchers, librarians, and institutions detect predatory venues and ensure the integrity of scholarly publishing.

About the Name

The name "Aletheia" (ἀλήθεια) comes from ancient Greek philosophy, where it represents the concept of truth and unconcealment. In Greek mythology, Aletheia was personified as the goddess or spirit (daimona) of truth and sincerity. This reflects the tool's core mission: to reveal the truth about academic journals and conferences, helping researchers distinguish legitimate venues from predatory ones. The suffix "-Probe" emphasizes the tool's investigative nature: actively examining and uncovering the reality behind scholarly publishing claims.

TL;DR

# Beta release - Install from PyPI or source

# Option 1: Install from PyPI (recommended for beta testing)
pip install aletheia-probe

# Option 2: Install from source (for development)
git clone https://github.com/sustainet-guardian/aletheia-probe.git
cd aletheia-probe
pip install -e .

# Optional: Install unrar/unar if you want to use all data sources
# (required for Algerian Ministry source - skipped if not available)
# Debian/Ubuntu:
sudo apt-get install unrar
# macOS:
brew install unar
# Windows (via chocolatey):
choco install unrar

# First time: Sync data sources (takes a few minutes)
aletheia-probe sync

# Check the current state of the cache database
aletheia-probe status

# Assess a single journal
aletheia-probe journal "Journal of Computer Science"

# Assess all journals in a BibTeX file (returns exit code 1 if predatory journals found)
aletheia-probe bibtex references.bib

# Get detailed analysis with confidence scores from multiple sources
aletheia-probe journal --format json "Nature Reviews Drug Discovery"

Output: a confidence-scored assessment of journal legitimacy, combining data from multiple authoritative sources with advanced pattern analysis.

Note: The first sync downloads and processes data from multiple sources (DOAJ, Beall's List, etc.), which takes a few minutes. After that, queries typically complete in under 5 seconds.

Data Sources

This tool is a data aggregator: it provides no data of its own, but combines information from multiple authoritative sources:

  • DOAJ - Directory of Open Access Journals
  • Beall's List - Historical predatory journal archives
  • Algerian Ministry - Algerian Ministry of Higher Education predatory journals list
  • OpenAlex - Publication pattern analysis
  • Crossref - Metadata quality assessment
  • Retraction Watch - Journal retraction history analysis
  • Scopus - Optional premium journal database
  • Institutional Lists - Custom whitelist/blacklist configurations
  • Cross-Validator - Cross-source consistency validation system
  • Kscien Standalone Journals - Individual predatory journals identified by Kscien
  • Kscien Publishers - Known predatory publishers
  • Kscien Hijacked Journals - Legitimate journals that have been hijacked by predatory actors
  • Kscien Predatory Conferences - Database of predatory conferences

The tool analyzes publication patterns, citation metrics, and metadata quality to provide comprehensive coverage beyond traditional blacklist/whitelist approaches.

Note on Conference Assessment: Conference checking is currently limited compared to journal assessment. The primary source for conference evaluation is the Kscien Predatory Conferences database. Most other data sources focus exclusively on journals, so conference assessments may have less comprehensive coverage and fewer cross-validation opportunities.

Quick Start

See the Quick Start Guide for installation instructions and basic usage examples.

Assessment Methodology

The tool uses a hybrid approach combining curated databases with advanced pattern analysis to achieve comprehensive coverage and high accuracy.

Backend Types

Curated Databases (High Trust)

These provide authoritative yes/no decisions for journals they cover:

Backend | Type | Coverage | Purpose
--- | --- | --- | ---
DOAJ | Legitimate OA journals | 22,000+ journals | Gold standard for open access legitimacy
Scopus (optional) | Legitimate indexed journals | 30,000+ journals | Major subscription and OA journals
Beall's List | Predatory journal archives | ~2,900 entries | Historically identified predatory publishers
PredatoryJournals.org | Predatory journals/publishers | Community-maintained | Curated lists from predatoryjournals.org
Algerian Ministry | Predatory journal list | ~3,300 entries | Ministry of Higher Education predatory journals
Kscien Standalone Journals | Predatory journals | 1,400+ entries | Individual predatory journals identified by Kscien
Kscien Publishers | Predatory publishers | 1,200+ entries | Known predatory publishers
Kscien Hijacked Journals | Hijacked journals | ~200 entries | Legitimate journals compromised by predatory actors
Kscien Predatory Conferences | Predatory conferences | ~450 entries | Identified predatory conference venues
Retraction Watch | Quality indicator | ~27,000 journals | Retraction rates and patterns for quality assessment
Institutional Lists | Custom whitelist/blacklist | Organization-specific | Local policy enforcement

Pattern Analysis (Evidence-Based)

These analyze publication patterns and metadata quality to detect predatory characteristics:

Backend | Data Source | What It Analyzes | Key Indicators
--- | --- | --- | ---
OpenAlex Analyzer | OpenAlex API (240M+ works) | Publication volume, citation patterns, author diversity, growth rates | Abnormal publication volumes (>1000/year), suspicious citation ratios, rapid growth patterns
Crossref Analyzer | Crossref metadata API | Metadata completeness, abstracts, references, author information | Missing metadata, poor quality abstracts (<100 chars), low reference counts
Cross-Validator | Cross-source data | Publisher name consistency, data correlation across sources | Mismatched publisher names, data inconsistencies between sources
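As an illustration of the Cross-Validator idea, publisher names reported by different sources can be normalized before comparison. This is only a sketch: the function names and normalization rules below are assumptions, not the tool's actual API.

```python
import re

# Hypothetical helpers illustrating a publisher-name consistency check;
# the tool's real Cross-Validator implementation may differ.
def normalize_publisher(name: str) -> str:
    """Lowercase, drop punctuation and common corporate/publishing suffixes."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    name = re.sub(r"\b(inc|ltd|llc|gmbh|bv|publishing|publishers?)\b", "", name)
    return " ".join(name.split())

def publishers_consistent(names: list[str]) -> bool:
    """True if every source reports the same normalized publisher name."""
    normalized = {normalize_publisher(n) for n in names if n}
    return len(normalized) <= 1

print(publishers_consistent(["Springer Nature", "springer nature"]))      # True
print(publishers_consistent(["Elsevier B.V.", "OMICS Publishing Group"])) # False
```

A mismatch here would count as a "data inconsistency between sources" signal rather than a verdict on its own.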

How Assessment Works

1. Multi-Backend Query

The tool queries all enabled backends concurrently for comprehensive coverage:

Journal Query → [Curated Databases + Pattern Analyzers] → Combined Assessment
                 │
                 ├─ DOAJ (legitimate OA)
                 ├─ Scopus (indexed journals)
                 ├─ Beall's List (predatory)
                 ├─ PredatoryJournals.org
                 ├─ Kscien databases
                 ├─ Retraction Watch (quality)
                 ├─ OpenAlex Analyzer (patterns)
                 ├─ Crossref Analyzer (metadata)
                 └─ Cross-Validator (consistency)

Note: Not all backends will find every journal. A journal may be:

  • Found in DOAJ → strong legitimate evidence
  • Found in Beall's → strong predatory evidence
  • Not found in any curated database → rely on pattern analysis
  • Found in contradictory sources → cross-validation resolves conflicts

2. Assessment Logic

Curated Database Results (Authoritative):

  • DOAJ/Scopus match → Classified as legitimate (high confidence)
  • Predatory list match → Classified as predatory (high confidence)
  • No matches found → Proceed to pattern analysis

Pattern Analysis (Evidence-Based): When curated databases don't have the journal, pattern analyzers evaluate quality:

🟢 Legitimacy Indicators (OpenAlex/Crossref):

  • Consistent publication volume (20-500 papers/year)
  • Healthy citation patterns (>3 citations/paper average)
  • Complete metadata (abstracts >100 chars, references, author ORCIDs)
  • Recognized publisher with history
  • Stable growth patterns

🔴 Predatory Indicators (OpenAlex/Crossref):

  • Publication mill patterns (>1000 papers/year)
  • Extremely low citations (<0.5/paper)
  • Incomplete metadata (no abstracts, missing author info)
  • Suspicious/unknown publisher
  • Sudden publication volume spikes
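A minimal sketch of how the thresholds quoted above could be turned into flags. The function and flag strings are illustrative only; the real analyzers weigh many more signals than these three.

```python
def pattern_signals(papers_per_year: float, citations_per_paper: float,
                    avg_abstract_len: int) -> list[str]:
    """Collect green/red flags using the thresholds quoted in the lists above."""
    flags = []
    if papers_per_year > 1000:
        flags.append("publication mill volume (>1000/year)")
    elif 20 <= papers_per_year <= 500:
        flags.append("consistent publication volume")
    if citations_per_paper < 0.5:
        flags.append("extremely low citations (<0.5/paper)")
    elif citations_per_paper > 3:
        flags.append("healthy citation pattern (>3/paper)")
    if avg_abstract_len < 100:
        flags.append("poor quality abstracts (<100 chars)")
    return flags

# A journal resembling Scenario 3 below (150 papers/year, 5 citations/paper):
print(pattern_signals(150, 5.0, 850))
# ['consistent publication volume', 'healthy citation pattern (>3/paper)']
```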

3. Confidence Scoring

Final confidence is determined by:

  • Source authority: DOAJ/Scopus > Pattern analysis > Smaller lists
  • Agreement: Multiple sources agreeing → higher confidence
  • Evidence strength: Strong indicators > weak signals
  • Cross-validation: Consistent data across sources increases confidence
  • Retraction data: High retraction rates lower confidence for "legitimate" journals

4. Result Combination

The dispatcher aggregates all backend results:

  • Conflicting assessments are resolved by source weight
  • Multiple agreeing sources boost confidence
  • Pattern analysis supplements curated databases
  • Detailed reasoning explains the assessment
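Conceptually, the aggregation described above resembles a weighted vote. The `WEIGHTS` values and `combine` helper below are invented for illustration; the dispatcher's actual weighting is internal to the tool.

```python
# Hypothetical source weights: curated databases outweigh pattern analyzers.
WEIGHTS = {"doaj": 0.95, "scopus": 0.9, "bealls": 0.85,
           "openalex": 0.6, "crossref": 0.55}

def combine(results: dict[str, str]) -> tuple[str, float]:
    """Resolve conflicting verdicts by source weight; agreement boosts confidence."""
    scores = {"legitimate": 0.0, "predatory": 0.0}
    for source, verdict in results.items():
        if verdict in scores:
            scores[verdict] += WEIGHTS.get(source, 0.5)  # default weight for small lists
    if scores["legitimate"] == scores["predatory"]:
        return "insufficient_data", 0.0
    winner = max(scores, key=scores.get)
    total = scores["legitimate"] + scores["predatory"]
    return winner, round(scores[winner] / total, 2)

print(combine({"scopus": "legitimate", "openalex": "legitimate"}))  # ('legitimate', 1.0)
print(combine({"bealls": "predatory", "openalex": "legitimate"}))   # ('predatory', 0.59)
```

Agreeing sources push the winning share toward 1.0, while a conflict between a heavy curated source and a lighter pattern analyzer yields a lower, more cautious confidence.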

Example Assessment Scenarios

Scenario 1: Well-Known Legitimate Journal

Input: "Nature"
│
├─ DOAJ: ✗ Not found (subscription journal, not open access)
├─ Scopus: ✓ Found → "legitimate"
├─ Predatory Lists: ✗ Not found
├─ Retraction Watch: ✓ Found → 153 retractions, 0.034% rate (within normal)
├─ OpenAlex: ✓ Found → 446,231 publications, healthy citations
├─ Crossref: ✓ Found → Complete metadata, Nature Publishing Group
│
Result: LEGITIMATE (confidence: 0.95)
Reasoning: "Found in Scopus with excellent publication patterns and metadata quality"

Scenario 2: Known Predatory Journal

Input: "International Journal of Advanced Computer Science and Applications"
│
├─ DOAJ: ✗ Not found
├─ Predatory Lists: ✓ Found in Kscien database → "predatory"
├─ Retraction Watch: ✗ Not found
├─ OpenAlex: ✓ Found → High volume (>800/year), low citations
├─ Crossref: ✓ Found → Poor metadata quality
│
Result: PREDATORY (confidence: 0.90)
Reasoning: "Listed in Kscien predatory database, confirmed by publication patterns"

Scenario 3: Unknown Journal (Pattern Analysis)

Input: "Emerging Regional Journal"
│
├─ DOAJ: ✗ Not found
├─ Scopus: ✗ Not found
├─ Predatory Lists: ✗ Not found
├─ OpenAlex: ✓ Found → 150 papers/year, 5 citations/paper average
├─ Crossref: ✓ Found → Good metadata, established publisher
│
Result: INSUFFICIENT_DATA (confidence: 0.45)
Reasoning: "Not in major databases; pattern analysis suggests legitimate practices but low confidence"

Optional: Scopus Journal List

To enhance coverage with Scopus data:

  1. Download the spreadsheet from researchgate.net
  2. Create directory: mkdir -p .aletheia-probe/scopus
  3. Place Excel file (e.g., ext_list_October_2024.xlsx) in this directory
  4. Run aletheia-probe sync to process the data
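
The steps above condense to the following shell commands. The spreadsheet filename and download location are illustrative; use whichever Scopus export you actually downloaded.

```shell
# One-off setup for the optional Scopus source (paths are illustrative).
mkdir -p .aletheia-probe/scopus
cp ~/Downloads/ext_list_October_2024.xlsx .aletheia-probe/scopus/

# Re-run the sync so the new file is picked up and processed.
aletheia-probe sync
```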

Benefits: Adds nearly 30,000 subscription journals from major publishers (Elsevier, Springer, Wiley, etc.)

Documentation

AI-Assisted Development

This project explicitly encourages and welcomes the use of AI coding agents as part of the development workflow. AI tools are valuable for:

  • Code generation and refactoring
  • Writing tests and documentation
  • Code review and quality improvement
  • Problem-solving and debugging
  • Learning and exploring the codebase

Guidelines for AI-Assisted Contributions:

  1. Review and Verify: All AI-generated code must be thoroughly reviewed, tested, and understood before submission
  2. Quality Standards: AI-assisted code must meet the same quality standards as manually written code (see AICodingAgent.md)
  3. Transparency: Contributors may indicate AI assistance in commit messages or PR descriptions using tags like [AI-assisted]
  4. Responsibility: Contributors remain fully responsible for all submitted code, regardless of how it was generated
  5. Security: Extra attention should be paid to security implications of AI-generated code

For detailed guidelines on using AI coding agents with this project, see AICodingAgent.md.

License

MIT License - see LICENSE file for details.

Download files

Download the file for your platform.

Source Distribution

aletheia_probe-0.1.0.tar.gz (109.5 kB)


Built Distribution

aletheia_probe-0.1.0-py3-none-any.whl (132.0 kB)


File details

Details for the file aletheia_probe-0.1.0.tar.gz.

File metadata

  • Download URL: aletheia_probe-0.1.0.tar.gz
  • Upload date:
  • Size: 109.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for aletheia_probe-0.1.0.tar.gz:

Algorithm | Hash digest
--- | ---
SHA256 | f8763370ddfe570aea871542cad438aef774718c13067e20bf798f872185d79f
MD5 | 1fa1aa9b7898f256563df65f4553a5e0
BLAKE2b-256 | 3c5beaebe07e5749a81b04f138a1ac43af9a7fdd9ca3cd7608d44b15408a82c5


File details

Details for the file aletheia_probe-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: aletheia_probe-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 132.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for aletheia_probe-0.1.0-py3-none-any.whl:

Algorithm | Hash digest
--- | ---
SHA256 | 5ef658ea00189177ecacc825b2aa9ab2fe18fb8f2aa4cab24c01163230aa4786
MD5 | c812ed96d932578d128a41d4d269926a
BLAKE2b-256 | fecbb38bd6368056c075905590d7124d13c1e97cd955ccfe02bb8e5d681fd241

