Modular content analysis platform for research, assessment, and academic integrity checking

Extracta

Modular Content Analysis Platform for research, assessment, and academic integrity checking.

Extracta provides a unified interface for extracting and analyzing content from diverse media types including documents, images, repositories, and web content. It supports both research-focused deep analysis and assessment-oriented quality evaluation, with specialized tools for academic integrity validation.

✨ Key Features

  • 🧩 Modular Architecture: Pluggable lenses and analyzers for different content types
  • 📚 Academic Integrity: Citation-reference validation, bibliography checking, URL verification, AI conversation analysis
  • 🤖 AI Conversation Analysis: Cognitive intent classification for AI-assisted learning assessment
  • 🔍 Multiple Analysis Modes: Research and assessment workflows
  • 📄 Rich Content Support: Text, images, documents, repositories, presentations, spreadsheets, AI conversations
  • 🎯 Rubric-Based Assessment: Custom rubrics for structured evaluation
  • 🛡️ Security First: Input sanitization, URL validation, malicious content detection
  • 🧠 Intelligent Analysis: Pattern detection, quality scoring, integrity validation, learning pattern recognition
  • 💻 Multiple Interfaces: CLI, Python API, and Web API
  • 🔧 Modern Python: Built with uv, ruff, mypy, and pytest

Installation

From PyPI

pip install extracta

From Source

git clone https://github.com/michaelborck-education/extracta.git
cd extracta
pip install -e .

Optional Dependencies

Install with specific feature support:

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern
pip install "extracta[audio]"         # Audio processing (faster-whisper for Apple Silicon)
pip install "extracta[video]"         # Video processing
pip install "extracta[text]"          # Enhanced text analysis (spaCy, NLTK)
pip install "extracta[image]"         # Image analysis with OCR
pip install "extracta[code]"          # Code analysis
pip install "extracta[citation]"      # Academic integrity (CrossRef, URL validation)
pip install "extracta[conversation]"  # AI conversation analysis (Gemini default)
pip install "extracta[openai]"        # OpenAI LLM provider
pip install "extracta[claude]"        # Anthropic Claude LLM provider
pip install "extracta[openrouter]"    # OpenRouter unified API
pip install "extracta[api]"           # Web API server (FastAPI, Uvicorn)
pip install "extracta[all]"           # All features

Usage

Command Line

Basic Content Analysis

# Analyze document for research insights
extracta analyze research_paper.pdf --mode research --output analysis.json

# Assess student submission quality
extracta analyze essay.docx --mode assessment --output feedback.json

# Analyze repository structure and content
extracta analyze https://github.com/user/repo --mode assessment

Academic Integrity Checking

# Comprehensive citation and reference validation
extracta citation analyze student_paper.pdf --output integrity_check.json

# AI conversation cognitive intent analysis (with different LLM providers)
extracta citation conversation chatgpt_export.json --provider gemini --output analysis.json
extracta citation conversation chat.json --provider claude --model claude-3-sonnet-20240229
extracta citation conversation chat.json --provider openai --model gpt-4
extracta citation conversation chat.json --provider openrouter --model anthropic/claude-3-haiku

# Results include:
# - Citation-reference relationship validation
# - Bibliography padding detection
# - URL accessibility and domain reputation
# - AI conversation learning pattern analysis
# - Academic integrity scoring

Python API

Basic Content Analysis

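from extracta import TextAnalyzer

# Placeholder input: replace with the text you want to analyze
text_content = "Formative feedback improves learning outcomes when it is timely and specific."

analyzer = TextAnalyzer()
result = analyzer.analyze(text_content, mode="research")
print(result)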

Academic Integrity Analysis

from extracta.analyzers import CitationAnalyzer, ReferenceAnalyzer, URLAnalyzer, ConversationAnalyzer

# Citation-reference validation
citation_analyzer = CitationAnalyzer()
citation_result = citation_analyzer.analyze(document_text)

# Bibliography quality assessment
reference_analyzer = ReferenceAnalyzer()
reference_result = reference_analyzer.analyze(document_text)

# URL validation and reputation checking
url_analyzer = URLAnalyzer()
url_result = url_analyzer.analyze(document_text)

# AI conversation cognitive intent analysis (with different providers)
conversation_analyzer = ConversationAnalyzer(provider="claude", model="claude-3-sonnet-20240229")
conversation_result = conversation_analyzer.analyze(conversation_json_data)

# Or use OpenAI
conversation_analyzer = ConversationAnalyzer(provider="openai", model="gpt-4")
conversation_result = conversation_analyzer.analyze(conversation_json_data)

# Report the integrity score and the AI learning quality score
integrity_score = citation_result['citation_analysis']['academic_integrity_score']
learning_quality = conversation_result['conversation_analysis']['learning_assessment']['learning_quality_score']
print(f"Academic Integrity Score: {integrity_score}/100")
print(f"AI Learning Quality Score: {learning_quality}/100")
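
The nested lookups above raise a KeyError if any level of the result is absent. The following is a minimal sketch of a defensive reader, assuming only that analyzer results are plain nested dictionaries shaped like the lookups above. The helper name `get_nested` and the sample data are hypothetical and not part of Extracta.

```python
from collections.abc import Mapping
from typing import Any


def get_nested(data: Mapping[str, Any], *keys: str, default: Any = None) -> Any:
    """Walk nested mappings key by key; return `default` if any level is missing."""
    current: Any = data
    for key in keys:
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical result, shaped like the lookups in the example above.
sample_result = {"citation_analysis": {"academic_integrity_score": 82}}

score = get_nested(sample_result, "citation_analysis", "academic_integrity_score")
quality = get_nested(
    sample_result,
    "conversation_analysis",
    "learning_assessment",
    "learning_quality_score",
    default="n/a",
)
print(f"Academic Integrity Score: {score}/100")
print(f"AI Learning Quality Score: {quality}")
```

Returning a default instead of raising keeps a report script running when one analyzer was skipped or failed.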

Grading and Assessment

from extracta.grading.rubric_manager import RubricRepository, get_default_rubric
from extracta.grading.feedback_generator import FeedbackGenerator

# Load or create a rubric
repo = RubricRepository("rubrics")
rubric = get_default_rubric("academic")  # or repo.load("my-rubric")

# Generate feedback based on analysis results
generator = FeedbackGenerator()
feedback = generator.generate_feedback(
    rubric=rubric,
    analysis_data=analysis_result,
    audience="student",
    detail="detailed"
)

🎓 Academic Integrity Features

Extracta provides comprehensive tools for detecting academic integrity issues and validating scholarly work:

Citation Analysis

  • Citation-Reference Validation: Checks that in-text citations and reference-list entries correspond to each other
  • Bibliography Padding Detection: Identifies references that are never cited in the text
  • Citation Stuffing Detection: Flags excessive citations in single sentences
  • Style Recognition: Supports APA, MLA, Chicago, Harvard, and Numeric styles
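
The checks listed above can be illustrated with a small standalone sketch. It is not Extracta's implementation: it assumes simple APA-style parenthetical citations such as "(Smith, 2020)" and reference entries that begin "Smith, J. (2020)". The function names and the `max_citations` threshold are hypothetical.

```python
import re

# Simple APA-style parenthetical citation, e.g. "(Smith, 2020)".
# Narrative citations, "et al.", and multi-source parentheses are not handled.
CITATION_RE = re.compile(r"\(([A-Z][A-Za-z'-]+),\s*(\d{4})\)")
# Start of a reference entry, e.g. "Smith, J. (2020). Title..."
REFERENCE_RE = re.compile(r"^([A-Z][A-Za-z'-]+),.*?\((\d{4})\)", re.MULTILINE)


def cross_check(body: str, bibliography: str) -> dict:
    """Compare (author, year) pairs found in the body and in the bibliography."""
    cited = set(CITATION_RE.findall(body))
    listed = set(REFERENCE_RE.findall(bibliography))
    return {
        # Cited in the text but absent from the reference list.
        "missing_references": sorted(cited - listed),
        # Listed but never cited: a possible sign of bibliography padding.
        "uncited_references": sorted(listed - cited),
    }


def stuffed_sentences(body: str, max_citations: int = 3) -> list:
    """Return sentences containing more than `max_citations` citations."""
    sentences = re.split(r"(?<=[.!?])\s+", body)
    return [s for s in sentences if len(CITATION_RE.findall(s)) > max_citations]


body = "Prior work shows gains (Smith, 2020). Others disagree (Jones, 2019)."
bibliography = "Smith, J. (2020). Learning gains.\nLee, K. (2018). Unused source."
report = cross_check(body, bibliography)
print(report)
```

Matching on (author, year) pairs is deliberately crude. A real checker would need per-style parsers for the citation styles listed above.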

Reference Validation

  • DOI Verification: Validates Digital Object Identifiers with CrossRef API
  • URL Accessibility: Checks if referenced URLs are accessible (404 detection)
  • Domain Reputation: Analyzes source credibility (academic vs. commercial domains)
  • Format Validation: Ensures proper reference formatting and completeness
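
As a rough offline illustration of the first and third checks above, the sketch below tests whether a DOI is well formed, builds the Crossref REST API URL one would query for it, and sorts a URL's host into a coarse category. The regular expression is one Crossref has published as matching most modern DOIs. The suffix list, category names, and function names are assumptions made for this example and do not describe Extracta's own rules.

```python
import re
from urllib.parse import quote, urlparse

# Pattern published by Crossref as matching the large majority of modern DOIs.
DOI_RE = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

# Illustrative suffixes only; a real reputation list would be far longer.
ACADEMIC_SUFFIXES = (".edu", ".ac.uk", ".edu.au")


def is_well_formed_doi(doi: str) -> bool:
    """Syntax check only; it does not prove that the DOI is registered."""
    return bool(DOI_RE.match(doi.strip()))


def crossref_lookup_url(doi: str) -> str:
    """URL of the Crossref works endpoint for this DOI (no request is sent)."""
    return "https://api.crossref.org/works/" + quote(doi.strip(), safe="/")


def classify_domain(url: str) -> str:
    """Coarse source category derived from the host name."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return "unknown"
    if host.endswith(ACADEMIC_SUFFIXES):
        return "academic"
    if host.endswith(".gov"):
        return "government"
    return "other"


print(is_well_formed_doi("10.1000/xyz123"))
print(crossref_lookup_url("10.1000/xyz123"))
print(classify_domain("https://www.mit.edu/paper.pdf"))
```

A syntax check is cheap and needs no network, so it is a reasonable filter to run before any remote lookup.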

AI Conversation Analysis

  • Cognitive Intent Classification: Uses a configurable LLM to classify user prompts as Delegation vs. Scaffolding
  • Multi-Provider Support: Gemini, OpenAI GPT, Anthropic Claude, OpenRouter unified API
  • Learning Pattern Recognition: Analyzes conversation flow for active learning behaviors
  • Session Quality Scoring: Provides learning quality assessment (0-100)
  • Platform Support: ChatGPT, Claude, Bard, and generic conversation formats
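
To make the 0-100 session score concrete, here is a minimal sketch that turns per-prompt labels into a score. It assumes the LLM has already labeled each user prompt, and it uses the share of scaffolding prompts as the score. That formula, the label strings, and the function name are hypothetical; Extracta's actual scoring is not documented here.

```python
from collections import Counter
from typing import Iterable

# Hypothetical label strings for the two intent classes named above.
SCAFFOLDING = "scaffolding"
DELEGATION = "delegation"


def learning_quality_score(labels: Iterable[str]) -> float:
    """Share of scaffolding prompts among classified prompts, scaled to 0-100."""
    counts = Counter(labels)
    total = counts[SCAFFOLDING] + counts[DELEGATION]
    if total == 0:
        return 0.0  # nothing classified, so there is no evidence of learning behavior
    return round(100 * counts[SCAFFOLDING] / total, 1)


labels = [SCAFFOLDING, SCAFFOLDING, DELEGATION, SCAFFOLDING]
session_score = learning_quality_score(labels)
print(f"Learning quality: {session_score}/100")
```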

Security & Privacy

  • Input Sanitization: Detects and prevents malicious content, hidden text, and LLM jailbreaks
  • URL Protection: SSRF prevention with academic domain whitelisting
  • Content Validation: Size limits, encoding validation, and integrity checking
  • Privacy First: No data persistence, user-controlled processing, ephemeral analysis
  • Safe Processing: Static analysis only, no code execution or external script running
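
The URL protection idea can be sketched with the standard library alone. This example assumes a strict allowlist policy: only http and https URLs whose host falls under an approved domain are accepted, and IP literals are always rejected. The domain list and the function name are hypothetical, and this is not Extracta's code. A production check would also need to resolve DNS and validate the resolved addresses and any redirects.

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
# Hypothetical allowlist of academic and scholarly-infrastructure domains.
ALLOWED_DOMAINS = ("edu", "ac.uk", "doi.org", "crossref.org")


def is_safe_url(url: str, allowed_domains=ALLOWED_DOMAINS) -> bool:
    """Accept only http(s) URLs whose host is on the domain allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        return False  # blocks file://, ftp://, gopher://, and similar
    host = (parsed.hostname or "").lower()
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # not an IP literal, continue to the allowlist check
    else:
        # IP literals are rejected outright, which covers loopback,
        # private ranges, and cloud metadata addresses.
        return False
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


for candidate in (
    "https://doi.org/10.1000/xyz123",
    "http://169.254.169.254/latest/meta-data/",
    "file:///etc/passwd",
):
    print(candidate, is_safe_url(candidate))
```

Matching on `"." + d` rather than a bare suffix prevents a host such as `evildoi.org` from passing as `doi.org`.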

Repository Analysis

  • WordPress Detection: Identifies WordPress projects and analyzes themes/plugins
  • Code Quality Assessment: Evaluates repository structure and practices
  • File Type Analysis: Comprehensive analysis of all repository contents
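
WordPress detection can be approximated by static inspection of file names, with no code execution. The sketch below relies on well-known WordPress conventions: core files such as `wp-config.php` and `wp-load.php`, the `wp-content` and `wp-includes` directories, and the "Theme Name:" header in a theme's `style.css`. The function names are hypothetical, and Extracta's real heuristics may differ.

```python
import tempfile
from pathlib import Path

# Files and directories that ship with a standard WordPress installation.
WORDPRESS_MARKERS = (
    "wp-config.php",
    "wp-config-sample.php",
    "wp-load.php",
    "wp-content",
    "wp-includes",
)


def looks_like_wordpress(repo_root) -> bool:
    """True if any well-known WordPress marker exists at the repository root."""
    root = Path(repo_root)
    return any((root / marker).exists() for marker in WORDPRESS_MARKERS)


def find_themes(repo_root) -> list:
    """Names of directories whose style.css carries a 'Theme Name:' header."""
    themes = []
    for css in Path(repo_root).rglob("style.css"):
        head = css.read_text(encoding="utf-8", errors="ignore")[:2048]
        if "Theme Name:" in head:
            themes.append(css.parent.name)
    return sorted(themes)


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    theme_dir = Path(tmp) / "wp-content" / "themes" / "mytheme"
    theme_dir.mkdir(parents=True)
    (theme_dir / "style.css").write_text("/*\nTheme Name: My Theme\n*/\n", encoding="utf-8")
    detected = looks_like_wordpress(tmp)
    themes = find_themes(tmp)

print(detected, themes)
```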

Integrity Scoring

  • Academic Integrity Score: 0-100 scale based on multiple validation criteria
  • Detailed Reporting: Specific issues and recommendations
  • Pattern Detection: Identifies suspicious citation and reference patterns
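
One simple way to combine several validation criteria into a single 0-100 value is a weighted average of pass rates. The criteria names and weights below are invented for illustration and do not describe how Extracta computes its Academic Integrity Score.

```python
# Hypothetical criteria and weights; each pass rate is a fraction in [0, 1].
WEIGHTS = {
    "citations_matched": 40,  # in-text citations that have a reference entry
    "references_cited": 30,   # reference entries that are cited in the text
    "urls_reachable": 20,     # referenced URLs that respond
    "dois_valid": 10,         # DOIs that pass validation
}


def integrity_score(rates: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of pass rates, scaled to 0-100."""
    total_weight = sum(weights.values())
    weighted = sum(
        weight * min(max(rates.get(name, 0.0), 0.0), 1.0)  # clamp to [0, 1]
        for name, weight in weights.items()
    )
    return round(100 * weighted / total_weight, 1)


rates = {
    "citations_matched": 1.0,
    "references_cited": 0.5,
    "urls_reachable": 1.0,
    "dois_valid": 0.0,
}
overall = integrity_score(rates)
print(f"Academic Integrity Score: {overall}/100")
```

Missing criteria count as zero in this sketch, which penalizes incomplete analysis. Renormalizing over the criteria that were actually measured would be an equally defensible choice.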

Development

Setup

# Clone repository
git clone https://github.com/michaelborck-education/extracta.git
cd extracta

# Create virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -e ".[dev]"

Testing

# Run tests
pytest

# With coverage
pytest --cov=extracta

Linting and Type Checking

# Lint with ruff
ruff check .

# Type check with mypy
mypy extracta

# Format code
ruff format .

Building and Publishing

# Build package
uv build

# Publish to PyPI
uv venv  # if not already created
source .venv/bin/activate
uv pip install twine
twine upload dist/* --repository pypi

Project Structure

extracta/
├── extracta/
│   ├── lenses/              # Content extraction modules
│   │   ├── audio_lens/      # Audio file processing
│   │   ├── video_lens/      # Video file processing
│   │   ├── image_lens/      # Image processing with OCR
│   │   ├── document_lens/   # Text & Office document processing
│   │   ├── presentation_lens/ # Presentation file analysis
│   │   ├── repo_lens/       # Repository-level analysis
│   │   └── base_lens.py     # Common lens interface
│   ├── analyzers/           # Content analysis modules
│   │   ├── text_analyzer/   # Text quality and readability
│   │   ├── image_analyzer/  # Image quality assessment
│   │   ├── citation_analyzer/ # Citation-reference validation
│   │   ├── reference_analyzer/ # Bibliography quality assessment
│   │   ├── url_analyzer/    # URL validation and reputation
│   │   └── base_analyzer.py # Common analyzer interface
│   ├── grading/             # Assessment and grading
│   │   ├── rubric_manager/  # Rubric creation and management
│   │   └── feedback_generator.py # AI-powered feedback
│   ├── orchestration/       # Workflow management
│   ├── shared/              # Common utilities
│   └── cli/                 # Command-line interface
├── tests/                   # Test suite
├── docs/                    # Documentation
├── examples/                # Usage examples
├── pyproject.toml           # Package configuration
└── README.md                # This file

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Run the test suite
  6. Submit a pull request

License

MIT License - see LICENSE file for details.

🚀 Current Status & Roadmap

✅ Implemented Features

  • Text Analysis: Readability, sentiment, vocabulary, quality metrics
  • Image Analysis: OCR, quality assessment, accessibility
  • Document Processing: PDF, DOCX, Office docs (PPTX, Excel, CSV)
  • Citation Validation: Citation-reference relationships, academic integrity
  • Reference Analysis: Bibliography quality, DOI validation, CrossRef integration
  • URL Validation: Accessibility checking, domain reputation, robots.txt
  • AI Conversation Analysis: Cognitive intent classification, learning pattern recognition
  • Repository Analysis: GitHub repo analysis, WordPress detection
  • Rubric System: Custom rubrics, structured assessment
  • CLI Interface: Multiple commands for different analysis types
  • Web API: REST API for integration
  • Python API: Programmatic access

🔄 In Development

  • Audio Lens: Speech-to-text, audio quality analysis
  • Video Lens: Frame analysis, transcript processing
  • Code Analyzer: Code quality metrics, best practices
  • Screenshot Integration: Visual URL validation
  • Wayback Machine: Archive URL checking

📋 Future Enhancements

  • URL Conversation Input: Direct analysis of conversations from URLs (ChatGPT share links, etc.)
  • GUI Application: Web-based interface
  • LMS Integration: Canvas, Blackboard, Moodle
  • Advanced ML Models: Fine-tuned for educational content
  • Collaborative Features: Multi-user assessment workflows
  • Plugin Architecture: Custom lenses and analyzers
