
Extracta

Modular Content Analysis Platform for research, assessment, and academic integrity checking.

Extracta provides a unified interface for extracting and analyzing content from diverse media types including documents, images, repositories, and web content. It supports both research-focused deep analysis and assessment-oriented quality evaluation, with specialized tools for academic integrity validation.

✨ Key Features

  • 🧩 Modular Architecture: Pluggable lenses and analyzers for different content types
  • 📚 Academic Integrity: Citation-reference validation, bibliography checking, URL verification, AI conversation analysis
  • 🤖 AI Conversation Analysis: Cognitive intent classification for AI-assisted learning assessment
  • 🔍 Multiple Analysis Modes: Research and assessment workflows
  • 📄 Rich Content Support: Text, images, documents, repositories, presentations, spreadsheets, AI conversations
  • 🎯 Rubric-Based Assessment: Custom rubrics for structured evaluation
  • 🛡️ Security First: Input sanitization, URL validation, malicious content detection
  • 🧠 Intelligent Analysis: Pattern detection, quality scoring, integrity validation, learning pattern recognition
  • 💻 Multiple Interfaces: CLI, Python API, and Web API
  • 🔧 Modern Python: Built with uv, ruff, mypy, and pytest

Installation

From PyPI

pip install extracta

From Source

git clone https://github.com/michaelborck-education/extracta.git
cd extracta
pip install -e .

Optional Dependencies

Install with specific feature support:

pip install extracta[audio]     # Audio processing (faster-whisper for Apple Silicon)
pip install extracta[video]     # Video processing
pip install extracta[text]      # Enhanced text analysis (spaCy, NLTK)
pip install extracta[image]     # Image analysis with OCR
pip install extracta[code]      # Code analysis
pip install extracta[citation]  # Academic integrity (CrossRef, URL validation)
pip install extracta[conversation]  # AI conversation analysis (Gemini default)
pip install extracta[openai]    # OpenAI LLM provider
pip install extracta[claude]    # Anthropic Claude LLM provider
pip install extracta[openrouter] # OpenRouter unified API
pip install extracta[api]       # Web API server (FastAPI, Uvicorn)
pip install extracta[all]       # All features

Usage

Command Line

Basic Content Analysis

# Analyze document for research insights
extracta analyze research_paper.pdf --mode research --output analysis.json

# Assess student submission quality
extracta analyze essay.docx --mode assessment --output feedback.json

# Analyze repository structure and content
extracta analyze https://github.com/user/repo --mode assessment

Academic Integrity Checking

# Comprehensive citation and reference validation
extracta citation analyze student_paper.pdf --output integrity_check.json

# AI conversation cognitive intent analysis (with different LLM providers)
extracta citation conversation chatgpt_export.json --provider gemini --output analysis.json
extracta citation conversation chat.json --provider claude --model claude-3-sonnet-20240229
extracta citation conversation chat.json --provider openai --model gpt-4
extracta citation conversation chat.json --provider openrouter --model anthropic/claude-3-haiku

# Results include:
# - Citation-reference relationship validation
# - Bibliography padding detection
# - URL accessibility and domain reputation
# - AI conversation learning pattern analysis
# - Academic integrity scoring

Python API

Basic Content Analysis

from extracta import TextAnalyzer

analyzer = TextAnalyzer()
result = analyzer.analyze(text_content, mode="research")
print(result)

Academic Integrity Analysis

from extracta.analyzers import CitationAnalyzer, ReferenceAnalyzer, URLAnalyzer, ConversationAnalyzer

# Citation-reference validation
citation_analyzer = CitationAnalyzer()
citation_result = citation_analyzer.analyze(document_text)

# Bibliography quality assessment
reference_analyzer = ReferenceAnalyzer()
reference_result = reference_analyzer.analyze(document_text)

# URL validation and reputation checking
url_analyzer = URLAnalyzer()
url_result = url_analyzer.analyze(document_text)

# AI conversation cognitive intent analysis (with different providers)
conversation_analyzer = ConversationAnalyzer(provider="claude", model="claude-3-sonnet-20240229")
conversation_result = conversation_analyzer.analyze(conversation_json_data)

# Or use OpenAI
conversation_analyzer = ConversationAnalyzer(provider="openai", model="gpt-4")
conversation_result = conversation_analyzer.analyze(conversation_json_data)

# Combined integrity score
integrity_score = citation_result['citation_analysis']['academic_integrity_score']
learning_quality = conversation_result['conversation_analysis']['learning_assessment']['learning_quality_score']
print(f"Academic Integrity Score: {integrity_score}/100")
print(f"AI Learning Quality Score: {learning_quality}/100")

Grading and Assessment

from extracta.grading.rubric_manager import RubricRepository, get_default_rubric
from extracta.grading.feedback_generator import FeedbackGenerator

# Load or create a rubric
repo = RubricRepository("rubrics")
rubric = get_default_rubric("academic")  # or repo.load("my-rubric")

# Generate feedback based on analysis results
generator = FeedbackGenerator()
feedback = generator.generate_feedback(
    rubric=rubric,
    analysis_data=analysis_result,
    audience="student",
    detail="detailed"
)

🎓 Academic Integrity Features

Extracta provides comprehensive tools for detecting academic integrity issues and validating scholarly work:

Citation Analysis

  • Citation-Reference Validation: Ensures all references have corresponding in-text citations
  • Bibliography Padding Detection: Identifies references without citations
  • Citation Stuffing Detection: Flags excessive citations in single sentences
  • Style Recognition: Supports APA, MLA, Chicago, Harvard, and Numeric styles
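
As an illustration of the citation-reference matching idea, here is a minimal regex-based sketch. The function name and the APA-only citation pattern are assumptions for the example, not Extracta's actual implementation:

```python
import re


def find_uncited_references(document_text: str, references: list[str]) -> list[str]:
    """Return references whose first author's surname never appears as an
    in-text citation -- a rough signal of bibliography padding."""
    uncited = []
    for ref in references:
        # Assume the reference entry starts with the first author's surname,
        # e.g. "Smith, J. (2020). A study."
        match = re.match(r"([A-Z][a-zA-Z'-]+)", ref)
        if not match:
            continue
        surname = match.group(1)
        # Look for APA-style in-text citations such as "(Smith, 2020)"
        # or "Smith (2020)": surname followed by a four-digit year.
        pattern = rf"\b{re.escape(surname)}\b[^)]*\(?\d{{4}}\)?"
        if not re.search(pattern, document_text):
            uncited.append(ref)
    return uncited
```

A real validator would also normalize multi-author entries, numeric citation styles, and surname collisions, which this sketch ignores.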

Reference Validation

  • DOI Verification: Validates Digital Object Identifiers with CrossRef API
  • URL Accessibility: Checks if referenced URLs are accessible (404 detection)
  • Domain Reputation: Analyzes source credibility (academic vs. commercial domains)
  • Format Validation: Ensures proper reference formatting and completeness
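
DOI verification can be sketched against the public CrossRef REST API (`https://api.crossref.org/works/{doi}`). The function names below are illustrative, not Extracta's API, and a production validator would add rate limiting, caching, and a descriptive User-Agent:

```python
import re
import urllib.request

# Common DOI shape: "10." + registrant code + "/" + suffix
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+\b")


def extract_dois(text: str) -> list[str]:
    """Pull DOI strings out of a reference list."""
    return DOI_PATTERN.findall(text)


def doi_resolves(doi: str, timeout: float = 5.0) -> bool:
    """Check a DOI against the public CrossRef REST API.

    Returns True when CrossRef knows the DOI (HTTP 200), False on a
    404 or any network failure.
    """
    url = f"https://api.crossref.org/works/{doi}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False
```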

AI Conversation Analysis

  • Cognitive Intent Classification: Uses configurable LLM to classify user prompts as Delegation vs. Scaffolding
  • Multi-Provider Support: Gemini, OpenAI GPT, Anthropic Claude, OpenRouter unified API
  • Learning Pattern Recognition: Analyzes conversation flow for active learning behaviors
  • Session Quality Scoring: Provides learning quality assessment (0-100)
  • Platform Support: ChatGPT, Claude, Bard, and generic conversation formats
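
The Delegation vs. Scaffolding split can be illustrated with a toy keyword heuristic. Extracta's ConversationAnalyzer hands this judgement to a configurable LLM, so the cue lists and function below are purely illustrative:

```python
# Hypothetical cue lists -- a stand-in for the LLM classification step.
DELEGATION_CUES = ("write my", "do my", "give me the answer", "solve this for me")
SCAFFOLDING_CUES = ("explain", "why does", "how does", "help me understand")


def classify_intent(prompt: str) -> str:
    """Crude keyword heuristic separating Delegation ('do it for me')
    from Scaffolding ('help me learn') prompts."""
    text = prompt.lower()
    if any(cue in text for cue in DELEGATION_CUES):
        return "delegation"
    if any(cue in text for cue in SCAFFOLDING_CUES):
        return "scaffolding"
    return "unclassified"
```

Keyword matching is brittle, which is exactly why the real analyzer uses an LLM: "explain the answer you would write for me" defeats any fixed cue list.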

Security & Privacy

  • Input Sanitization: Detects and prevents malicious content, hidden text, and LLM jailbreaks
  • URL Protection: SSRF prevention with academic domain whitelisting
  • Content Validation: Size limits, encoding validation, and integrity checking
  • Privacy First: No data persistence, user-controlled processing, ephemeral analysis
  • Safe Processing: Static analysis only, no code execution or external script running
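
The SSRF-prevention idea can be sketched as a pre-fetch URL check. This is a simplified illustration, not Extracta's validator; among other things it does not resolve hostnames to IPs or re-validate after redirects:

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Reject URLs that could enable SSRF: non-HTTP schemes, missing
    hosts, and private/loopback/link-local IP literals."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        ip = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        # Hostname rather than an IP literal; a full check would
        # resolve it and re-test the resulting addresses.
        return True
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)
```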

Repository Analysis

  • WordPress Detection: Identifies WordPress projects and analyzes themes/plugins
  • Code Quality Assessment: Evaluates repository structure and practices
  • File Type Analysis: Comprehensive analysis of all repository contents

Integrity Scoring

  • Academic Integrity Score: 0-100 scale based on multiple validation criteria
  • Detailed Reporting: Specific issues and recommendations
  • Pattern Detection: Identifies suspicious citation and reference patterns

Development

Setup

# Clone repository
git clone https://github.com/michaelborck-education/extracta.git
cd extracta

# Create virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -e ".[dev]"

Testing

# Run tests
pytest

# With coverage
pytest --cov=extracta

Linting and Type Checking

# Lint with ruff
ruff check .

# Type check with mypy
mypy extracta

# Format code
ruff format .

Building and Publishing

# Build package
uv build

# Publish to PyPI
uv venv  # if not already created
source .venv/bin/activate
uv pip install twine
twine upload dist/* --repository pypi

Project Structure

extracta/
├── extracta/
│   ├── lenses/                # Content extraction modules
│   │   ├── audio_lens/        # Audio file processing
│   │   ├── video_lens/        # Video file processing
│   │   ├── image_lens/        # Image processing with OCR
│   │   ├── document_lens/     # Text & Office document processing
│   │   ├── presentation_lens/ # Presentation file analysis
│   │   ├── repo_lens/         # Repository-level analysis
│   │   └── base_lens.py       # Common lens interface
│   ├── analyzers/             # Content analysis modules
│   │   ├── text_analyzer/     # Text quality and readability
│   │   ├── image_analyzer/    # Image quality assessment
│   │   ├── citation_analyzer/ # Citation-reference validation
│   │   ├── reference_analyzer/ # Bibliography quality assessment
│   │   ├── url_analyzer/      # URL validation and reputation
│   │   └── base_analyzer.py   # Common analyzer interface
│   ├── grading/               # Assessment and grading
│   │   ├── rubric_manager/    # Rubric creation and management
│   │   └── feedback_generator.py # AI-powered feedback
│   ├── orchestration/         # Workflow management
│   ├── shared/                # Common utilities
│   └── cli/                   # Command-line interface
├── tests/                     # Test suite
├── docs/                      # Documentation
├── examples/                  # Usage examples
├── pyproject.toml             # Package configuration
└── README.md                  # This file

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Run the test suite
  6. Submit a pull request

License

MIT License - see LICENSE file for details.

🚀 Current Status & Roadmap

✅ Implemented Features

  • Text Analysis: Readability, sentiment, vocabulary, quality metrics
  • Image Analysis: OCR, quality assessment, accessibility
  • Document Processing: PDF, DOCX, Office docs (PPTX, Excel, CSV)
  • Citation Validation: Citation-reference relationships, academic integrity
  • Reference Analysis: Bibliography quality, DOI validation, CrossRef integration
  • URL Validation: Accessibility checking, domain reputation, robots.txt
  • AI Conversation Analysis: Cognitive intent classification, learning pattern recognition
  • Repository Analysis: GitHub repo analysis, WordPress detection
  • Rubric System: Custom rubrics, structured assessment
  • CLI Interface: Multiple commands for different analysis types
  • Web API: REST API for integration
  • Python API: Programmatic access

🔄 In Development

  • Audio Lens: Speech-to-text, audio quality analysis
  • Video Lens: Frame analysis, transcript processing
  • Code Analyzer: Code quality metrics, best practices
  • Screenshot Integration: Visual URL validation
  • Wayback Machine: Archive URL checking

📋 Future Enhancements

  • URL Conversation Input: Direct analysis of conversations from URLs (ChatGPT share links, etc.)
  • GUI Application: Web-based interface
  • LMS Integration: Canvas, Blackboard, Moodle
  • Advanced ML Models: Fine-tuned for educational content
  • Collaborative Features: Multi-user assessment workflows
  • Plugin Architecture: Custom lenses and analyzers
