Extracta
Modular Content Analysis Platform for research, assessment, and academic integrity checking.
Extracta provides a unified interface for extracting and analyzing content from diverse media types including documents, images, repositories, and web content. It supports both research-focused deep analysis and assessment-oriented quality evaluation, with specialized tools for academic integrity validation.
Key Features
- Modular Architecture: Pluggable lenses and analyzers for different content types
- Academic Integrity: Citation-reference validation, bibliography checking, URL verification, AI conversation analysis
- AI Conversation Analysis: Cognitive intent classification for AI-assisted learning assessment
- Multiple Analysis Modes: Research and assessment workflows
- Rich Content Support: Text, images, documents, repositories, presentations, spreadsheets, AI conversations
- Rubric-Based Assessment: Custom rubrics for structured evaluation
- Security First: Input sanitization, URL validation, malicious content detection
- Intelligent Analysis: Pattern detection, quality scoring, integrity validation, learning pattern recognition
- Multiple Interfaces: CLI, Python API, and Web API
- Modern Python: Built with uv, ruff, mypy, and pytest
Installation
From PyPI
pip install extracta
From Source
git clone https://github.com/michaelborck-education/extracta.git
cd extracta
pip install -e .
Optional Dependencies
Install with specific feature support:
# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern
pip install "extracta[audio]"         # Audio processing (faster-whisper for Apple Silicon)
pip install "extracta[video]"         # Video processing
pip install "extracta[text]"          # Enhanced text analysis (spaCy, NLTK)
pip install "extracta[image]"         # Image analysis with OCR
pip install "extracta[code]"          # Code analysis
pip install "extracta[citation]"      # Academic integrity (CrossRef, URL validation)
pip install "extracta[conversation]"  # AI conversation analysis (Gemini default)
pip install "extracta[openai]"        # OpenAI LLM provider
pip install "extracta[claude]"        # Anthropic Claude LLM provider
pip install "extracta[openrouter]"    # OpenRouter unified API
pip install "extracta[api]"           # Web API server (FastAPI, Uvicorn)
pip install "extracta[all]"           # All features
Usage
Command Line
Basic Content Analysis
# Analyze document for research insights
extracta analyze research_paper.pdf --mode research --output analysis.json
# Assess student submission quality
extracta analyze essay.docx --mode assessment --output feedback.json
# Analyze repository structure and content
extracta analyze https://github.com/user/repo --mode assessment
Academic Integrity Checking
# Comprehensive citation and reference validation
extracta citation analyze student_paper.pdf --output integrity_check.json
# AI conversation cognitive intent analysis (with different LLM providers)
extracta citation conversation chatgpt_export.json --provider gemini --output analysis.json
extracta citation conversation chat.json --provider claude --model claude-3-sonnet-20240229
extracta citation conversation chat.json --provider openai --model gpt-4
extracta citation conversation chat.json --provider openrouter --model anthropic/claude-3-haiku
# Results include:
# - Citation-reference relationship validation
# - Bibliography padding detection
# - URL accessibility and domain reputation
# - AI conversation learning pattern analysis
# - Academic integrity scoring
Python API
Basic Content Analysis
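from extracta import TextAnalyzer

# Any extracted text works here; this string is only a placeholder.
text_content = "Text extracted from a document or typed by a student."
analyzer = TextAnalyzer()
result = analyzer.analyze(text_content, mode="research")
print(result)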
Academic Integrity Analysis
from extracta.analyzers import CitationAnalyzer, ReferenceAnalyzer, URLAnalyzer, ConversationAnalyzer
# Citation-reference validation
citation_analyzer = CitationAnalyzer()
citation_result = citation_analyzer.analyze(document_text)
# Bibliography quality assessment
reference_analyzer = ReferenceAnalyzer()
reference_result = reference_analyzer.analyze(document_text)
# URL validation and reputation checking
url_analyzer = URLAnalyzer()
url_result = url_analyzer.analyze(document_text)
# AI conversation cognitive intent analysis (with different providers)
conversation_analyzer = ConversationAnalyzer(provider="claude", model="claude-3-sonnet-20240229")
conversation_result = conversation_analyzer.analyze(conversation_json_data)
# Or use OpenAI
conversation_analyzer = ConversationAnalyzer(provider="openai", model="gpt-4")
conversation_result = conversation_analyzer.analyze(conversation_json_data)
# Read the headline score from each result
integrity_score = citation_result['citation_analysis']['academic_integrity_score']
learning_quality = conversation_result['conversation_analysis']['learning_assessment']['learning_quality_score']
print(f"Academic Integrity Score: {integrity_score}/100")
print(f"AI Learning Quality Score: {learning_quality}/100")
Grading and Assessment
from extracta.grading.rubric_manager import RubricRepository, get_default_rubric
from extracta.grading.feedback_generator import FeedbackGenerator
# Load or create a rubric
repo = RubricRepository("rubrics")
rubric = get_default_rubric("academic") # or repo.load("my-rubric")
# Generate feedback based on analysis results
generator = FeedbackGenerator()
feedback = generator.generate_feedback(
    rubric=rubric,
    analysis_data=analysis_result,
    audience="student",
    detail="detailed",
)
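To make the idea of rubric-based assessment concrete, here is a minimal sketch of weighted rubric scoring. It does not use Extracta's rubric format, which is not shown above; the `Criterion` class and `weighted_grade` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric row (hypothetical structure, not Extracta's own)."""
    name: str
    weight: float    # relative importance of this criterion
    max_points: int  # highest mark available for this criterion


def weighted_grade(criteria: list[Criterion], awarded: dict[str, float]) -> float:
    """Return a 0-100 grade from per-criterion marks, weighted by importance."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        raise ValueError("rubric has no weighted criteria")
    # Missing criteria count as zero; marks above the maximum are capped.
    earned = sum(
        c.weight * min(awarded.get(c.name, 0), c.max_points) / c.max_points
        for c in criteria
    )
    return round(100 * earned / total_weight, 1)


rubric = [Criterion("argument", 2, 10), Criterion("citations", 1, 5)]
print(weighted_grade(rubric, {"argument": 8, "citations": 5}))
```

Normalizing each criterion to a fraction before weighting keeps criteria with different point scales comparable.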
Academic Integrity Features
Extracta provides comprehensive tools for detecting academic integrity issues and validating scholarly work:
Citation Analysis
- Citation-Reference Validation: Ensures all references have corresponding in-text citations
- Bibliography Padding Detection: Identifies references without citations
- Citation Stuffing Detection: Flags excessive citations in single sentences
- Style Recognition: Supports APA, MLA, Chicago, Harvard, and Numeric styles
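The cross-check between in-text citations and the reference list can be sketched as follows. This is an illustration under simplifying assumptions, not Extracta's implementation: it handles only APA-like author-year citations, and the regular expressions and the `cross_check` function are hypothetical.

```python
import re

# Author-year citations such as "(Smith, 2020)", "(Lee & Park, 2019)", "(Doe et al., 2021)".
CITATION_RE = re.compile(
    r"\(([A-Z][A-Za-z'-]+)(?:(?: & | and )[A-Z][A-Za-z'-]+| et al\.)?, (\d{4})\)"
)
# Start of a reference entry such as "Smith, J. (2020). Title."
REFERENCE_RE = re.compile(r"^([A-Z][A-Za-z'-]+),.*?\((\d{4})\)", re.MULTILINE)


def cross_check(body: str, bibliography: str) -> dict[str, set[tuple[str, int]]]:
    """Compare (first author, year) pairs found in the body and the bibliography."""
    cited = {(m.group(1), int(m.group(2))) for m in CITATION_RE.finditer(body)}
    listed = {(m.group(1), int(m.group(2))) for m in REFERENCE_RE.finditer(bibliography)}
    return {
        "uncited_references": listed - cited,  # candidates for bibliography padding
        "missing_references": cited - listed,  # cited in the text but never listed
    }


body = "Prior work (Smith, 2020) and (Lee & Park, 2019) agree, unlike (Doe et al., 2021)."
bibliography = (
    "Smith, J. (2020). First title.\n"
    "Lee, K., & Park, S. (2019). Second title.\n"
    "Brown, A. (2018). Never cited."
)
print(cross_check(body, bibliography))
```

Matching on first author and year is deliberately coarse; numeric styles and multiple works by the same author in one year would need extra handling.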
Reference Validation
- DOI Verification: Validates Digital Object Identifiers with CrossRef API
- URL Accessibility: Checks if referenced URLs are accessible (404 detection)
- Domain Reputation: Analyzes source credibility (academic vs. commercial domains)
- Format Validation: Ensures proper reference formatting and completeness
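Two of the checks above can be approximated offline. The sketch below tests whether a string has the shape of a DOI and sorts a URL's host into a rough category. It is an assumption-laden illustration: it never contacts CrossRef, and the suffix list and the function names `looks_like_doi` and `classify_domain` are hypothetical rather than part of Extracta.

```python
import re
from urllib.parse import urlparse

# Shape of a modern DOI: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

# Hypothetical, deliberately short list of academic host suffixes.
ACADEMIC_SUFFIXES = (".edu", ".ac.uk", ".edu.au")


def looks_like_doi(value: str) -> bool:
    """Syntax check only; a well-formed DOI may still fail to resolve."""
    return bool(DOI_RE.match(value.strip()))


def classify_domain(url: str) -> str:
    """Return 'academic', 'commercial', 'other', or 'invalid' for a URL."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return "invalid"
    if host.endswith(ACADEMIC_SUFFIXES):
        return "academic"
    if host.endswith(".com"):
        return "commercial"
    return "other"


print(looks_like_doi("10.1000/xyz123"))
print(classify_domain("https://www.mit.edu/research"))
```

A syntax check like this is cheap enough to run before any network request, so malformed identifiers can be reported without a lookup.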
AI Conversation Analysis
- Cognitive Intent Classification: Uses configurable LLM to classify user prompts as Delegation vs. Scaffolding
- Multi-Provider Support: Gemini, OpenAI GPT, Anthropic Claude, OpenRouter unified API
- Learning Pattern Recognition: Analyzes conversation flow for active learning behaviors
- Session Quality Scoring: Provides learning quality assessment (0-100)
- Platform Support: ChatGPT, Claude, Bard, and generic conversation formats
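The Delegation vs. Scaffolding distinction and the 0-100 session score can be illustrated without an LLM. The sketch below swaps the configurable LLM for a plain keyword heuristic, so it is only a stand-in for the technique described above; the cue lists and the functions `classify_prompt` and `session_score` are hypothetical.

```python
# Hypothetical cue phrases; an LLM classifier would replace these lists.
DELEGATION_CUES = ("write my", "write me", "do my", "give me the answer", "solve this for me")
SCAFFOLDING_CUES = ("explain", "why", "how does", "can you check", "help me understand")


def classify_prompt(prompt: str) -> str:
    """Label a user prompt as 'delegation', 'scaffolding', or 'unclear'."""
    text = prompt.lower()
    if any(cue in text for cue in DELEGATION_CUES):
        return "delegation"
    if any(cue in text for cue in SCAFFOLDING_CUES):
        return "scaffolding"
    return "unclear"


def session_score(prompts: list[str]) -> int:
    """Share of classifiable prompts that are scaffolding, scaled to 0-100."""
    labels = [classify_prompt(p) for p in prompts]
    rated = [label for label in labels if label != "unclear"]
    if not rated:
        return 0
    return round(100 * rated.count("scaffolding") / len(rated))


session = ["Write my essay", "Explain photosynthesis", "Why is the sky blue?", "Hello"]
print(session_score(session))
```

Ignoring unclear prompts keeps greetings and small talk from diluting the score.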
Security & Privacy
- Input Sanitization: Detects and prevents malicious content, hidden text, and LLM jailbreaks
- URL Protection: SSRF prevention with academic domain whitelisting
- Content Validation: Size limits, encoding validation, and integrity checking
- Privacy First: No data persistence, user-controlled processing, ephemeral analysis
- Safe Processing: Static analysis only, no code execution or external script running
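As an illustration of the SSRF prevention mentioned above, the following sketch rejects URLs that use a non-HTTP scheme or point at loopback, private, or link-local addresses. It is a simplified assumption of how such a guard might look, not Extracta's code: it does not resolve hostnames or model the academic domain whitelist, and `is_safe_url` is a hypothetical name.

```python
import ipaddress
from urllib.parse import urlparse

BLOCKED_HOSTNAMES = {"localhost"}


def is_safe_url(url: str) -> bool:
    """Return False for URLs that should never be fetched on a user's behalf."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname
    if not host or host.lower() in BLOCKED_HOSTNAMES:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A fuller check would resolve the name and
        # apply the same tests to every resulting address.
        return True
    return not (
        address.is_private
        or address.is_loopback
        or address.is_link_local
        or address.is_reserved
        or address.is_multicast
    )


print(is_safe_url("https://example.org/paper.pdf"))
print(is_safe_url("http://169.254.169.254/latest/meta-data"))
```

Checking the address class rather than a fixed deny list also covers IPv6 literals such as `[::1]`.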
Repository Analysis
- WordPress Detection: Identifies WordPress projects and analyzes themes/plugins
- Code Quality Assessment: Evaluates repository structure and practices
- File Type Analysis: Comprehensive analysis of all repository contents
Integrity Scoring
- Academic Integrity Score: 0-100 scale based on multiple validation criteria
- Detailed Reporting: Specific issues and recommendations
- Pattern Detection: Identifies suspicious citation and reference patterns
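One simple way to turn several validation findings into a single 0-100 score is to subtract a penalty per issue. The sketch below shows that idea only; the penalty values, the `Finding` class, and the `integrity_score` function are hypothetical, and Extracta's actual scoring criteria are not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A counted issue of one kind (hypothetical structure)."""
    kind: str
    count: int


# Hypothetical penalty per occurrence; unknown kinds cost nothing.
PENALTIES = {"uncited_reference": 5, "missing_reference": 10, "broken_url": 3}


def integrity_score(findings: list[Finding]) -> int:
    """Start from 100 and subtract penalties, never dropping below 0."""
    penalty = sum(PENALTIES.get(f.kind, 0) * f.count for f in findings)
    return max(0, 100 - penalty)


report = [Finding("uncited_reference", 2), Finding("broken_url", 1)]
print(integrity_score(report))
```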
Development
Setup
# Clone repository
git clone https://github.com/michaelborck-education/extracta.git
cd extracta
# Create virtual environment
uv venv
source .venv/bin/activate
# Install dependencies
uv pip install -e ".[dev]"
Testing
# Run tests
pytest
# With coverage
pytest --cov=extracta
Linting and Type Checking
# Lint with ruff
ruff check .
# Type check with mypy
mypy extracta
# Format code
ruff format .
Building and Publishing
# Build package
uv build
# Publish to PyPI
uv venv  # skip if the virtual environment already exists
source .venv/bin/activate
uv pip install twine
twine upload dist/* --repository pypi
Project Structure
extracta/
├── extracta/
│   ├── lenses/                    # Content extraction modules
│   │   ├── audio_lens/            # Audio file processing
│   │   ├── video_lens/            # Video file processing
│   │   ├── image_lens/            # Image processing with OCR
│   │   ├── document_lens/         # Text & Office document processing
│   │   ├── presentation_lens/     # Presentation file analysis
│   │   ├── repo_lens/             # Repository-level analysis
│   │   └── base_lens.py           # Common lens interface
│   ├── analyzers/                 # Content analysis modules
│   │   ├── text_analyzer/         # Text quality and readability
│   │   ├── image_analyzer/        # Image quality assessment
│   │   ├── citation_analyzer/     # Citation-reference validation
│   │   ├── reference_analyzer/    # Bibliography quality assessment
│   │   ├── url_analyzer/          # URL validation and reputation
│   │   └── base_analyzer.py       # Common analyzer interface
│   ├── grading/                   # Assessment and grading
│   │   ├── rubric_manager/        # Rubric creation and management
│   │   └── feedback_generator.py  # AI-powered feedback
│   ├── orchestration/             # Workflow management
│   ├── shared/                    # Common utilities
│   └── cli/                       # Command-line interface
├── tests/                         # Test suite
├── docs/                          # Documentation
├── examples/                      # Usage examples
├── pyproject.toml                 # Package configuration
└── README.md                      # This file
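The layout above separates a common analyzer interface from concrete analyzers. A minimal sketch of that pattern, assuming an abstract base class and a name-to-class registry, is shown below. The contents of `base_analyzer.py` are not shown in this README, so `BaseAnalyzer`, `WordCountAnalyzer`, `REGISTRY`, and `run` are hypothetical; only the `analyze(content, mode=...)` call shape mirrors the usage examples.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseAnalyzer(ABC):
    """Hypothetical common interface; the real base_analyzer.py may differ."""

    @abstractmethod
    def analyze(self, content: str, mode: str = "assessment") -> dict[str, Any]:
        """Return analysis results for the given content."""


class WordCountAnalyzer(BaseAnalyzer):
    """Toy analyzer used only to show how a plugin would slot in."""

    def analyze(self, content: str, mode: str = "assessment") -> dict[str, Any]:
        return {"mode": mode, "word_count": len(content.split())}


# Registry mapping a short name to an analyzer class.
REGISTRY: dict[str, type[BaseAnalyzer]] = {"word_count": WordCountAnalyzer}


def run(name: str, content: str, mode: str = "assessment") -> dict[str, Any]:
    """Look up an analyzer by name, instantiate it, and run it."""
    return REGISTRY[name]().analyze(content, mode=mode)


print(run("word_count", "one two three", mode="research"))
```

With this shape, adding a new analyzer means writing one subclass and one registry entry, leaving callers unchanged.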
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Run the test suite
- Submit a pull request
License
MIT License - see LICENSE file for details.
Current Status & Roadmap
Implemented Features
- Text Analysis: Readability, sentiment, vocabulary, quality metrics
- Image Analysis: OCR, quality assessment, accessibility
- Document Processing: PDF, DOCX, Office docs (PPTX, Excel, CSV)
- Citation Validation: Citation-reference relationships, academic integrity
- Reference Analysis: Bibliography quality, DOI validation, CrossRef integration
- URL Validation: Accessibility checking, domain reputation, robots.txt
- AI Conversation Analysis: Cognitive intent classification, learning pattern recognition
- Repository Analysis: GitHub repo analysis, WordPress detection
- Rubric System: Custom rubrics, structured assessment
- CLI Interface: Multiple commands for different analysis types
- Web API: REST API for integration
- Python API: Programmatic access
In Development
- Audio Lens: Speech-to-text, audio quality analysis
- Video Lens: Frame analysis, transcript processing
- Code Analyzer: Code quality metrics, best practices
- Screenshot Integration: Visual URL validation
- Wayback Machine: Archive URL checking
Future Enhancements
- URL Conversation Input: Direct analysis of conversations from URLs (ChatGPT share links, etc.)
- GUI Application: Web-based interface
- LMS Integration: Canvas, Blackboard, Moodle
- Advanced ML Models: Fine-tuned for educational content
- Collaborative Features: Multi-user assessment workflows
- Plugin Architecture: Custom lenses and analyzers