# content-dedup

Intelligent content deduplication and clustering toolkit with multilingual support, designed for efficient pipeline integration and production use.
py-content-dedup is a powerful Python library and command-line tool that automatically detects and groups similar content items using advanced text similarity algorithms. It supports multiple languages, handles mixed-language content (like Chinese-English), and provides both programmatic API and CLI interfaces for seamless integration into data processing pipelines.
## ✨ Features
- 🌍 Multilingual Support: Auto-detection and processing of Chinese, English, Japanese, and mixed-language content
- 🔗 Smart Clustering: Groups similar content using TF-IDF, cosine similarity, and advanced text processing
- 🛠️ Pipeline Ready: JSON output format with progress logging to stderr (stdout stays clean)
- ⚡ High Performance: Optimized field mapping with minimal, balanced, and full mapping options
- 🎯 Flexible Output: Support for both full clusters and representative-only output
- 📊 Detailed Reports: Comprehensive statistics and language distribution analysis
- 🔧 Streamlined API: Simplified field mapping focused on core deduplication functionality
## 🚀 Quick Start

### Installation

#### Basic Installation

```bash
pip install content-dedup
```

#### Full Installation (Recommended)

For complete multilingual support including Chinese text processing:

```bash
pip install content-dedup[full]
```

#### Development Installation

```bash
git clone https://github.com/changyy/py-content-dedup.git
cd py-content-dedup
pip install -e .[dev,full]
```
### Command Line Usage

```bash
# Basic deduplication with auto language detection
content-dedup input.jsonl --output results.json

# Specify language and similarity threshold
content-dedup input.jsonl --language zh --similarity 0.85 --output clusters.json

# Use predefined field mapping for different data formats
content-dedup news_data.jsonl --field-mapping news --output results.json
content-dedup blog_data.jsonl --field-mapping blog --output results.json
content-dedup social_data.jsonl --field-mapping social --output results.json

# Custom field mapping with optimized fields
content-dedup data.jsonl \
    --title-field headline \
    --content-fields body,summary \
    --id-field permalink \
    --category-field tags \
    --output results.json

# Multiple content fields with custom separator
content-dedup data.jsonl \
    --content-fields description,content,text \
    --content-separator ' | ' \
    --output results.json

# Minimal mapping for high performance
content-dedup data.jsonl \
    --title-field headline \
    --content-fields body,summary \
    --id-field permalink \
    --minimal \
    --output results.json

# Pipeline usage with representatives only
content-dedup input.jsonl --format representatives --field-mapping news | jq '.[] | .title'

# Handle missing fields gracefully
content-dedup incomplete_data.jsonl --ignore-missing --title-field title,name --output results.json

# Save both clusters and representatives
content-dedup input.jsonl --output clusters.json --representatives reps.jsonl

# Verbose mode with progress logging
content-dedup input.jsonl --output results.json --verbose --log-file progress.log
```
### Python API Usage

```python
from content_dedup import ContentDeduplicator

# Initialize deduplicator
deduplicator = ContentDeduplicator(
    language='auto',               # Auto-detect language
    similarity_threshold=0.8,      # Similarity threshold for clustering
    mixed_language_threshold=0.3,  # Mixed language detection threshold
    field_mapping='news'           # Use predefined field mapping
)

# Load and process data
deduplicator.load_jsonl('input.jsonl')
clusters = deduplicator.cluster_and_deduplicate()

# Get representatives only
representatives = deduplicator.get_representatives()

# Generate detailed report
report = deduplicator.generate_report()
print(f"Processed {report['basic_statistics']['original_content_count']} items into {len(clusters)} clusters")
```
## 🗂️ Flexible Field Mapping

py-content-dedup supports flexible field mapping to handle various JSONL formats without requiring data transformation. Three mapping types balance functionality and simplicity:

### 🎯 Field Mapping Options

The library offers optimized field mapping focused on core deduplication functionality:

- Standard Mapping: full field coverage, including title, content, ID, category, and publish time
- Balanced Mapping (5 fields): the recommended balance of performance and quality
- Minimal Mapping (3 fields): maximum performance, using only title, content, and ID for high-speed processing
#### Using Predefined Mappings

```python
from content_dedup import ContentDeduplicator

# Standard mappings
deduplicator = ContentDeduplicator(field_mapping='news')
deduplicator = ContentDeduplicator(field_mapping='blog')

# Balanced mappings (5 essential fields) - RECOMMENDED
deduplicator = ContentDeduplicator(field_mapping='balanced-news')
deduplicator = ContentDeduplicator(field_mapping='balanced-blog')

# Minimal mappings (3 core fields) - HIGH PERFORMANCE
deduplicator = ContentDeduplicator(field_mapping='minimal-news')
deduplicator = ContentDeduplicator(field_mapping='minimal-blog')

# Available presets: 'news', 'blog', 'social', 'academic', 'ecommerce'
# Each with 'balanced-' and 'minimal-' variants
```
#### Custom Field Mapping

```python
# Full mapping (all features)
from content_dedup.config.field_mapping import create_custom_mapping

full_mapping = create_custom_mapping(
    title_field='headline',
    content_fields=['body', 'summary'],
    id_field='permalink',
    category_field='tags',
    publish_time_field='published_at',
    content_separator=' | '
)

# Balanced mapping (recommended) - core + important fields
from content_dedup.config.field_mapping_balanced import create_balanced_custom_mapping

balanced_mapping = create_balanced_custom_mapping(
    title_field='headline',
    content_fields=['body', 'summary'],
    id_field='permalink',
    category_field='tags',
    publish_time_field='published_at'
)

# Minimal mapping (performance) - core fields only
from content_dedup.config.field_mapping_minimal import create_minimal_custom_mapping

minimal_mapping = create_minimal_custom_mapping(
    title_field='headline',
    content_fields=['body', 'summary'],
    id_field='permalink'
)

# Use with ContentDeduplicator
deduplicator = ContentDeduplicator(field_mapping=balanced_mapping)  # Recommended
```
#### CLI Field Mapping

```bash
# Use predefined mappings
content-dedup data.jsonl --field-mapping news --output results.json
content-dedup data.jsonl --field-mapping balanced-news --output results.json  # Recommended
content-dedup data.jsonl --field-mapping minimal-news --output results.json   # High performance

# Custom field specification (creates balanced mapping)
content-dedup data.jsonl \
    --title-field headline \
    --content-fields body,summary \
    --id-field permalink \
    --category-field tags \
    --output results.json

# Minimal specification for performance
content-dedup data.jsonl \
    --title-field headline \
    --content-fields body,summary \
    --id-field permalink \
    --minimal \
    --output results.json
```
## 📊 Input Format

### Standard Format

Input should be in JSONL format, one JSON object per line, with the following structure:

```json
{
  "title": "Article title",
  "content_text": "Full article content...",
  "url": "https://example.com/article",
  "category": ["news", "technology"],
  "publish_time": "2025/01/15 10:30:00"
}
```
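As a rough illustration of how such input can be read, here is a minimal stdlib-only loader that tolerates blank and malformed lines. The library's own `load_jsonl()` handles this internally; this sketch is only for understanding the format, not a replacement:

```python
import json

def load_jsonl(path):
    """Load JSONL records, skipping blank or malformed lines.

    Illustrative only: content-dedup's loader performs this step
    (plus field mapping) internally.
    """
    items = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                items.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # tolerate malformed lines
    return items
```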
### Custom Formats

With field mapping, you can use any JSONL structure:

```json
// News format
{
  "headline": "Breaking News Title",
  "body": "News content...",
  "summary": "Brief summary...",
  "permalink": "https://news.site/article",
  "tags": ["breaking", "politics"]
}

// Blog format
{
  "post_title": "Blog Post Title",
  "content": "Blog content...",
  "description": "Post description...",
  "blog_url": "https://blog.site/post",
  "categories": ["tech", "tutorial"]
}

// Social format
{
  "username": "social_user",
  "message": "Social media post content...",
  "post_url": "https://social.site/post/123",
  "timestamp": "2025-01-15T10:30:00Z"
}
```
## 📋 Output Formats

### Clusters Format (Default)

```json
{
  "metadata": {
    "total_clusters": 150,
    "original_count": 1000,
    "language_settings": "auto",
    "similarity_threshold": 0.8,
    "compression_ratio": 0.15
  },
  "clusters": [
    {
      "cluster_id": "cluster_0001",
      "representative": { /* ContentItem */ },
      "members": [ /* Array of ContentItems */ ],
      "member_count": 5,
      "dominant_language": "zh",
      "language_distribution": {"zh": 0.8, "en": 0.2}
    }
  ]
}
```
### Representatives Format

```json
[
  { /* Representative ContentItem 1 */ },
  { /* Representative ContentItem 2 */ },
  { /* Representative ContentItem 3 */ }
]
```
### Report Format

```json
{
  "processing_settings": {
    "language_mode": "auto",
    "similarity_threshold": 0.8
  },
  "basic_statistics": {
    "original_content_count": 1000,
    "cluster_count": 150,
    "compression_ratio": "85.00%"
  },
  "language_distribution": {
    "original_language_stats": {"zh": 800, "en": 150, "mixed": 50},
    "mixed_language_cluster_count": 12
  }
}
```
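The two compression figures above can be related as follows. Assuming the metadata's `compression_ratio` is `cluster_count / original_count` and the report's percentage is the complementary reduction (a derivation inferred from the example numbers, not taken from the library source):

```python
def compression_figures(original_count, cluster_count):
    """Relate the cluster-format ratio (0.15) to the report-format
    percentage ("85.00%"). Assumed derivation, for illustration."""
    ratio = cluster_count / original_count  # e.g. 150 / 1000 = 0.15
    reduction = (1 - ratio) * 100           # e.g. 85.00% of items removed
    return ratio, f"{reduction:.2f}%"
```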
## ⚙️ Configuration Options

### CLI Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| `--language` | str | `auto` | Language mode: `auto`, `zh`, `en`, `mixed` |
| `--similarity` | float | `0.8` | Similarity threshold for clustering |
| `--mixed-threshold` | float | `0.3` | Mixed language detection threshold |
| `--format` | str | `clusters` | Output format: `clusters`, `representatives`, `report` |
| `--output` | str | - | Output file (JSON). If not specified, outputs to stdout |
| `--representatives` | str | - | Separate file for representatives (JSONL) |
| `--pretty` | bool | `False` | Pretty print JSON output |
| `--verbose` | bool | `False` | Enable verbose logging |
| `--log-file` | str | - | Log file path (default: stderr) |
| `--no-progress` | bool | `False` | Disable progress reporting |
### Language Processing

The tool automatically detects and handles:

- Chinese (zh): Uses jieba for word segmentation (requires the `[chinese]` or `[full]` installation)
- English (en): Uses whitespace and punctuation-based tokenization
- Mixed Language: Intelligently separates and processes Chinese-English mixed content
- Auto Detection: Analyzes character distribution and content patterns
- Enhanced Detection: Uses the langdetect library for improved accuracy (requires the `[langdetect]` or `[full]` installation)

Note: For optimal Chinese text processing, install with `pip install content-dedup[chinese]` or `pip install content-dedup[full]`. The basic installation provides fallback processing for Chinese text but may have reduced accuracy.
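To make "analyzes character distribution" concrete, here is a rough sketch of that idea using the CJK Unified Ideographs range. The thresholds and branching are illustrative assumptions, not the library's actual detection logic:

```python
def detect_language(text, mixed_threshold=0.3):
    """Guess zh / en / mixed from character distribution.

    Illustrative sketch of 'auto' mode: counts CJK ideographs vs
    ASCII letters and compares their ratio against the threshold.
    """
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = cjk + latin
    if total == 0:
        return "unknown"
    zh_ratio = cjk / total
    if zh_ratio >= 1 - mixed_threshold:
        return "zh"
    if zh_ratio <= mixed_threshold:
        return "en"
    return "mixed"
```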
## 🔧 Algorithm Details

### Similarity Calculation

The tool uses a multi-dimensional similarity approach:
- Title Similarity (40%): Sequence-based comparison of titles
- Content Similarity (50%): TF-IDF cosine similarity of processed content
- Length Similarity (10%): Relative length comparison
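The 40/50/10 blend above can be sketched as follows. Note that this stand-in scores both title and content with `difflib.SequenceMatcher` rather than the library's TF-IDF cosine similarity, so treat it as an illustration of the weighting only:

```python
from difflib import SequenceMatcher

def combined_similarity(a, b):
    """Weighted blend of title, content, and length similarity
    (40% / 50% / 10%). SequenceMatcher is a simplified stand-in
    for the library's TF-IDF content similarity."""
    title_sim = SequenceMatcher(None, a["title"], b["title"]).ratio()
    content_sim = SequenceMatcher(None, a["content_text"], b["content_text"]).ratio()
    len_a, len_b = len(a["content_text"]), len(b["content_text"])
    length_sim = min(len_a, len_b) / max(len_a, len_b) if max(len_a, len_b) else 1.0
    return 0.4 * title_sim + 0.5 * content_sim + 0.1 * length_sim
```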
### Clustering Process

1. Exact Duplicate Removal: URL and title hash-based deduplication
2. Language Detection: Character-based and statistical language identification
3. Text Processing: Language-specific tokenization and stop word removal
4. Similarity Matrix: Efficient batch computation of pairwise similarities
5. Clustering: Connected components algorithm for grouping similar items
6. Representative Selection: Multi-factor scoring for optimal representative selection
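The connected-components step can be sketched with a small union-find: any chain of pairwise similarities above the threshold pulls items into the same cluster. This is an illustrative implementation, not the library's code; `sim` is a pairwise similarity function supplied by the caller:

```python
def cluster_by_similarity(items, sim, threshold=0.8):
    """Group items into connected components of the similarity graph."""
    parent = list(range(len(items)))

    def find(i):
        # Find the root of i's component, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair whose similarity clears the threshold.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if sim(items[i], items[j]) >= threshold:
                parent[find(i)] = find(j)

    # Collect members by component root.
    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(items[i])
    return list(clusters.values())
```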
### Representative Selection Criteria
Representatives are selected based on:
- Content Quality (30%): Length and completeness
- Title Quality (20%): Optimal title length and clarity
- Source Reliability (20%): Domain-based credibility scoring
- Timeliness (20%): Publication time freshness
- Language Consistency (10%): Alignment with cluster's dominant language
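A sketch of how such a 30/20/20/20/10 weighting might combine; the individual sub-scores here (`source_score`, `recency_score`, the title-length band) are hypothetical stand-ins rather than the library's actual heuristics:

```python
def representative_score(item, cluster_language):
    """Weighted score matching the 30/20/20/20/10 split described above.
    Sub-score definitions are illustrative assumptions."""
    content_q = min(len(item.get("content_text", "")) / 1000, 1.0)
    title_len = len(item.get("title", ""))
    title_q = 1.0 if 10 <= title_len <= 80 else 0.5   # assumed "optimal" band
    source_q = item.get("source_score", 0.5)           # assumed precomputed
    recency_q = item.get("recency_score", 0.5)         # assumed precomputed
    lang_q = 1.0 if item.get("language") == cluster_language else 0.0
    return (0.3 * content_q + 0.2 * title_q + 0.2 * source_q
            + 0.2 * recency_q + 0.1 * lang_q)
```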
## 🧪 Examples

### Processing News Articles

```bash
# Process Chinese news with high precision
content-dedup chinese_news.jsonl \
    --language zh \
    --similarity 0.9 \
    --output clusters.json \
    --representatives news_reps.jsonl \
    --verbose

# Pipeline integration for positive news filtering
content-dedup all_news.jsonl --format representatives | \
    python filter_positive_news.py | \
    python generate_summary.py > final_output.json
```
### Mixed Language Content

```bash
# Handle Chinese-English mixed content
content-dedup mixed_content.jsonl \
    --language auto \
    --mixed-threshold 0.2 \
    --similarity 0.85 \
    --format clusters \
    --pretty
```
### Batch Processing

```bash
# Process multiple files
for file in data/*.jsonl; do
    echo "Processing $file..."
    content-dedup "$file" \
        --output "results/$(basename "$file" .jsonl)_clusters.json" \
        --log-file "logs/$(basename "$file" .jsonl).log"
done
```
### Language-Specific Processing

```bash
# For Chinese content (requires [chinese] installation)
content-dedup chinese_articles.jsonl --language zh --output chinese_clusters.json

# For English content
content-dedup english_articles.jsonl --language en --output english_clusters.json

# For mixed language content (auto-detect)
content-dedup mixed_articles.jsonl --language auto --output mixed_clusters.json
```
## 📚 API Reference

### ContentDeduplicator Class

```python
class ContentDeduplicator:
    def __init__(self,
                 language: str = 'auto',
                 similarity_threshold: float = 0.8,
                 mixed_language_threshold: float = 0.3,
                 field_mapping: Union[str, Any, None] = None)

    def load_jsonl(self, file_path: str) -> None
    def cluster_and_deduplicate(self) -> List[FlexibleContentCluster]
    def generate_report(self) -> Dict[str, Any]
    def save_results(self, output_path: str, format: str = 'clusters') -> None
    def get_representatives(self) -> List[FlexibleContentItem]
    def get_all_clusters(self) -> List[FlexibleContentCluster]
```
### FlexibleContentCluster Class

```python
@dataclass
class FlexibleContentCluster:
    representative: FlexibleContentItem
    members: List[FlexibleContentItem]
    cluster_id: str
    dominant_language: str
    language_distribution: Dict[str, float]
    similarity_scores: Dict[str, float]
```
### FlexibleContentItem Class

```python
@dataclass
class FlexibleContentItem:
    original_data: Dict[str, Any]
    working_fields: Dict[str, Any]
    language: Optional[str] = None
    field_mapping_info: Dict[str, str] = field(default_factory=dict)

    @property
    def title(self) -> str

    @property
    def content_text(self) -> str

    @property
    def url(self) -> str

    def get_working_field(self, field_name: str, default=None)
    def to_dict(self, mode: str = "working", include_metadata: bool = False) -> Dict[str, Any]
```
## 🛠️ Development

### Setup Development Environment

```bash
git clone https://github.com/changyy/py-content-dedup.git
cd py-content-dedup

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies with full features
pip install -e .[dev,full]

# Install pre-commit hooks
pre-commit install
```
### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=content_dedup

# Run specific test file
pytest tests/test_deduplicator.py -v

# Run only fast tests (exclude slow integration tests)
pytest -m "not slow"
```
### Code Quality

```bash
# Format code
black content_dedup/

# Lint code
flake8 content_dedup/

# Type checking
mypy content_dedup/
```
## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
### Development Guidelines

- Install development dependencies: `pip install -e .[dev,full]`
- Run tests before submitting: `pytest`
- Follow code formatting: `black content_dedup/`
- Add tests for new features
- Update documentation as needed
## 💡 Troubleshooting

### Common Issues

Issue: ImportError for jieba when processing Chinese text

```bash
# Solution: Install Chinese language support
pip install content-dedup[chinese]
```

Issue: Poor language detection accuracy

```bash
# Solution: Install enhanced language detection
pip install content-dedup[langdetect]
```

Issue: Reduced performance with English text

```bash
# Solution: Install advanced English processing
pip install content-dedup[english]
```

Issue: Missing features or dependencies

```bash
# Solution: Install all features
pip install content-dedup[all]
```
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- scikit-learn for machine learning algorithms
- jieba for Chinese word segmentation
- langdetect for language detection
- NLTK for natural language processing
## 📞 Support

Made with ❤️ for efficient content processing