A Python package for detecting irrelevant content in text and HTML.
Irrelevant Content Detection
Irrelevant Content Detection is a Python package for detecting irrelevant content in text and HTML and removing it. It uses machine learning techniques such as TF-IDF and KMeans clustering to identify non-relevant passages in documents.
Installation
You can install the package using pip:
pip install irrelevant-content-detection
Alternatively, you can clone the repository and install it locally:
git clone https://github.com/berkbirkan/irrelevant-content-detection.git
cd irrelevant-content-detection
pip install .
Usage
The package provides several functions to detect and clean irrelevant content from text and HTML.
Calculate Relevance Scores
The calculate_relevance_scores function calculates the TF-IDF scores for a list of texts.
from irrelevant_content_detection import calculate_relevance_scores
texts = [
    "Python is a programming language.",
    "This text is not relevant.",
]
tfidf_scores = calculate_relevance_scores(texts)
print(tfidf_scores)
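To make the idea of a TF-IDF score concrete, here is a minimal standard-library sketch of one common formulation (term count times a smoothed inverse document frequency, followed by L2 normalization). The helper names `tokenize` and `tfidf_vectors` are hypothetical, and the package's own `calculate_relevance_scores` may compute and return its scores differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text (hypothetical helper, not the package's code)."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: number of texts that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so that long and short texts are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


texts = [
    "Python is a programming language.",
    "This text is not relevant.",
]
for text, vector in zip(texts, tfidf_vectors(texts)):
    print(text, vector)
```

In this formulation a word that occurs in every text, such as "is", receives a lower weight than a word that occurs in only one text, such as "python".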
Detect Irrelevant Content in Text
The detect_irrelevant_contents function detects irrelevant content in a list of texts.
from irrelevant_content_detection import detect_irrelevant_contents
texts = [
    "Python is a programming language.",
    "Python is great for data science.",
    "This text is not relevant.",
    "Machine learning with Python is fun.",
    "Unrelated text here.",
]
irrelevant_texts = detect_irrelevant_contents(texts)
print(irrelevant_texts)
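One way TF-IDF and KMeans can be combined to flag outliers is to cluster the TF-IDF vectors into two groups and treat the smaller group as irrelevant. The sketch below does this with the standard library only. The helper names, the choice of two clusters, the deterministic initialization, and the "smaller cluster is irrelevant" rule are all assumptions made for illustration; the package's actual detection logic may differ.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Build L2-normalized TF-IDF vectors as {term: weight} dicts."""
    tokenized = [re.findall(r"[a-z0-9]+", t.lower()) for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in Counter(tokens).items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vector in vectors:
        for term, weight in vector.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def two_means(vectors, max_iter=20):
    """Tiny deterministic KMeans with k=2; returns one cluster label per vector."""
    if len(vectors) < 2:
        return [0] * len(vectors)
    # Deterministic start: the first vector and the vector farthest from it.
    far = max(range(len(vectors)), key=lambda i: distance(vectors[0], vectors[i]))
    centroids = [vectors[0], vectors[far]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min((0, 1), key=lambda c: distance(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # Assignments stopped changing, so the clustering has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in (0, 1):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels


def flag_minority_cluster(texts):
    """Return the texts in the smaller cluster (an assumed heuristic)."""
    labels = two_means(tfidf_vectors(texts))
    minority = min((0, 1), key=labels.count)
    return [t for t, label in zip(texts, labels) if label == minority]


texts = [
    "Python is a programming language.",
    "Python is great for data science.",
    "This text is not relevant.",
    "Machine learning with Python is fun.",
    "Unrelated text here.",
]
print(flag_minority_cluster(texts))
```

With only a handful of short texts the clusters are sensitive to individual words, so the flagged set from a sketch like this is best treated as a suggestion to review rather than a verdict.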
Clean Irrelevant Content from Text
The clean_irrelevant_contents function removes irrelevant content from a list of texts.
from irrelevant_content_detection import clean_irrelevant_contents
texts = [
    "Python is a programming language.",
    "Python is great for data science.",
    "This text is not relevant.",
    "Machine learning with Python is fun.",
    "Unrelated text here.",
]
cleaned_texts = clean_irrelevant_contents(texts)
print(cleaned_texts)
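Conceptually, cleaning is detection followed by filtering. The sketch below shows only the filtering step, under the assumption that the detector returns the flagged texts themselves. The function `remove_flagged` and the hard-coded `flagged` list are hypothetical and stand in for whatever the package produces internally.

```python
def remove_flagged(texts, flagged):
    """Return texts in their original order, without the flagged entries."""
    flagged_set = set(flagged)
    return [text for text in texts if text not in flagged_set]


texts = [
    "Python is a programming language.",
    "Python is great for data science.",
    "This text is not relevant.",
    "Machine learning with Python is fun.",
    "Unrelated text here.",
]
# Hypothetical detector output, hard-coded for illustration.
flagged = ["This text is not relevant.", "Unrelated text here."]
print(remove_flagged(texts, flagged))
```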
Extract Text from HTML
The extract_text_from_html function extracts all text from an HTML string.
from irrelevant_content_detection import extract_text_from_html
html = """
<html>
<body>
<p>Python is a programming language.</p>
<p>This text is not relevant.</p>
</body>
</html>
"""
texts = extract_text_from_html(html)
print(texts)
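For readers curious how text extraction of this kind can work, here is a minimal sketch built on the standard library's `html.parser`. The class `TextCollector` and the function `html_to_texts` are hypothetical names, and skipping `script` and `style` content is a design choice of this sketch; the package's `extract_text_from_html` may rely on a different parser and return a different structure.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML document (hypothetical helper)."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering a script or style element: ignore its contents.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is visible and not just whitespace.
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_texts(html):
    """Return the stripped text of each text node, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


html = """
<html>
<body>
<p>Python is a programming language.</p>
<p>This text is not relevant.</p>
</body>
</html>
"""
print(html_to_texts(html))
# ['Python is a programming language.', 'This text is not relevant.']
```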
Detect Irrelevant Content in HTML
The detect_irrelevant_html function detects irrelevant content in an HTML string.
from irrelevant_content_detection import detect_irrelevant_html
html = """
<html>
<body>
<p>Python is a programming language.</p>
<p>Python is great for data science.</p>
<p>This text is not relevant.</p>
<p>Machine learning with Python is fun.</p>
<p>Unrelated text here.</p>
</body>
</html>
"""
irrelevant_html = detect_irrelevant_html(html)
print(irrelevant_html)
Clean Irrelevant Content from HTML
The clean_irrelevant_html function removes irrelevant content from an HTML string.
from irrelevant_content_detection import clean_irrelevant_html
html = """
<html>
<body>
<p>Python is a programming language.</p>
<p>Python is great for data science.</p>
<p>This text is not relevant.</p>
<p>Machine learning with Python is fun.</p>
<p>Unrelated text here.</p>
</body>
</html>
"""
cleaned_html = clean_irrelevant_html(html)
print(cleaned_html)
Testing
To run the tests, use unittest, which is included in the Python standard library:
python -m unittest discover
Or you can run the test file directly:
python test_detector.py
Contributing
Contributions are welcome! Please follow these steps to contribute:
- Fork the repository.
- Create a new branch with your feature or bugfix.
- Commit your changes.
- Push to your branch.
- Create a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for more details.
File details
Details for the file irrelevant_content_detection-0.3.tar.gz.
File metadata
- Download URL: irrelevant_content_detection-0.3.tar.gz
- Upload date:
- Size: 4.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 563087a31a6a9d234552a2e3dc09b531e617b7980f7c0f2c99e1008ed8cbb293 |
| MD5 | 9fbdb0e44171282fe899b37292b12e45 |
| BLAKE2b-256 | e2c544f1b2621caee261df3dc77f0bd7c2590c75e3e7a34990cd14625ff9b420 |
File details
Details for the file irrelevant_content_detection-0.3-py3-none-any.whl.
File metadata
- Download URL: irrelevant_content_detection-0.3-py3-none-any.whl
- Upload date:
- Size: 4.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 677306f342b5f004d73733e9e3d71074c7bb70743c324ab79cb7a07fe04730a0 |
| MD5 | 604e5477a49a0a3a07e58f90d4b70d7f |
| BLAKE2b-256 | 69e9b7bdb57c7c4af80e6d6ec8a3e46ffa9a425705222bdc808dadea20cb1688 |