A tool to find relevant Wikipedia articles for a given paragraph and score them.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

KS Domain Tagger

A Python tool to analyze a given paragraph, identify relevant Wikipedia articles, and score their relevance. It uses keyword extraction, Wikipedia API searches, and fuzzy string matching to determine the most appropriate articles.

Features

Keyword Extraction: Identifies key terms, bigrams, and trigrams from the input text using TF-IDF and NLTK.
Wikipedia Integration: Searches Wikipedia for articles based on extracted keywords.
Content Fetching: Retrieves textual content (paragraphs) from Wikipedia articles.
Relevance Scoring: Compares the input paragraph with Wikipedia content using fuzzy matching (rapidfuzz) and normalizes scores using softmax.
Two-Pass Search (Optional): Can perform a second pass by exploring links from initially matched Wikipedia pages for a more comprehensive search.
Paragraph Validation: Checks input paragraph length and cleans it by removing stop words.
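As a rough illustration of the keyword extraction feature, the sketch below collects single words, bigrams, and trigrams after stop-word removal. It is not the package's code: it uses only the standard library, ranks terms by raw frequency as a stand-in for TF-IDF, and uses a tiny hand-written stop-word list in place of NLTK's. The names tokenize, ngrams, and extract_keywords are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK's list is far larger).
STOP_WORDS = {"a", "an", "and", "as", "for", "from", "his", "in", "is",
              "of", "the", "to", "was", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_keywords(text, top_k=5):
    """Rank unigrams, bigrams, and trigrams by raw frequency.

    Frequency is a simplification: TF-IDF would also down-weight terms
    that are common across many documents.
    """
    tokens = tokenize(text)
    counts = Counter()
    for n in (1, 2, 3):
        counts.update(ngrams(tokens, n))
    return [term for term, _ in counts.most_common(top_k)]


sample = "Economic reforms drove economic growth. Economic reforms opened trade."
keywords = extract_keywords(sample)
print(keywords)
```

Terms found this way would then serve as search queries; longer n-grams tend to be more specific than single words, which is presumably why the tool extracts all three sizes.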

Installation

pip install ks-domain-tagger

Usage

To call the judge function, import the judge module into your Python script:

from ks_domain_tagger import judge # Assuming __init__.py makes judge available

paragraph_to_analyze = """
Manmohan Singh, an economist and politician, served as the 13th Prime Minister of India
from 2004 to 2014. Renowned for his role in the economic reforms of the 1990s, Singh
was instrumental in steering the country toward liberalization, fostering economic growth,
and enhancing India's global standing. His tenure as Finance Minister in 1991, during a
time of economic crisis, marked a pivotal moment in India's transformation, with bold
measures such as trade liberalization, reducing government control, and encouraging
foreign investment. A man of humility and intellect, Singh's leadership was marked by
pragmatism and caution. He is widely respected for his integrity and efforts to balance
economic growth with social development. Despite his relatively low-key personality,
Manmohan Singh’s impact on India’s economic landscape remains indelible, solidifying his
legacy as a key architect of modern India’s economic foundation.
"""

# Basic usage
results = judge.judge(paragraph_to_analyze)
print(results)

# Usage with second pass and different thresholds
results_pass2 = judge.judge(
    para=paragraph_to_analyze,
    threshold=50,        # Initial similarity threshold for pass 1
    pass2=True,          # Enable second pass
    threshold2=53,       # Similarity threshold for pass 2
    visit_all_pages=False # For pass 2, only search links in summary sections
)
print(results_pass2)

The judge function returns a dictionary where keys are the titles of relevant Wikipedia articles and values are their softmax scores indicating relevance.
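Because the result is described as a title-to-score dictionary, ranking the articles is a matter of sorting its items. The sketch below assumes that shape; the dictionary contents are made-up placeholder values, not real output from the package, and top_articles is a hypothetical helper.

```python
# Placeholder data shaped like the described return value:
# article title -> softmax relevance score. The numbers are invented.
results = {
    "Manmohan Singh": 0.62,
    "Economic liberalisation in India": 0.27,
    "Prime Minister of India": 0.11,
}


def top_articles(scores, k=2):
    """Return the k highest-scoring (title, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


for title, score in top_articles(results):
    print(f"{title}: {score:.2f}")
```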

Dependencies

The project relies on the following Python libraries:

nltk>=3.6
scikit-learn>=1.0
requests>=2.25
beautifulsoup4>=4.9
rapidfuzz>=1.8
numpy>=1.20
termcolor>=1.1.0 (primarily for test.py)

pip installs these automatically when you install the package from PyPI.

How It Works

Input & Validation: The input paragraph is validated for length and cleaned by removing common stop words.
Keyword Extraction: Keywords (single words, bigrams, trigrams) are extracted using TF-IDF and NLTK.
Wikipedia Search (Pass 1): Keywords are used to find relevant articles via the Wikipedia API.
Content Fetching (Pass 1): Content from these articles is downloaded.
Scoring (Pass 1): The input paragraph is compared against fetched Wikipedia paragraphs using rapidfuzz. Scores are normalized using softmax.
Wikipedia Search (Pass 2 - Optional): If enabled, links from the top articles found in Pass 1 are explored to find more potentially relevant articles.
Content Fetching & Scoring (Pass 2 - Optional): Content from these new articles is fetched and scored similarly.
Output: The function returns a dictionary that maps each relevant Wikipedia article title to its relevance score.
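The scoring steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: difflib.SequenceMatcher from the standard library stands in for rapidfuzz, each article is scored by its best-matching paragraph, and the threshold is applied before softmax. The function names and the sample articles are hypothetical.

```python
import math
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale; difflib stands in for rapidfuzz here."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def softmax(values):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def score_articles(paragraph, articles, threshold=50.0):
    """Score each article by its best-matching paragraph.

    Articles below the threshold are dropped (an assumed reading of the
    threshold parameter); the remaining scores are normalized with softmax.
    """
    best = {}
    for title, paragraphs in articles.items():
        top = max((similarity(paragraph, p) for p in paragraphs), default=0.0)
        if top >= threshold:
            best[title] = top
    if not best:
        return {}
    return dict(zip(best, softmax(list(best.values()))))


query = "Singh led economic reforms in India during the 1991 crisis."
articles = {  # hypothetical fetched content: title -> list of paragraphs
    "Article A": ["Singh led the economic reforms in India during the 1991 crisis."],
    "Article B": ["A recipe for baking sourdough bread at home."],
}
scores = score_articles(query, articles, threshold=0.0)
print(scores)
```

Softmax turns the surviving similarity scores into values that sum to 1, so they are relative to the other candidates rather than absolute measures of relevance. Applied to raw 0-100 scores it is sharply peaked, so a real implementation may rescale the scores first.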

License

This project is licensed under the MIT License. See the LICENSE file for details.

This version

0.1.0

Jul 8, 2025

Download files

Source Distribution

ks_domain_tagger-0.1.0.tar.gz (10.9 kB view details)

Uploaded Jul 8, 2025 Source

Built Distribution

ks_domain_tagger-0.1.0-py3-none-any.whl (11.1 kB view details)

Uploaded Jul 8, 2025 Python 3

File details

Details for the file ks_domain_tagger-0.1.0.tar.gz.

File metadata

Download URL: ks_domain_tagger-0.1.0.tar.gz
Upload date: Jul 8, 2025
Size: 10.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for ks_domain_tagger-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1f48a5f33243b3023ef3cdac07bc2579e9ede3694c2920c944701a195a9d2807`
MD5	`b43a0da1c44e49675a5b6790ba3a6c90`
BLAKE2b-256	`6bf177f2a5f4580dbbbfac82a46d535cd429996c18d509e583dcba6eb282590c`

File details

Details for the file ks_domain_tagger-0.1.0-py3-none-any.whl.

File metadata

Download URL: ks_domain_tagger-0.1.0-py3-none-any.whl
Upload date: Jul 8, 2025
Size: 11.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for ks_domain_tagger-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ab133cc92da30991720a05f22c3d9c96f70751aa9cbb748e783f0ed4200091c0`
MD5	`d7dadd6e74aaa00fdee08761b2c7dd7a`
BLAKE2b-256	`a15c586d874c788634db37ab4793e3ea4ccbba2464d9f76a9799d07c3c1bc9e7`
