A tool to find relevant Wikipedia articles for a given paragraph and score them.

KS Domain Tagger

A Python tool to analyze a given paragraph, identify relevant Wikipedia articles, and score their relevance. It uses keyword extraction, Wikipedia API searches, and fuzzy string matching to determine the most appropriate articles.

Features

  • Keyword Extraction: Identifies key terms, bigrams, and trigrams from the input text using TF-IDF and NLTK.
  • Wikipedia Integration: Searches Wikipedia for articles based on extracted keywords.
  • Content Fetching: Retrieves textual content (paragraphs) from Wikipedia articles.
  • Relevance Scoring: Compares the input paragraph with Wikipedia content using fuzzy matching (rapidfuzz) and normalizes scores using softmax.
  • Two-Pass Search (Optional): Can perform a second pass by exploring links from initially matched Wikipedia pages for a more comprehensive search.
  • Paragraph Validation: Checks input paragraph length and cleans it by removing stop words.
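
The validation and keyword extraction features can be pictured with a small standard-library sketch. It is not the package's implementation: the stop-word list, the MIN_WORDS limit, and the helper names clean_tokens and ngram_candidates are assumptions made for illustration, and plain frequency counts stand in for the TF-IDF weighting the package uses.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); the package relies on NLTK.
STOP_WORDS = {"a", "an", "and", "as", "for", "from", "in", "of", "the", "to", "was"}
MIN_WORDS = 5  # hypothetical minimum paragraph length


def clean_tokens(paragraph):
    """Lowercase and tokenize, check the length, then drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", paragraph.lower())
    if len(tokens) < MIN_WORDS:
        raise ValueError("paragraph is too short to analyze")
    return [t for t in tokens if t not in STOP_WORDS]


def ngram_candidates(tokens, max_n=3, top_k=5):
    """Count unigrams, bigrams, and trigrams and return the most frequent."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return [phrase for phrase, _ in counts.most_common(top_k)]


text = "Manmohan Singh served as Prime Minister of India. Singh was Finance Minister in 1991."
tokens = clean_tokens(text)
print(ngram_candidates(tokens))
```

Each candidate phrase would then serve as a search query. A TF-IDF weighting would additionally down-rank terms that are common across documents, which raw counts cannot do.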

Installation

pip install ks-domain-tagger

Usage

Import the judge module into your Python script and call its judge function:

from ks_domain_tagger import judge # Assuming __init__.py makes judge available

paragraph_to_analyze = """
Manmohan Singh, an economist and politician, served as the 13th Prime Minister of India
from 2004 to 2014. Renowned for his role in the economic reforms of the 1990s, Singh
was instrumental in steering the country toward liberalization, fostering economic growth,
and enhancing India's global standing. His tenure as Finance Minister in 1991, during a
time of economic crisis, marked a pivotal moment in India's transformation, with bold
measures such as trade liberalization, reducing government control, and encouraging
foreign investment. A man of humility and intellect, Singh's leadership was marked by
pragmatism and caution. He is widely respected for his integrity and efforts to balance
economic growth with social development. Despite his relatively low-key personality,
Manmohan Singh’s impact on India’s economic landscape remains indelible, solidifying his
legacy as a key architect of modern India’s economic foundation.
"""

# Basic usage
results = judge.judge(paragraph_to_analyze)
print(results)

# Usage with second pass and different thresholds
results_pass2 = judge.judge(
    para=paragraph_to_analyze,
    threshold=50,        # Initial similarity threshold for pass 1
    pass2=True,          # Enable second pass
    threshold2=53,       # Similarity threshold for pass 2
    visit_all_pages=False # For pass 2, only search links in summary sections
)
print(results_pass2)

The judge function returns a dictionary where keys are the titles of relevant Wikipedia articles and values are their softmax scores indicating relevance.
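
Because the result is a plain dictionary, ranking the matches takes only a sort. The sketch below uses a placeholder dictionary in place of a real call; the titles, the scores, and the helper name top_matches are invented for illustration and are not output from the package.

```python
# Placeholder standing in for the return value of judge.judge(...);
# the titles and scores are invented for illustration only.
sample_results = {
    "Article A": 0.2,
    "Article B": 0.7,
    "Article C": 0.1,
}


def top_matches(results, k=2):
    """Return the k highest-scoring (title, score) pairs, best first."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)[:k]


for title, score in top_matches(sample_results):
    print(f"{title}: {score:.2f}")
```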

Dependencies

The project relies on the following Python libraries:

  • nltk>=3.6
  • scikit-learn>=1.0
  • requests>=2.25
  • beautifulsoup4>=4.9
  • rapidfuzz>=1.8
  • numpy>=1.20
  • termcolor>=1.1.0 (primarily for test.py)

pip installs these automatically when you install the package from PyPI.

How It Works

  1. Input & Validation: The input paragraph is validated for length and cleaned by removing common stop words.
  2. Keyword Extraction: Keywords (single words, bigrams, trigrams) are extracted using TF-IDF and NLTK.
  3. Wikipedia Search (Pass 1): Keywords are used to find relevant articles via the Wikipedia API.
  4. Content Fetching (Pass 1): Content from these articles is downloaded.
  5. Scoring (Pass 1): The input paragraph is compared against fetched Wikipedia paragraphs using rapidfuzz. Scores are normalized using softmax.
  6. Wikipedia Search (Pass 2 - Optional): If enabled, links from the top articles found in Pass 1 are explored to find more potentially relevant articles.
  7. Content Fetching & Scoring (Pass 2 - Optional): Content from these new articles is fetched and scored similarly.
  8. Output: The system returns a dictionary mapping each relevant Wikipedia title to its softmax relevance score.
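
The scoring step can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: difflib.SequenceMatcher from the standard library stands in for rapidfuzz so the example runs without extra installs, and the helper names, the choice to keep each article's best paragraph score, and filtering by the threshold before normalizing are all hypothetical.

```python
import math
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for a rapidfuzz ratio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def softmax(scores):
    """Numerically stable softmax over a dict of title -> raw score."""
    if not scores:
        return {}
    peak = max(scores.values())
    exps = {title: math.exp(value - peak) for title, value in scores.items()}
    total = sum(exps.values())
    return {title: value / total for title, value in exps.items()}


def score_articles(paragraph, articles, threshold=50):
    """Keep each article's best paragraph score if it clears the threshold,
    then normalize the surviving scores with softmax."""
    raw = {}
    for title, paragraphs in articles.items():
        best = max((similarity(paragraph, p) for p in paragraphs), default=0.0)
        if best >= threshold:
            raw[title] = best
    return softmax(raw)


# Hypothetical article contents, keyed by title.
articles = {
    "Match": ["the quick brown fox"],
    "Other": ["zzzz qqqq"],
}
print(score_articles("the quick brown fox", articles, threshold=0))
```

Applying softmax directly to scores on a 0-100 scale makes the output sharply peaked, since a gap of a few points becomes a large ratio after exponentiation. Rescaling the scores before normalizing is one way to soften that.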

License

This project is licensed under the MIT License. See the LICENSE file for details.
