TFIDFVectorizer

TFIDFVectorizer is a custom implementation of the TF-IDF transformation, built on top of scikit-learn's TfidfVectorizer. It is written in Python and relies on numpy, scikit-learn, and other commonly used packages.

The main aim of this implementation is to provide a simple and efficient way of transforming a collection of text documents into a matrix representation, which can then be used as input to various machine learning algorithms.
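As a rough illustration of what a TF-IDF transformation computes, the sketch below builds the matrix by hand using only the standard library. It assumes whitespace tokenization, the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, and L2 row normalization, which mirror scikit-learn's documented defaults; the function name tfidf_rows and the sample documents are hypothetical and not part of this package.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    # Simplified tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # L2-normalize each row so documents of different lengths are comparable.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

Terms that occur in every document (here "the") receive the lowest weight in each row, while terms unique to one document receive the highest.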

Installation

The package can be installed using pip:

pip install tfIdfInheritVectorizer

Usage

To use TFIDFVectorizer, create an instance of the class and call its fit_transform method. The method takes a list of text documents as input and returns a sparse matrix of TF-IDF scores, with one row per document.

from tfIdfInheritVectorizer.feature_extraction.vectorizer import TFIDFVectorizer


text_data = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

vectorizer = TFIDFVectorizer()
tfidf = vectorizer.fit_transform(text_data)
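To see what the returned matrix looks like, the sketch below runs the same documents through scikit-learn's own TfidfVectorizer, which this package is described as building on. It is an assumption, not something verified here, that TFIDFVectorizer exposes the same vocabulary_ attribute and produces the same shape.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Stand-in for TFIDFVectorizer (assumption: it behaves like its scikit-learn base).
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(text_data)

# One row per document, one column per distinct term in the corpus.
print(tfidf.shape)  # (4, 9)

# vocabulary_ maps each term to its column index.
print(sorted(vectorizer.vocabulary_))

# Convert to a dense array only for small inputs; the result is sparse by default.
print(tfidf.toarray().round(2))
```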

In addition to fit_transform, TFIDFVectorizer has a transform method that converts new text data into a matrix representation, provided the vectorizer has already been fitted to the training data.

new_text_data = [
    "This is a new document.",
    "Is this a new one?"
]

new_tfidf = vectorizer.transform(new_text_data)
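One consequence worth checking is that transform reuses the vocabulary learned during fitting, so terms that never appeared in the training data are ignored. The sketch below demonstrates this with scikit-learn's TfidfVectorizer as a stand-in; it is assumed, not confirmed, that TFIDFVectorizer behaves the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
new_text_data = [
    "This is a new document.",
    "Is this a new one?",
]

# Stand-in for TFIDFVectorizer (assumption: same behavior as the scikit-learn base).
vectorizer = TfidfVectorizer().fit(text_data)
new_tfidf = vectorizer.transform(new_text_data)

# The column count matches the fitted vocabulary, not the new documents.
print(new_tfidf.shape)  # (2, 9)

# "new" was never seen during fitting, so it has no column and is dropped.
print("new" in vectorizer.vocabulary_)  # False
```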

Configuration

The TFIDFVectorizer has several parameters that can be configured to customize its behavior. Some of the most important parameters are:

  • stop_words: a list of stop words to ignore during tokenization.
    vectorizer = TFIDFVectorizer(stop_words=["is", "the", "this"])
  • max_features: the maximum number of features to keep, ranked by term frequency across the entire corpus.
    vectorizer = TFIDFVectorizer(max_features=50)
  • use_idf: a flag indicating whether to apply inverse document frequency (IDF) weighting.
    vectorizer = TFIDFVectorizer(use_idf=False)
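The effect of these three parameters can be checked on a small corpus. The sketch below uses scikit-learn's TfidfVectorizer as a stand-in, on the assumption that TFIDFVectorizer passes these parameters through unchanged; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = [
    "This is the first document.",
    "This is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# stop_words: the listed terms are excluded from the vocabulary.
no_stop = TfidfVectorizer(stop_words=["is", "the", "this"]).fit(text_data)
print(sorted(no_stop.vocabulary_))

# max_features: keep only the most frequent terms across the corpus.
top3 = TfidfVectorizer(max_features=3).fit(text_data)
print(sorted(top3.vocabulary_))

# use_idf=False: weights are L2-normalized term counts, with no IDF factor.
tf_only = TfidfVectorizer(use_idf=False).fit_transform(text_data)
print(tf_only[0].toarray().round(3))
```

With use_idf=False, every term in the first document gets the same weight, because each occurs once and only the row normalization is applied.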

For a full list of parameters, see the scikit-learn documentation for TfidfVectorizer.

Conclusion

TFIDFVectorizer is a simple and efficient implementation of the TF-IDF transformation algorithm, suitable for use in various machine learning applications. By using scikit-learn as a base, it provides a wide range of customization options and can be easily integrated into existing machine learning workflows.

License

MIT
