Skip to main content

Molda is a sci-kit learn inspired Python library for text vectorization of corpora. It is adapted to work in pipelines and numpy arrays.

Project description

Molda

Molda is a sci-kit learn inspired Python library for text vectorization of corpora. It is adapted to work in pipelines and numpy arrays.

The current version supports many algorithms denoted by the following classes:

  • TTestVectorizer
  • TficfVectorizer
  • ObservedExpectedVectorizer
  • LTUVectorizer
  • Gref94Vectorizer
  • ATCVectorizer

These classes are based on the sci-kit learn's CountVectorizer.

You need to instantiate the vectorizer with the parameters you need, fit and apply the transformations. Here is an example:

from Tficf import TficfVectorizer

corpus = np.array([
    "Even though I enjoyed watching that, This is bullshit",
    "I really enjoyed watching that",
    "I resent watching this video"
])

y = [1, 0, 1]

v = TficfVectorizer()
v.fit(corpus, y)
v.transform(['Hello, there'])

Also, you can include the vectorizer in a pipeline, like in the following example:

pipe = Pipeline([
            ('vectorizer', TficfVectorizer()),
            ('scaler', StandardScaler(with_mean=False)),
            ('estimator', SVC())
        ])
pipe.fit(corpus, y)
pipe.score(corpus, y)
pipe.predict(['This is wonderful'])

Molda works with Pandas DataFrames too:

df = pd.read_csv('../irony-labeled.csv')
df = df.dropna()

corpus_ = df['comment_text'].to_numpy()
y_ = df['label'].to_numpy()

v = TficfVectorizer()
v.fit(corpus_, y_)
v.transform(['Hello, there', 'Goodbye'])

With love from Sigmoid.

We are open for feedback. Please send your impression to vladimir.stojoc@gmail.com

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

molda-0.1.2.tar.gz (15.2 kB view hashes)

Uploaded Source

Built Distribution

molda-0.1.2-py3-none-any.whl (23.2 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page