Scratch NLP 🧠
Library with foundational NLP Algorithms implemented from scratch using PyTorch.
Table of Contents 📋
- Documentation
- Installation
- Features
- Examples
- Contributing
- Acknowledgements
- About Me
- Lessons Learned
- License
- Feedback
Documentation 📝
Installation ⬇️
Install using pip
pip install ScratchNLP
Install manually for development
Clone the repo
gh repo clone shanmukh05/scratch_nlp
Install dependencies
pip install -r requirements.txt
Features 🛠️
- Algorithms
  - Bag of Words
  - Ngram
  - TF-IDF
  - Hidden Markov Model
  - Word2Vec
  - GloVe
  - RNN (Many to One)
  - LSTM (One to Many)
  - GRU (Many to Many Synced)
  - Seq2Seq + Attention (Many to Many)
  - Transformer
  - BERT
  - GPT-2
- Tokenization
  - BytePair Encoding
  - WordPiece Tokenizer
- Metrics
  - BLEU
  - ROUGE (-N, -L, -S)
  - Perplexity
  - METEOR
  - CIDEr
- Datasets
  - IMDB Reviews Dataset
  - Flickr Dataset
  - NLTK POS Datasets (treebank, brown, conll2000)
  - SQuAD QA Dataset
  - Genius Lyrics Dataset
  - LAMBADA Dataset
  - Wiki en dataset
  - English to Telugu Translation Dataset
- Tasks
  - Sentiment Classification
  - POS Tagging
  - Image Captioning
  - Machine Translation
  - Question Answering
  - Text Generation
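Of the metrics listed above, perplexity is compact enough to show in a few lines. The sketch below is only an illustration of the definition (the exponential of the average negative log-likelihood per token), not this library's implementation. The function name `perplexity` and the plain-list input are assumptions made for this example.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each true token.

    `token_probs` is a hypothetical input format: one probability in (0, 1]
    per target token in the sequence.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Perplexity is the exponential of that average.
    return math.exp(avg_nll)


# A model that spreads probability evenly over 4 choices is "4 ways confused".
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# A more confident model scores lower (better).
print(perplexity([0.9, 0.8, 0.95]))
```

Lower values mean the model assigned higher probability to the reference tokens.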
Implementation Details
Algorithm | Task | Tokenization | Output | Dataset |
---|---|---|---|---|
BOW | Text Representation | Preprocessed words | | IMDB Reviews |
Ngram | Text Representation | Preprocessed words | | IMDB Reviews |
TF-IDF | Text Representation | Preprocessed words | | IMDB Reviews |
HMM | POS Tagging | Preprocessed words | | NLTK Treebank |
Word2Vec | Text Representation | Preprocessed words | | IMDB Reviews |
GloVe | Text Representation | Preprocessed words | | IMDB Reviews |
RNN | Sentiment Classification | Preprocessed words | | IMDB Reviews |
LSTM | Image Captioning | Preprocessed words | | Flickr 8k |
GRU | POS Tagging | Preprocessed words | | NLTK Treebank, Brown, CoNLL2000 |
Seq2Seq + Attention | Machine Translation | Tokenization | | English to Telugu Translation |
Transformer | Lyrics Generation | BytePairEncoding | | Genius Lyrics |
BERT | NSP Pretraining, QA Finetuning | WordPiece | | Wiki en, SQuAD v1 |
GPT-2 | Sentence Completion | BytePairEncoding | | LAMBADA |
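The table lists BytePairEncoding as the tokenizer for the Transformer and GPT-2 rows. As a rough illustration of the idea, and not a copy of this library's code, the sketch below learns merge rules by repeatedly fusing the most frequent adjacent symbol pair. The function name `learn_bpe` and the `</w>` end-of-word marker are assumptions chosen for this example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word becomes a tuple of symbols plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word, fusing every occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe(["abab", "abab", "ab"], num_merges=2))
```

The learned merges are applied in the same order at tokenization time, so frequent character sequences end up as single tokens while rare words fall back to smaller pieces.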
Examples 🌟
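Run training and inference directly through import
import yaml
from scratch_nlp.src.core.gpt import gpt

# Path to the YAML config for the chosen algorithm
config_path = "<config_path>"
with open(config_path, "r") as stream:
    config_dict = yaml.safe_load(stream)

# Use a distinct name so the imported `gpt` module is not shadowed
model = gpt.GPT(config_dict)
model.run()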
Run through CLI
cd src
python main.py --config_path '<config_path>' --algo '<algo name>' --log_folder '<output folder>'
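If the CLI needs to be launched from another Python script, the command above can be assembled programmatically. This is a minimal sketch that only reuses the three flags shown in the command. The helper name `build_command` and the example paths and algorithm name are hypothetical, and actually running the command is left as a comment.

```python
import shlex


def build_command(config_path, algo, log_folder):
    """Assemble the documented main.py invocation as an argument list (hypothetical helper)."""
    return [
        "python", "main.py",
        "--config_path", config_path,
        "--algo", algo,
        "--log_folder", log_folder,
    ]


# Example values are placeholders, not paths that ship with the repository.
cmd = build_command("configs/gpt.yaml", "gpt", "runs/gpt")
print(shlex.join(cmd))

# To execute it from the repository root, something like this should work:
#   import subprocess
#   subprocess.run(cmd, cwd="src", check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.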
Contributing 🤝
Contributions are always welcome!
See CONTRIBUTING.md for ways to get started.
Acknowledgements 💡
I referred to many online resources while creating this project and am collecting all of them in RESOURCES.md. Thanks to everyone who created those blogs, code, and datasets 😊.
Thanks to the CS224N course, which gave me the motivation to start this project.
About Me 👤
I am Shanmukha Sainath, working as an AI Engineer at KLA Corporation. I completed my Bachelor's degree in the Department of Electronics and Electrical Communication Engineering at IIT Kharagpur, with a Minor in Computer Science Engineering and a Micro in Artificial Intelligence and Applications.
Connect with me
Lessons Learned 📌
Most of what went into this project was new to me. These are my major learnings from building it:
- NLP Algorithms
- Research paper Implementation
- Designing Project structure
- Documentation
- GitHub pages
- PIP packaging
License ⚖️
Feedback 📣
If you have any feedback, please reach out to me at venkatashanmukhasainathg@gmail.com.