John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.
Spark NLP: State of the Art Natural Language Processing
Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in 192+ languages. It supports state-of-the-art transformers such as BERT, XLNet, ELMO, ALBERT, and Universal Sentence Encoder that can be used seamlessly in a cluster. It also offers Tokenization, Word Segmentation, Part-of-Speech Tagging, Named Entity Recognition, Dependency Parsing, Spell Checking, Multi-class Text Classification, Multi-class Sentiment Analysis, Machine Translation (+180 languages), Summarization and Question Answering (Google T5), and many more NLP tasks.
For user documentation and examples, take a look at our official Spark NLP page: http://nlp.johnsnowlabs.com/
- Slack: For live discussion with the Spark NLP community and the team
- GitHub: Bug reports, feature requests, and contributions
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium: Spark NLP articles
- YouTube: Spark NLP video tutorials
- Trainable Word Segmentation
- Stop Words Removal
- Token Normalizer
- Document Normalizer
- Regex Matching
- Text Matching
- Date Matcher
- Sentence Detector
- Deep Sentence Detector (Deep learning)
- Dependency parsing (Labeled/unlabeled)
- Part-of-speech tagging
- Sentiment Detection (ML models)
- Spell Checker (ML and DL models)
- Word Embeddings (GloVe and Word2Vec)
- BERT Embeddings (TF Hub models)
- ELMO Embeddings (TF Hub models)
- ALBERT Embeddings (TF Hub models)
- XLNet Embeddings
- Universal Sentence Encoder (TF Hub models)
- BERT Sentence Embeddings (42 TF Hub models)
- Sentence Embeddings
- Chunk Embeddings
- Unsupervised keywords extraction
- Language Detection & Identification (up to 375 languages)
- Multi-class Sentiment analysis (Deep learning)
- Multi-label Sentiment analysis (Deep learning)
- Multi-class Text Classification (Deep learning)
- Neural Machine Translation
- Text-To-Text Transfer Transformer (Google T5)
- Named entity recognition (Deep learning)
- Easy TensorFlow integration
- GPU Support
- Full integration with Spark ML functions
- +710 pre-trained models in +192 languages!
- +450 pre-trained pipelines in +192 languages!
- Multi-lingual NER models: Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, and Urdu.
This is a quick example of how to use a Spark NLP pre-trained pipeline in Python and PySpark:
    $ java -version
    # should be Java 8 (Oracle or OpenJDK)
    $ conda create -n sparknlp python=3.6 -y
    $ conda activate sparknlp
    $ pip install spark-nlp pyspark==2.4.7
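The Java 8 requirement above can also be checked from Python before starting a session. The following is a minimal sketch, not part of Spark NLP: the helper names `java_major_version` and `installed_java_major` are hypothetical, and the parsing assumes the usual `version "..."` line that `java -version` prints to stderr.

```python
import re
import subprocess


def java_major_version(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Java 8 and earlier report e.g. "1.8.0_292"; Java 9+ report e.g. "11.0.2".
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major():
    """Run `java -version` and parse its output; None if Java is not on PATH."""
    try:
        proc = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    return java_major_version(proc.stderr)


# Sample banner strings, written by hand for illustration.
print(java_major_version('openjdk version "1.8.0_292"'))  # 8
print(java_major_version('java version "11.0.2" 2019-01-15 LTS'))  # 11
```

Calling `installed_java_major()` in the activated environment should return `8` if the setup matches the instructions above.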
In a Python console or Jupyter notebook:
    # Import Spark NLP
    from sparknlp.base import *
    from sparknlp.annotator import *
    from sparknlp.pretrained import PretrainedPipeline
    import sparknlp

    # Start Spark Session with Spark NLP
    # The start() function has two parameters: gpu and spark23
    # sparknlp.start(gpu=True) will start the session with GPU support
    # sparknlp.start(spark23=True) is when you have Apache Spark 2.3.x installed
    spark = sparknlp.start()

    # Download a pre-trained pipeline
    pipeline = PretrainedPipeline('explain_document_dl', lang='en')

    # Your testing dataset
    text = """
    The Mona Lisa is a 16th century oil painting created by Leonardo.
    It's held at the Louvre in Paris.
    """

    # Annotate your testing dataset
    result = pipeline.annotate(text)

    # What's in the pipeline
    list(result.keys())
    Output: ['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence']

    # Check the results
    result['entities']
    Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']
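The `token` and `ner` keys listed in the output can be combined to see how per-token tags relate to the merged strings under `entities`. The sketch below is plain Python and does not call Spark NLP. It assumes that `ner` holds one IOB-style tag per token (such as `B-PER`, `I-PER`, `O`). The `sample_result` dictionary is written by hand for illustration and is not real pipeline output, and `group_entities` is a hypothetical helper, not a library function.

```python
def group_entities(tokens, tags):
    """Merge IOB-tagged tokens into entity strings."""
    entities = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any entity still being built.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            # Continuation of the entity in progress.
            current.append(token)
        else:
            # Outside any entity (or a stray I- tag): close the current one.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# Hand-written stand-in for part of the dict returned by pipeline.annotate(text).
# The tag values are assumptions made for this example.
sample_result = {
    "token": ["The", "Mona", "Lisa", "is", "held", "at", "the", "Louvre", "in", "Paris", "."],
    "ner": ["O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "O", "B-LOC", "O"],
}

print(group_entities(sample_result["token"], sample_result["ner"]))
# ['Mona Lisa', 'Louvre', 'Paris']
```

With a real `result` from the pipeline, the same call would be `group_entities(result['token'], result['ner'])`, provided the two lists are aligned one tag per token.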
For more examples, you can visit our dedicated repository to showcase all Spark NLP use cases!
| Filename, size | File type | Python version |
|---|---|---|
| spark_nlp-2.7.5-py2.py3-none-any.whl (140.0 kB) | Wheel | py2.py3 |
| spark-nlp-2.7.5.tar.gz (30.9 kB) | Source | None |