
loads spark

Project description

This package is a configuration-loading tool for pyspark. Spark configs are specified in YAML, with some examples in the conf folder. The primary entry point for loading Spark is the NewSparkSession class, used like this:

from spark_loader.spark import NewSparkSession
 
sess = NewSparkSession('local', 'main', './conf/local.yaml')
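
NewSparkSession presumably wraps the standard pyspark session builder, with the builder options coming from the YAML file. For comparison, a roughly equivalent plain pyspark setup might look like this (a sketch only; the spark.driver.memory entry is an illustrative config value, and the real keys live in the conf examples):

from pyspark.sql import SparkSession

# build a local session directly with pyspark; spark_loader reads these
# settings from YAML instead of hard-coding them
spark = SparkSession.builder \
    .master('local') \
    .appName('main') \
    .config('spark.driver.memory', '8g') \
    .getOrCreate()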

The spark-nlp library is included. Some example workflows, using an already-initialized session, are:

Graph extraction

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    NerDLModel, GraphExtraction,
    UniversalSentenceEncoder,
    Tokenizer, WordEmbeddingsModel
)


# assumes sess was created with NewSparkSession as shown above

# sentence embeddings (defined here but not added to the pipeline stages below)
use = UniversalSentenceEncoder \
    .pretrained() \
    .setInputCols("document") \
    .setOutputCol("use_embeddings")

document_assembler = DocumentAssembler() \
    .setInputCol("value") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel \
    .pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")


ner_tagger = NerDLModel \
    .pretrained() \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

graph_extraction = GraphExtraction() \
            .setInputCols(["document", "token", "ner"]) \
            .setOutputCol("graph") \
            .setRelationshipTypes(["lad-PER", "lad-LOC"]) \
            .setMergeEntities(True)

graph_pipeline = Pipeline() \
    .setStages([
        document_assembler, tokenizer,
        word_embeddings, ner_tagger,
        graph_extraction
    ])

df = sess.read.text('./data/train.dat')
graph_df = graph_pipeline.fit(df).transform(df)
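
The extracted relationships land in the graph output column of the transformed DataFrame. A quick way to inspect them (a minimal sketch using the standard DataFrame API):

# show the paths found between the configured entity types
graph_df.select("graph").show(truncate=False)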

LDA topic modeling

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import (
    Normalizer, Tokenizer, StopWordsCleaner, Stemmer
)
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import CountVectorizer
 
document_assembler = DocumentAssembler() \
    .setInputCol("value") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

finisher = Finisher() \
    .setInputCols(["stem"]) \
    .setOutputCols(["tokens"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)

pipe = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            stemmer, 
            finisher])

# df holds the raw text loaded as in the previous example, e.g. sess.read.text('./data/train.dat')
nlp_model = pipe.fit(df)
processed_df = nlp_model.transform(df)

cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=12000, minDF=3.0) 
cv_model = cv.fit(processed_df) 
vectorized_tokens = cv_model.transform(processed_df)

print("forming topics")

num_topics = 10
lda = LDA(k=num_topics, maxIter=50)
model = lda.fit(vectorized_tokens)
ll = model.logLikelihood(vectorized_tokens)
lp = model.logPerplexity(vectorized_tokens)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spark_loader-0.0.5.tar.gz (3.8 kB)

Uploaded Source

Built Distribution

spark_loader-0.0.5-py3-none-any.whl (4.5 kB)

Uploaded Python 3

File details

Details for the file spark_loader-0.0.5.tar.gz.

File metadata

  • Download URL: spark_loader-0.0.5.tar.gz
  • Upload date:
  • Size: 3.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.9.13 Linux/6.6.6-76060606-generic

File hashes

Hashes for spark_loader-0.0.5.tar.gz
Algorithm Hash digest
SHA256 9b4c27ae890d6fd2fe9c44d5c853449ef46e0cc98ffa515534b7cf708335e2a5
MD5 bcb75ef7b50b1341c1b1c86380e65c64
BLAKE2b-256 0ffc0a53552365cdde130aebb4b69bbce39b110bd2b03b6471858f9771452d6b


File details

Details for the file spark_loader-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: spark_loader-0.0.5-py3-none-any.whl
  • Upload date:
  • Size: 4.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.9.13 Linux/6.6.6-76060606-generic

File hashes

Hashes for spark_loader-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 ca7aa051aefe33312d1e4b0dfce2103b71aa7da8676c0ac7035684c5886c44a9
MD5 947b355f464780e4bdc3c98ed3f36a36
BLAKE2b-256 6c0445cfa848bd54b132a29554b36ea34aa9829a2dab48f5593b143ff3531e4b

