loads spark

Project description

This package is a configuration loading tool for PySpark. Spark configs are specified in YAML, with some examples in the conf folder. The primary entry point is the NewSparkSession class, used like this:

from spark_loader.spark import NewSparkSession
 
sess = NewSparkSession('local', 'main', './conf/local.yaml')
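
As a quick sanity check, the returned session can be exercised directly. This sketch assumes sess exposes the standard SparkSession API, which the examples below also rely on:

# `sess` behaves like a regular SparkSession in the examples that follow
df = sess.createDataFrame([(1, "spark"), (2, "loader")], ["id", "name"])
df.show()

# print the configuration values the session was started with
for key, value in sess.sparkContext.getConf().getAll():
    print(key, "=", value)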

The spark-nlp library is included. Some workflow examples using an initialized session follow:

Graph extraction

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    NerDLModel, NerDLApproach,
    GraphExtraction, UniversalSentenceEncoder,
    Tokenizer, WordEmbeddingsModel
)


# assumes the Spark session `sess` created with NewSparkSession above

# optional sentence embeddings; not used by the graph pipeline stages below
use = UniversalSentenceEncoder \
    .pretrained() \
    .setInputCols("document") \
    .setOutputCol("use_embeddings")

document_assembler = DocumentAssembler() \
    .setInputCol("value") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel \
    .pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")


ner_tagger = NerDLModel \
    .pretrained() \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setRelationshipTypes(["lad-PER", "lad-LOC"]) \
    .setMergeEntities(True)

graph_pipeline = Pipeline() \
    .setStages([
        document_assembler, tokenizer,
        word_embeddings, ner_tagger,
        graph_extraction
    ])

df = sess.read.text('./data/train.dat')
graph_df = graph_pipeline.fit(df).transform(df)
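
To inspect the extracted relationships, the graph annotation column can be unpacked with plain DataFrame operations. This is a minimal sketch assuming the standard Spark NLP annotation schema, where each element carries result and metadata fields:

# each row of `graph` holds an array of annotations; explode it and show
# the matched paths together with their metadata
graph_df.selectExpr("explode(graph) AS g") \
    .selectExpr("g.result", "g.metadata") \
    .show(truncate=False)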

LDA topic modeling

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import (
    Normalizer, Tokenizer, StopWordsCleaner, Stemmer
)
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import CountVectorizer
 
document_assembler = DocumentAssembler() \
    .setInputCol("value") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

finisher = Finisher() \
    .setInputCols(["stem"]) \
    .setOutputCols(["tokens"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)

nlp_pipeline = Pipeline(
    stages=[document_assembler,
            tokenizer,
            normalizer,
            stopwords_cleaner,
            stemmer,
            finisher])

# reuse the data loaded in the graph extraction example, e.g.
# df = sess.read.text('./data/train.dat')
nlp_model = nlp_pipeline.fit(df)
processed_df = nlp_model.transform(df)

cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=12000, minDF=3.0)
cv_model = cv.fit(processed_df)
vectorized_tokens = cv_model.transform(processed_df)

print("forming topics")

num_topics = 10
lda = LDA(k=num_topics, maxIter=50)
model = lda.fit(vectorized_tokens)
ll = model.logLikelihood(vectorized_tokens)
lp = model.logPerplexity(vectorized_tokens)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spark_loader-0.1.0.tar.gz (3.8 kB)

Uploaded Source

Built Distribution

spark_loader-0.1.0-py3-none-any.whl (4.5 kB)

Uploaded Python 3

File details

Details for the file spark_loader-0.1.0.tar.gz.

File metadata

  • Download URL: spark_loader-0.1.0.tar.gz
  • Upload date:
  • Size: 3.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.9.13 Linux/6.6.10-76060610-generic

File hashes

Hashes for spark_loader-0.1.0.tar.gz
Algorithm Hash digest
SHA256 3defabb5e9d1e102291d08adf85e9f73a99e64f2d4a06f3206ce2d575ebe1f6f
MD5 c4e28f1bb455523b05e13e83ebbb00d5
BLAKE2b-256 203a158c520784d123a16938f09b5867e36165753c4765e58870de8d2b1d2024

See more details on using hashes here.

File details

Details for the file spark_loader-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: spark_loader-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 4.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.9.13 Linux/6.6.10-76060610-generic

File hashes

Hashes for spark_loader-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ab2dbab2b1fc370dbd7dda730a7bf258d7c8ef8f6321d3313cd162c6ba6345ef
MD5 921fc28a5794bc7bd1cf9c5a721f99d6
BLAKE2b-256 e6ff71664a7c0d7de39bfb384482f5a1e7bfea2e9e80956012faccee33830fbb

See more details on using hashes here.
