spark-loader
Project description
This package is a configuration-loading tool for PySpark. Spark configurations are specified in YAML, with some examples in the conf folder. The primary entry point for loading Spark is the NewSparkSession class, used like this:
from spark_loader.spark import NewSparkSession
sess = NewSparkSession('local', 'main', './conf/local.yaml')
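What goes in the YAML is determined by the examples in the conf folder; purely as an illustration, a file like conf/local.yaml might hold standard Spark properties. The flat layout below is an assumption for illustration, not the package's documented schema:

# hypothetical conf/local.yaml -- see the conf folder for the real schema
spark.driver.memory: 4g
spark.executor.memory: 2g
spark.sql.shuffle.partitions: 8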
The spark-nlp library is included as a dependency. Some example workflows, each assuming an initialized session, follow.
Graph extraction
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import (
    NerDLModel, GraphExtraction,
    Tokenizer, WordEmbeddingsModel
)

# load the spark session (sess) before this
document_assembler = DocumentAssembler() \
    .setInputCol("value") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel \
    .pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_tagger = NerDLModel \
    .pretrained() \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

graph_extraction = GraphExtraction() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("graph") \
    .setRelationshipTypes(["lad-PER", "lad-LOC"]) \
    .setMergeEntities(True)

graph_pipeline = Pipeline() \
    .setStages([
        document_assembler, tokenizer,
        word_embeddings, ner_tagger,
        graph_extraction
    ])

df = sess.read.text('./data/train.dat')
graph_df = graph_pipeline.fit(df).transform(df)
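The last line fits the pipeline and produces a DataFrame with the extracted graphs. As a minimal usage sketch (assuming the graph_df assignment above), the output column can be inspected with:

graph_df.select("graph").show(truncate=False)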
LDA topic modeling
from sparknlp.base import DocumentAssembler, Pipeline, Finisher
from sparknlp.annotator import (
    Normalizer, Tokenizer, StopWordsCleaner, Stemmer
)
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import CountVectorizer

document_assembler = DocumentAssembler() \
    .setInputCol("value") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner() \
    .setInputCols("normalized") \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

finisher = Finisher() \
    .setInputCols(["stem"]) \
    .setOutputCols(["tokens"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)

nlp_pipeline = Pipeline(
    stages=[document_assembler,
            tokenizer,
            normalizer,
            stopwords_cleaner,
            stemmer,
            finisher])

# df loaded as in the graph extraction example
nlp_model = nlp_pipeline.fit(df)
processed_df = nlp_model.transform(df)

cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=12000, minDF=3.0)
cv_model = cv.fit(processed_df)
vectorized_tokens = cv_model.transform(processed_df)

print("forming topics")
num_topics = 10
lda = LDA(k=num_topics, maxIter=50)
model = lda.fit(vectorized_tokens)
ll = model.logLikelihood(vectorized_tokens)
lp = model.logPerplexity(vectorized_tokens)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
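If you just want the package installed rather than the raw files, pip is the usual route (assuming the distribution is published under the name spark-loader):

pip install spark-loader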
Source Distribution
spark_loader-0.1.0.tar.gz (3.8 kB)
Built Distribution
spark_loader-0.1.0-py3-none-any.whl (4.5 kB)
File details
Details for the file spark_loader-0.1.0.tar.gz.
File metadata
- Download URL: spark_loader-0.1.0.tar.gz
- Upload date:
- Size: 3.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.9.13 Linux/6.6.10-76060610-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3defabb5e9d1e102291d08adf85e9f73a99e64f2d4a06f3206ce2d575ebe1f6f
MD5 | c4e28f1bb455523b05e13e83ebbb00d5
BLAKE2b-256 | 203a158c520784d123a16938f09b5867e36165753c4765e58870de8d2b1d2024
File details
Details for the file spark_loader-0.1.0-py3-none-any.whl.
File metadata
- Download URL: spark_loader-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.7.1 CPython/3.9.13 Linux/6.6.10-76060610-generic
File hashes
Algorithm | Hash digest
---|---
SHA256 | ab2dbab2b1fc370dbd7dda730a7bf258d7c8ef8f6321d3313cd162c6ba6345ef
MD5 | 921fc28a5794bc7bd1cf9c5a721f99d6
BLAKE2b-256 | e6ff71664a7c0d7de39bfb384482f5a1e7bfea2e9e80956012faccee33830fbb