socketmap
High-level PySpark tool for applying server-dependent functions
Source Dependencies (Tested on Ubuntu 20.04)
PostgreSQL
sudo apt install postgresql
PySpark
- Go to https://spark.apache.org/downloads.html
- Select package type "Pre-built for Apache Hadoop 3.2 or later"
- Download and extract the tarball
- Run the following
cd spark-3.1.1-bin-hadoop3.2/python
python3 setup.py sdist
sudo python3 -m pip install dist/*.tar.gz
Test Dependencies
Stanford Core NLP
wget http://nlp.stanford.edu/software/stanford-corenlp-latest.zip
unzip stanford-corenlp-latest.zip
export STANFORD_NLP_PATH=$PWD/stanford-corenlp-4.2.0
sudo python3 -m pip install pycorenlp
Installation
sudo python3 -m pip install socketmap
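After installing, it can help to confirm that the Python packages named in the sections above are importable before running anything on Spark. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, not part of socketmap, and `pycorenlp` is only needed for the tests and the example.

```python
import importlib.util
import os


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package names taken from the dependency sections of this README.
required = ['pyspark', 'pycorenlp', 'socketmap']
print('missing modules:', missing_modules(required))

# STANFORD_NLP_PATH is exported in the Test Dependencies section.
print('STANFORD_NLP_PATH set:', 'STANFORD_NLP_PATH' in os.environ)
```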
Tests
bash tests/shell/test_socketmap.sh
Example
Python source script
from pyspark.sql import SparkSession
from pycorenlp import StanfordCoreNLP
from socketmap import socketmap


def parse_sentences(input_rows_iterator):
    # Connect to the CoreNLP server started by the driver script (port 9000)
    nlp = StanfordCoreNLP('http://localhost:9000')
    outputs = []
    for row in input_rows_iterator:
        sentence = row['sentence']
        response = nlp.annotate(
            sentence,
            properties={'annotators': 'parse', 'outputFormat': 'json'},
        )
        # Keep the constituency parse of the first sentence in the response
        output = {'tree': response['sentences'][0]['parse']}
        outputs.append(output)
    return outputs


spark = SparkSession.builder.getOrCreate()
sentences = [
    ['The ball is red.'],
    ['I went to the store.'],
    ['There is a wisdom that is a woe.'],
]
input_dataframe = spark.createDataFrame(sentences, ['sentence'])
output_dataframe = socketmap(spark, input_dataframe, parse_sentences)
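The function passed to `socketmap` above has a simple shape: it takes an iterator of rows, reads fields by name, and returns a list of dictionaries. The README does not spell out the contract beyond this one example, so the following is only a sketch under the assumption that other functions follow the same shape. `count_words` is a hypothetical stand-in that needs no CoreNLP server and can be tried with plain dictionaries in place of Spark rows.

```python
def count_words(input_rows_iterator):
    """Hypothetical stand-in for parse_sentences; not part of socketmap."""
    outputs = []
    for row in input_rows_iterator:
        sentence = row['sentence']
        # One output dictionary per input row, as in parse_sentences.
        outputs.append({'n_words': len(sentence.split())})
    return outputs


# Plain dicts stand in for Spark rows here (an assumption for local testing).
rows = [{'sentence': 'The ball is red.'}, {'sentence': 'I went to the store.'}]
results = count_words(iter(rows))
print(results)  # [{'n_words': 4}, {'n_words': 5}]
```

Trying a function this way, before handing it to `socketmap`, separates errors in the function body from errors in the Spark or server setup.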
Spark driver
DRIVER_CORES=32
APP_NAME=example
DRIVER_MEMORY=160g
EXECUTOR_MEMORY=3g
# run corenlp server
CURDIR=$PWD
cd $STANFORD_NLP_PATH
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 &
cd $CURDIR
sudo runuser -l postgres -c "source $HOME/paths && $SPARK_HOME/bin/spark-submit \
--name $APP_NAME \
--driver-cores $DRIVER_CORES \
--driver-memory $DRIVER_MEMORY \
--executor-memory $EXECUTOR_MEMORY \
${HOME}/socketmap/scripts/python/parse_sentences.py"
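The script above starts the CoreNLP server in the background and submits the Spark job straight away, so the job may reach `parse_sentences` before the server is listening. One way to guard against that is to wait for the port first. The sketch below uses only the standard library; `wait_for_port` is a hypothetical helper, not part of socketmap or pycorenlp, and the demo uses a throwaway local listener so it runs without CoreNLP.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Demo against a temporary local listener on an OS-assigned port.
listener = socket.socket()
listener.bind(('127.0.0.1', 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
print(wait_for_port('127.0.0.1', demo_port, timeout=2.0))  # True
listener.close()
```

For the setup in this README, the call would be `wait_for_port('localhost', 9000)`, matching the `-port 9000` flag passed to the CoreNLP server.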