socketmap

High-level PySpark tool for applying server-dependent functions

Source Dependencies (Tested on Ubuntu 20.04)

PostgreSQL

sudo apt install postgresql

PySpark

  1. Go to https://spark.apache.org/downloads.html
  2. Select package type "Pre-built for Apache Hadoop 3.2 or later"
  3. Download and extract the tarball
  4. Run the following
cd spark-3.1.1-bin-hadoop3.2/python
python3 setup.py sdist
sudo python3 -m pip install dist/*.tar.gz

Test Dependencies

Stanford Core NLP

wget http://nlp.stanford.edu/software/stanford-corenlp-latest.zip
unzip stanford-corenlp-latest.zip
export STANFORD_NLP_PATH=$PWD/stanford-corenlp-4.2.0
sudo python3 -m pip install pycorenlp

Installation

sudo python3 -m pip install socketmap

Tests

bash tests/shell/test_socketmap.sh

Example

Python source script

from pyspark.sql import SparkSession
from pycorenlp import StanfordCoreNLP
from socketmap import socketmap


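def parse_sentences(input_rows_iterator):
    # Connect to the CoreNLP server started on port 9000 (see "Spark driver")
    nlp = StanfordCoreNLP('http://localhost:9000')
    outputs = []
    for row in input_rows_iterator:
        sentence = row['sentence']
        # Request a constituency parse, returned as JSON
        response = nlp.annotate(
            sentence,
            properties={'annotators': 'parse', 'outputFormat': 'json'},
        )
        # Keep the parse tree of the first sentence found in the text
        output = {'tree': response['sentences'][0]['parse']}
        outputs.append(output)
    return outputs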


spark = SparkSession.builder.getOrCreate()
sentences = [
    ['The ball is red.'],
    ['I went to the store.'],
    ['There is a wisdom that is a woe.'],
]
input_dataframe = spark.createDataFrame(sentences, ['sentence'])
output_dataframe = socketmap(spark, input_dataframe, parse_sentences)
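The function passed to socketmap in the example above takes an iterator of rows indexed by column name and returns a list of dictionaries. A minimal sketch of a function with the same shape, but no server dependency, makes it possible to check that contract locally before involving Spark or CoreNLP. The name count_words and the n_words key are hypothetical, and plain dictionaries stand in for Spark rows here; the contract itself is inferred from the example, not from the socketmap documentation.

```python
def count_words(input_rows_iterator):
    """Hypothetical stand-in for parse_sentences with no server dependency."""
    outputs = []
    for row in input_rows_iterator:
        sentence = row['sentence']
        # One output dictionary per input row, as in parse_sentences
        outputs.append({'n_words': len(sentence.split())})
    return outputs


# Plain dicts stand in for Spark rows (an assumption made for local testing)
rows = [
    {'sentence': 'The ball is red.'},
    {'sentence': 'I went to the store.'},
]
results = count_words(iter(rows))
print(results)  # [{'n_words': 4}, {'n_words': 5}]
```

Once the function behaves as expected on plain dictionaries, it could presumably be passed to socketmap in place of parse_sentences.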

Spark driver

DRIVER_CORES=32
APP_NAME=example
DRIVER_MEMORY=160g
EXECUTOR_MEMORY=3g

# run corenlp server
CURDIR=$PWD
cd $STANFORD_NLP_PATH
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 &
cd $CURDIR

sudo runuser -l postgres -c "source $HOME/paths && $SPARK_HOME/bin/spark-submit \
    --name $APP_NAME \
    --driver-cores $DRIVER_CORES \
    --driver-memory $DRIVER_MEMORY \
    --executor-memory $EXECUTOR_MEMORY \
    ${HOME}/socketmap/scripts/python/parse_sentences.py"
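The script above starts the CoreNLP server in the background and submits the Spark job right away, so the job may begin before the server accepts connections. One way to guard against this, sketched below, is to poll the port from Python before creating the server client. The helper wait_for_port is hypothetical and not part of socketmap or pycorenlp; it uses only the standard library.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds,
    or False if none succeeds within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): wait and retry
            time.sleep(interval)
    return False
```

For the setup above, calling wait_for_port('localhost', 9000) at the top of the Python source script, and raising an error if it returns False, would be one option. An open port only shows that the server process is listening, not that its models are loaded.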
