Convenient HDFS access using the Java HDFS client

JHDFS4PY

Convenient HDFS access using the Java HDFS client in Python.

Installation

pip install jhdfs4py

Usage

Please also see the provided demo notebook at ./docs/demo.ipynb.

Create a file file.txt with the text content "Hello World".

from pyspark.sql import SparkSession
from jhdfs4py import HdfsFs

spark = SparkSession.builder.appName("my-app").getOrCreate()
hdfs = HdfsFs.from_spark_session(spark)

hdfs.write_string(
    path="/my/path/file.txt",
    data="Hello World",
)
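Structured data can be stored the same way by serializing it to text first. The following is a minimal sketch that relies only on the write_string call shown above; the helper name write_json and the path in the usage note are hypothetical and not part of jhdfs4py.

```python
import json


def write_json(hdfs, path, obj):
    """Serialize obj to JSON and store it through write_string.

    hdfs is expected to be an HdfsFs instance, or any object exposing a
    compatible write_string(path=..., data=...) method. This helper is
    hypothetical and not part of jhdfs4py.
    """
    # sort_keys gives a stable output, which keeps diffs between runs small.
    text = json.dumps(obj, indent=2, sort_keys=True)
    hdfs.write_string(path=path, data=text)
    return text
```

With hdfs created as above, write_json(hdfs, "/my/path/config.json", {"retries": 3}) would write the serialized dictionary to that path.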

Create a file other_file.blop with the byte content "some data".

from pyspark.sql import SparkSession
from jhdfs4py import HdfsFs

spark = SparkSession.builder.appName("my-app").getOrCreate()
hdfs = HdfsFs.from_spark_session(spark)

hdfs.write_bytes(
    path="/my/path/other_file.blop",
    data=b"some data",
)
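Because write_bytes accepts arbitrary bytes, any binary encoding can be applied before writing. As a sketch under that assumption, the hypothetical helper below (not part of jhdfs4py) gzip-compresses a text with the standard library and passes the result to the write_bytes call shown above.

```python
import gzip


def write_gzipped_text(hdfs, path, text, encoding="utf-8"):
    """Compress text with gzip and store it through write_bytes.

    hdfs is expected to be an HdfsFs instance, or any object exposing a
    compatible write_bytes(path=..., data=...) method. This helper is
    hypothetical and not part of jhdfs4py.
    """
    payload = gzip.compress(text.encode(encoding))
    hdfs.write_bytes(path=path, data=payload)
    return payload
```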

Contribute

Report issues, submit pull requests, or contact us by email (opensource@verpackungsregister.org).

Test Suite

Hadoop Windows Setup

On Windows, a few extra steps are required before running the tests:

  1. Clone https://github.com/steveloughran/winutils into a directory of your choice
  2. Define the environment variable HADOOP_HOME and set it to X:\path\to\winutils\hadoop-3.0.0
  3. Append %HADOOP_HOME%\bin to the PATH environment variable
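Steps 2 and 3 can be verified before running the tests. The following is a small sketch, not part of jhdfs4py: the function name check_hadoop_env is hypothetical, and it only inspects the environment variables named above, assuming PATH already contains the expanded directory.

```python
import ntpath
import os


def check_hadoop_env(env=None):
    """Return a list of problems with the Windows Hadoop setup (empty if none)."""
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return ["HADOOP_HOME is not set"]

    def norm(path):
        # Windows paths are case-insensitive; also drop trailing separators.
        return ntpath.normcase(ntpath.normpath(path))

    expected_bin = norm(ntpath.join(hadoop_home, "bin"))
    # Windows separates PATH entries with semicolons.
    entries = [norm(p) for p in env.get("PATH", "").split(";") if p]
    if expected_bin not in entries:
        return ["%HADOOP_HOME%\\bin is not on PATH"]
    return []
```

Calling check_hadoop_env() without arguments inspects the current process environment; an empty list means both variables look consistent.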

General Setup

The library comes with an extensive pytest test suite that depends on a locally running Py4J gateway, which provides the actual HDFS implementation. The testing gateway is located in the tests/py4j-test-server folder and is implemented by the org.zsvr.py4j.test.TestGatewayServer class. On POSIX systems, the test suite tries to start the gateway automatically unless the USE_EXTERNAL_GATEWAY_SERVER environment variable is set. Use the SBT_SCRIPT environment variable to tell the test suite where your SBT startup script is located.
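The interplay of the two environment variables can be pictured as follows. This is only a sketch of the behaviour described above, not the test suite's actual code: the function name gateway_command is hypothetical, and falling back to a plain sbt on the PATH when SBT_SCRIPT is unset is an assumption.

```python
import os


def gateway_command(env=None, os_name=os.name):
    """Return the command that would auto-start the gateway, or None."""
    env = os.environ if env is None else env
    if "USE_EXTERNAL_GATEWAY_SERVER" in env:
        return None  # an externally managed gateway is expected
    if os_name != "posix":
        return None  # automatic start is only described for POSIX systems
    # Assumption: use a plain "sbt" from PATH when SBT_SCRIPT is not set.
    return [env.get("SBT_SCRIPT", "sbt"), "run"]
```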

To start the gateway manually (don't forget to set USE_EXTERNAL_GATEWAY_SERVER=1 in this case), either import py4j-test-server as an SBT project into your favourite IDE and run it from there, or change to the tests/py4j-test-server directory and enter sbt run on the command line.

To run the tests, change into the jhdfs4py base directory (the one containing this README). Install the dependencies with pip install -r requirements.txt, then launch the test suite by entering pytest in the console. All tests are expected to pass.
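For repeated runs, the same steps can be scripted. The sketch below is an illustration only; the names build_steps and run_tests are hypothetical. It invokes pip and pytest through the current interpreter (python -m ...), which differs slightly from calling the pytest executable directly because it also puts the working directory on sys.path.

```python
import os
import subprocess
import sys


def build_steps(python=sys.executable):
    """Return the commands to run, in order, from the base directory."""
    return [
        [python, "-m", "pip", "install", "-r", "requirements.txt"],
        [python, "-m", "pytest"],
    ]


def run_tests(base_dir, external_gateway=False):
    """Install the dependencies and run the test suite inside base_dir."""
    env = dict(os.environ)
    if external_gateway:
        # Needed when the gateway was started manually.
        env["USE_EXTERNAL_GATEWAY_SERVER"] = "1"
    for command in build_steps():
        # check=True stops at the first failing step.
        subprocess.run(command, cwd=base_dir, env=env, check=True)
```

Calling run_tests(".", external_gateway=True) from the base directory corresponds to the manual gateway setup.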


Source Distributions

No source distribution files available for this release.

Built Distribution

jhdfs4py-1.4.1-py3-none-any.whl (12.9 kB, Python 3)
