Convenient HDFS access using the Java HDFS client

Project description

JHDFS4PY

Convenient HDFS access using the Java HDFS client in Python.

Installation

pip install jhdfs4py
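
After installation, a quick way to check that the package is importable is to import the client class used throughout this README. This is a minimal sketch; it does nothing jhdfs4py-specific beyond the import itself.

# Smoke test: this import should succeed after `pip install jhdfs4py`.
from jhdfs4py import HdfsFs

print(HdfsFs)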

Usage

Please also see the provided demo notebook at ./docs/demo.ipynb.

Create a file file.txt with the text content "Hello World".

from pyspark.sql import SparkSession
from jhdfs4py import HdfsFs

# Create the HDFS client from the active Spark session.
spark = SparkSession.builder.appName("my-app").getOrCreate()
hdfs = HdfsFs.from_spark_session(spark)

# Write the text content to the given HDFS path.
hdfs.write_string(
    path="/my/path/file.txt",
    data="Hello World",
)

Create a file other_file.blop with the byte content "some data".

from pyspark.sql import SparkSession
from jhdfs4py import HdfsFs

spark = SparkSession.builder.appName("my-app").getOrCreate()
hdfs = HdfsFs.from_spark_session(spark)

# Write raw bytes to the given HDFS path.
hdfs.write_bytes(
    path="/my/path/other_file.blop",
    data=b"some data",
)
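
These two methods also cover structured data: serialize it in Python first, then pass the result to write_string or write_bytes. The sketch below reuses only the calls shown above; the record and the path are made up for illustration.

import json

from pyspark.sql import SparkSession
from jhdfs4py import HdfsFs

spark = SparkSession.builder.appName("my-app").getOrCreate()
hdfs = HdfsFs.from_spark_session(spark)

# Serialize a small record to JSON and store it as a text file on HDFS.
record = {"id": 42, "name": "example"}  # illustrative payload
hdfs.write_string(
    path="/my/path/record.json",  # illustrative path
    data=json.dumps(record),
)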

Contribute

Report issues, submit pull requests, or contact us by email (opensource@verpackungsregister.org).

Test Suite

Hadoop Windows Setup

On Windows, a few extra steps are required before running the tests (a Python sketch of the same setup follows the list):

  1. Clone https://github.com/steveloughran/winutils into a directory of your choice
  2. Define the environment variable HADOOP_HOME and set it to X:\path\to\winutils\hadoop-3.0.0
  3. Append %HADOOP_HOME%\bin to the PATH environment variable
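
If you prefer to keep this configuration in code, both variables can also be set from Python before the Spark session is created, since the JVM inherits the driver's environment. This is a minimal sketch, not part of jhdfs4py itself; the winutils path is an assumption and must match wherever you cloned the repository in step 1.

import os

# Assumption: winutils was cloned to C:\tools\winutils (adjust to your clone from step 1).
hadoop_home = r"C:\tools\winutils\hadoop-3.0.0"

# Equivalent of steps 2 and 3; must happen before the Spark/Hadoop JVM starts.
os.environ["HADOOP_HOME"] = hadoop_home
os.environ["PATH"] = os.path.join(hadoop_home, "bin") + os.pathsep + os.environ["PATH"]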

General Setup

The library comes with an extensive pytest test suite that depends on a Py4J gateway running locally, which provides the actual HDFS implementation. The test gateway is located in the tests/py4j-test-server folder and is implemented by the org.zsvr.py4j.test.TestGatewayServer class. On POSIX systems, the test suite will try to start the gateway automatically unless the USE_EXTERNAL_GATEWAY_SERVER environment variable is set. Use the SBT_SCRIPT environment variable to tell the test suite where your SBT startup script is located.
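
For example, both variables can be set explicitly when launching the test suite. The sketch below assumes a POSIX system and an SBT launcher at /usr/local/bin/sbt; adjust both to your environment.

import os
import subprocess

env = dict(os.environ)
env["SBT_SCRIPT"] = "/usr/local/bin/sbt"  # assumption: path to your SBT startup script
# env["USE_EXTERNAL_GATEWAY_SERVER"] = "1"  # set only if you start the gateway yourself

# Run the pytest suite from the jhdfs4py base directory with the variables above.
subprocess.run(["pytest"], check=True, env=env)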

To start the gateway manually (don't forget to set USE_EXTERNAL_GATEWAY_SERVER=1 in this case), either import py4j-test-server into your favourite IDE as an SBT project, or use SBT directly from the command line: change to the tests/py4j-test-server directory and enter sbt run.

To finally run the tests, change to the jhdfs4py base directory (the one containing this README), make sure all dependencies are met by typing pip install -r requirements.txt, and launch the test suite by entering pytest into the console. All tests are expected to pass.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

jhdfs4py-1.4.1-py3-none-any.whl (12.9 kB, Python 3)

File details

Details for the file jhdfs4py-1.4.1-py3-none-any.whl.

File metadata

  • Download URL: jhdfs4py-1.4.1-py3-none-any.whl
  • Size: 12.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.10.4

File hashes

Hashes for jhdfs4py-1.4.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       28f58d780330e1c6d5226584646a0107691056c2c64c08a6987a7a21df89c41c
MD5          794f07e407f81b3100e35c200170b601
BLAKE2b-256  cdc5b5a8ce025f12a603860f0fb4ebfaec4a49046b9811273fc287af1b0c22e7

