Convenient HDFS access using the Java HDFS client
JHDFS4PY
Convenient HDFS access using the Java HDFS client in Python.
Installation
pip install jhdfs4py
Usage
Please also see the provided demo notebook at ./docs/demo.ipynb.
Create a file file.txt with the text content "Hello World".
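from pyspark.sql import SparkSession
from jhdfs4py import HdfsFs

# Reuse an existing Spark session or create a new one.
spark = SparkSession.builder.appName("my-app").getOrCreate()
# Build the HDFS client from the Spark session.
hdfs = HdfsFs.from_spark_session(spark)

# Write the text to the given HDFS path.
hdfs.write_string(
    path="/my/path/file.txt",
    data="Hello World",
)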
Create a file other_file.blop with the byte content "some data".
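from pyspark.sql import SparkSession
from jhdfs4py import HdfsFs

# Reuse an existing Spark session or create a new one.
spark = SparkSession.builder.appName("my-app").getOrCreate()
# Build the HDFS client from the Spark session.
hdfs = HdfsFs.from_spark_session(spark)

# Write the raw bytes to the given HDFS path.
hdfs.write_bytes(
    path="/my/path/other_file.blop",
    data=b"some data",
)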
Contribute
Report issues, submit Pull Requests, or contact us via mail (opensource@verpackungsregister.org).
Test Suite
Hadoop Windows Setup
On Windows, a few extra steps are required before running the tests:
- Clone https://github.com/steveloughran/winutils into a directory of your choice
- Define the environment variable HADOOP_HOME and set it to X:\path\to\winutils\hadoop-3.0.0
- Append %HADOOP_HOME%\bin to the PATH environment variable
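If you prefer not to change the system-wide settings, the two environment variables can also be prepared for a single process. The following is only a sketch: the helper name hadoop_env is hypothetical and not part of jhdfs4py, and it assumes the hadoop-3.0.0 folder of the winutils clone, as in the steps above.

```python
from pathlib import PureWindowsPath


def hadoop_env(winutils_root, base_env):
    """Return a copy of base_env with HADOOP_HOME set and its bin folder on PATH.

    Hypothetical helper for illustration; not part of jhdfs4py.
    """
    env = dict(base_env)
    hadoop_home = PureWindowsPath(winutils_root) / "hadoop-3.0.0"
    env["HADOOP_HOME"] = str(hadoop_home)

    # Windows separates PATH entries with ';'. Append the bin folder only once.
    bin_dir = str(hadoop_home / "bin")
    entries = [entry for entry in env.get("PATH", "").split(";") if entry]
    if bin_dir not in entries:
        entries.append(bin_dir)
    env["PATH"] = ";".join(entries)
    return env


# Example (on Windows), before the tests are started from the same process:
#   import os
#   os.environ.update(hadoop_env(r"X:\path\to\winutils", os.environ))
```

Settings applied this way are visible only to the current Python process and the processes it starts, so a permanent setup still needs the steps listed above.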
General Setup
The library comes with an extensive pytest test suite that depends on a locally running Py4J gateway, which provides the actual HDFS implementation. The testing gateway is located in the tests/py4j-test-server folder and is implemented by the org.zsvr.py4j.test.TestGatewayServer class. The test suite will try to start the gateway automatically on POSIX systems unless the USE_EXTERNAL_GATEWAY_SERVER environment variable is set. Use the SBT_SCRIPT environment variable to tell the test suite where your SBT startup script is located.
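The interplay of the two environment variables can be pictured roughly as follows. This is a sketch of the behaviour described above, not the test suite's actual code: both function names are hypothetical, "is set" is interpreted as "present in the environment", and the fallback to a plain sbt on the PATH is an assumption.

```python
import os


def should_autostart_gateway(env, os_name=os.name):
    """Auto-start only on POSIX systems and only if no external gateway is requested."""
    return os_name == "posix" and "USE_EXTERNAL_GATEWAY_SERVER" not in env


def sbt_command(env, default="sbt"):
    """Build the SBT call, preferring the script named by SBT_SCRIPT.

    The default of a plain "sbt" is an assumption made for this sketch.
    """
    return [env.get("SBT_SCRIPT", default), "run"]


# Example: inspect what would happen in the current environment.
#   print(should_autostart_gateway(os.environ), sbt_command(os.environ))
```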
To start the gateway manually (don't forget to set USE_EXTERNAL_GATEWAY_SERVER=1 in this case), you can either use your favourite IDE after importing py4j-test-server as an SBT project, or use SBT directly from the command line by entering sbt run after changing to the tests/py4j-test-server directory.
Finally, to run the tests, change into the jhdfs4py base directory (the one containing this README). Make sure all dependencies are met by typing pip install -r requirements.txt, and then launch the test suite by entering pytest into the console. All tests are expected to pass.
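For a manually started gateway, the last two steps could be scripted along these lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, the gateway is assumed to be running already (for example via sbt run in another terminal), and pip and pytest are invoked through the current interpreter with -m rather than as bare commands.

```python
import subprocess
import sys


def build_test_steps(base_env):
    """Return (argv, env) pairs for installing dependencies and running pytest."""
    env = dict(base_env)
    # Tell the suite that the gateway is managed externally.
    env["USE_EXTERNAL_GATEWAY_SERVER"] = "1"
    python = sys.executable
    return [
        ([python, "-m", "pip", "install", "-r", "requirements.txt"], env),
        ([python, "-m", "pytest"], env),
    ]


def run_tests(repo_root, base_env):
    """Run each step from the base directory and stop at the first failure."""
    for argv, env in build_test_steps(base_env):
        subprocess.run(argv, cwd=repo_root, env=env, check=True)


# Example, from the base directory with the gateway already running:
#   import os
#   run_tests(".", os.environ)
```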