Useful functions for working with databases in PySpark (PostgreSQL, ClickHouse)
# pyspark_db_utils
It helps you work with databases in Spark.
## Documentation
http://pyspark-db-utils.readthedocs.io/en/latest/
## Usage example
You need JDBC drivers to use this library! Download the drivers from
https://jdbc.postgresql.org/download.html
https://github.com/yandex/clickhouse-jdbc
and put them in the jars/ directory of your project.
### Example settings
```python
settings = {
    "PG_PROPERTIES": {
        "user": "user",
        "password": "pass",
        "driver": "org.postgresql.Driver"
    },
    "PG_DRIVER_PATH": "jars/postgresql-42.1.4.jar",
    "PG_URL": "jdbc:postgresql://db.olabs.com/dbname",
}
```
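These settings are ordinary keyword material for Spark's JDBC reader and writer. As a rough illustration (the helper name and the exact mapping are assumptions for this sketch, not part of this library's API), the dict can be unpacked into the arguments that `spark.read.jdbc` / `DataFrameWriter.jdbc` expect:

```python
def jdbc_kwargs(settings: dict, table: str) -> dict:
    """Map the settings dict above onto the url/table/properties
    keyword arguments of spark.read.jdbc.
    Hypothetical helper for illustration only."""
    return {
        "url": settings["PG_URL"],
        "table": table,
        "properties": settings["PG_PROPERTIES"],
    }

settings = {
    "PG_PROPERTIES": {
        "user": "user",
        "password": "pass",
        "driver": "org.postgresql.Driver",
    },
    "PG_DRIVER_PATH": "jars/postgresql-42.1.4.jar",
    "PG_URL": "jdbc:postgresql://db.olabs.com/dbname",
}

kwargs = jdbc_kwargs(settings, "some_table")
# With a running SparkSession this would become:
#   df = spark.read.jdbc(**kwargs)
```

Note that `PG_DRIVER_PATH` is not passed to the reader itself; the jar has to be on Spark's classpath (e.g. via `spark.jars`) so that `org.postgresql.Driver` can be loaded.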
### Code example
See example.py.
### Example run
```
vsmelov@vsmelov:~/PycharmProjects/pyspark_db_utils$ mkdir jars
vsmelov@vsmelov:~/PycharmProjects/pyspark_db_utils$ cp /var/bigdata/spark-2.2.0-bin-hadoop2.7/jars/postgresql-42.1.4.jar ./jars/
vsmelov@vsmelov:~/PycharmProjects/pyspark_db_utils$ python3 pyspark_db_utils/example.py
host: ***SECRET***
db: ***SECRET***
user: ***SECRET***
password: ***SECRET***
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/03/05 11:43:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/05 11:43:29 WARN Utils: Your hostname, vsmelov resolves to a loopback address: 127.0.1.1; using 192.168.43.26 instead (on interface wlp2s0)
18/03/05 11:43:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
TRY: create df
OK: create df
+---+-----------+
| id| mono_id|
+---+-----------+
| 1| 0|
| 2| 1|
| 3| 2|
| 4| 3|
| 5| 8589934592|
| 6| 8589934593|
| 7| 8589934594|
| 8| 8589934595|
| 9| 8589934596|
| 10|17179869184|
| 11|17179869185|
| 12|17179869186|
| 13|17179869187|
| 14|17179869188|
| 15|25769803776|
| 16|25769803777|
| 17|25769803778|
| 18|25769803779|
| 19|25769803780|
+---+-----------+
TRY: write_to_pg
OK: write_to_pg
TRY: read_from_pg
OK: read_from_pg
+---+-----------+
| id| mono_id|
+---+-----------+
| 10|17179869184|
| 11|17179869185|
| 12|17179869186|
| 13|17179869187|
| 14|17179869188|
| 1| 0|
| 2| 1|
| 3| 2|
| 4| 3|
| 5| 8589934592|
| 6| 8589934593|
| 7| 8589934594|
| 8| 8589934595|
| 9| 8589934596|
| 15|25769803776|
| 16|25769803777|
| 17|25769803778|
| 18|25769803779|
| 19|25769803780|
| 1| 0|
+---+-----------+
only showing top 20 rows
```
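The large gaps in `mono_id` are expected: Spark's `monotonically_increasing_id()` puts the partition index in the upper bits and the record number within the partition in the lower 33 bits, so the IDs are unique and increasing but not consecutive. A quick pure-Python check against the values in the run above:

```python
def mono_id(partition_id: int, row_in_partition: int) -> int:
    """Reproduce the value layout of Spark's monotonically_increasing_id():
    partition index in the upper bits, record number within the
    partition in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition

# Values from the output above: the 19 rows were spread over 4 partitions.
print(mono_id(1, 0))  # 8589934592  (first row of partition 1)
print(mono_id(2, 4))  # 17179869188 (fifth row of partition 2)
print(mono_id(3, 0))  # 25769803776 (first row of partition 3)
```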