ec-storage replaces the three-replica storage strategy of HDFS with erasure coding (EC) in order to save storage space.
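To make the space saving concrete, the short sketch below compares the raw storage needed per byte of user data under three-way replication and under an erasure code with k data blocks and m check blocks. The values k=5 and m=3 are borrowed from the demo further down; the helper name `storage_overhead` is hypothetical and is not part of ec-storage.

```python
def storage_overhead(k: int, m: int) -> float:
    """Raw units stored per unit of user data for k data blocks plus m check blocks."""
    if k <= 0 or m < 0:
        raise ValueError("k must be positive and m non-negative")
    return (k + m) / k


REPLICATION_OVERHEAD = 3.0  # three full copies of every block

ec_overhead = storage_overhead(k=5, m=3)           # (5 + 3) / 5 = 1.6
savings = 1 - ec_overhead / REPLICATION_OVERHEAD   # fraction of raw space saved

print(f"EC overhead: {ec_overhead:.2f}x, replication: {REPLICATION_OVERHEAD:.2f}x")
print(f"space saved relative to three copies: {savings:.0%}")
```

With these parameters both schemes tolerate the loss of some blocks, but the erasure-coded layout stores 1.6 units per unit of data instead of 3.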

Project description

EC

EC storage (ec-storage)

file

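Source directory (src/ecstorage):

  • __init__.py
  • mathematics: mathematical functions
    • __init__.py
    • galois.py: Galois field arithmetic
    • generator_matrix.py: generator matrices
    • matrix_optimization.py: matrix optimization
  • list.py: computation on list data
  • rdd.py: computation on RDD data
  • dataframe.py: computation on dataframe data (in development)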
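For readers unfamiliar with the Galois field arithmetic mentioned for galois.py, here is a minimal sketch of multiplication in GF(2^8). It is an illustration only: the function name `gf256_mul`, the field size, and the reduction polynomial 0x11D are assumptions and are not taken from the ec-storage source.

```python
def gf256_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Multiply two GF(2^8) elements by shift-and-add, reducing modulo poly.

    Addition in GF(2^8) is XOR, so the partial products are combined with ^.
    The polynomial 0x11D is an assumed choice, not necessarily the library's.
    """
    result = 0
    while b:
        if b & 1:          # lowest bit of b set: add the current multiple of a
            result ^= a
        a <<= 1            # multiply a by x
        if a & 0x100:      # degree reached 8: reduce by the field polynomial
            a ^= poly
        b >>= 1
    return result


print(gf256_mul(2, 0x80))  # x * x^7 = x^8, which reduces to 0x1D = 29
```

Because every nonzero element has a multiplicative inverse in such a field, encoding and decoding can be done exactly on fixed-size bytes without overflow or rounding.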

install

pip install ec-storage

manual

Import the module

ec-storage provides interfaces for three data formats: list, rdd, and dataframe (in development). Choose the one that matches your data.

import ecstorage.list as ec          # list interface
# import ecstorage.rdd as ec         # RDD interface
# import ecstorage.dataframe as ec   # dataframe interface (in development)

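To keep the later code short, it is recommended to import whichever interface you choose (list, rdd, or dataframe) under the name ec, as in the code block above.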

Generate check blocks

check_block = ec.reedsolomon(sc,data,m,generator_matrix)

Here ec must be the interface chosen above. With the list interface, omit the sc argument.
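As a rough illustration of what generating check blocks with a Vandermonde generator matrix means, the sketch below computes m check values as linear combinations of the data. It is a simplified model under stated assumptions: it uses plain Python integers rather than Galois field arithmetic, the evaluation points 1, 2, ..., m are an arbitrary choice, and the helper names `vandermonde` and `encode` are hypothetical rather than part of ec-storage.

```python
def vandermonde(m: int, k: int) -> list:
    """m x k matrix whose row r holds the powers (r + 1)**0 ... (r + 1)**(k - 1)."""
    return [[(row + 1) ** col for col in range(k)] for row in range(m)]


def encode(data: list, m: int) -> list:
    """Return m check values, each the dot product of one matrix row with data."""
    matrix = vandermonde(m, len(data))
    return [sum(c * d for c, d in zip(row, data)) for row in matrix]


data = [1, 0, 0, 8, 6]
check_block = encode(data, m=3)
print(check_block)  # [15, 161, 703]
```

Each check value depends on every data element, which is what lets a lost element be recomputed later from the surviving ones.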

Recover data

recover_data = ec.verify(data,check_block)

recover_data is identical to the original data.
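The recovery step can be pictured as solving a small linear system: each surviving check value gives one equation in the missing data values. The sketch below is self-contained and makes several assumptions that are not taken from ec-storage: it works over exact rational numbers instead of a Galois field, uses evaluation points 1, 2, ..., m, and the names `vandermonde`, `encode`, `solve`, and `recover` are hypothetical.

```python
from fractions import Fraction


def vandermonde(m, k):
    """m x k matrix whose row r holds the powers (r + 1)**0 ... (r + 1)**(k - 1)."""
    return [[(row + 1) ** col for col in range(k)] for row in range(m)]


def encode(data, m):
    """Check values as dot products of the matrix rows with the data."""
    return [sum(c * d for c, d in zip(row, data)) for row in vandermonde(m, len(data))]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def recover(data, check_block):
    """Fill in the None entries of data using the surviving check values."""
    k, m = len(data), len(check_block)
    matrix = vandermonde(m, k)
    missing = [i for i, d in enumerate(data) if d is None]
    usable = [j for j, c in enumerate(check_block) if c is not None]
    if len(missing) > len(usable):
        raise ValueError("too many blocks lost to recover")
    rows = usable[:len(missing)]  # one equation per missing data value
    a = [[matrix[j][i] for i in missing] for j in rows]
    # Move the contribution of the known data values to the right-hand side.
    b = [check_block[j] - sum(matrix[j][i] * data[i] for i in range(k) if data[i] is not None)
         for j in rows]
    restored = list(data)
    for i, value in zip(missing, solve(a, b)):
        restored[i] = int(value)  # exact, because the original data were integers
    return restored


original = [1, 0, 0, 8, 6]
check_block = encode(original, m=3)
damaged_data = [None, None, 0, 8, 6]                  # two data values lost
damaged_check = [check_block[0], None, check_block[2]]  # one check value lost
restored = recover(damaged_data, damaged_check)
print(restored)  # [1, 0, 0, 8, 6]
```

With positive, distinct evaluation points every square submatrix of the Vandermonde matrix is invertible over the rationals, so the system always has a unique solution as long as no more than m values are lost in total.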

demo

list format

# For testing against a local source checkout
# import sys
# sys.path.append("/Users/caiwei/Documents/code/EC-dev/src")

# Import the module
import ecstorage.list as ec

m = 3                        # number of check blocks
generator_matrix = 'vander'  # use a Vandermonde generator matrix

data = [1, 0, 0, 8, 6]       # data in list format
k = len(data)                # number of data blocks

check_block = ec.reedsolomon(data, m, generator_matrix)  # generate check blocks (list format)

# Test (missing data blocks + missing check blocks <= m)
data[0] = None               # simulate missing data
data[1] = None
check_block[1] = None        # check blocks may be missing too
print(data)

# Recover the data
recover_data = ec.verify(data, check_block, generator_matrix)  # recover data (list format)
print(recover_data)          # [1, 0, 0, 8, 6]
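The demo relies on the condition that the number of missing data blocks plus the number of missing check blocks does not exceed m. A small guard like the one below could be run before attempting recovery. The function `can_recover` is a hypothetical helper, not part of ec-storage, and the check block values shown are placeholders.

```python
def can_recover(data: list, check_block: list, m: int) -> bool:
    """True when missing data blocks plus missing check blocks is at most m."""
    lost = sum(v is None for v in data) + sum(v is None for v in check_block)
    return lost <= m


m = 3
data = [None, None, 0, 8, 6]      # two data blocks missing
check_block = ["c0", None, "c2"]  # placeholder check blocks, one missing

print(can_recover(data, check_block, m))                # 3 losses: within the limit
print(can_recover([None] * 3 + [8, 6], check_block, m)) # 4 losses: over the limit
```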

RDD format

# # For testing against a local source checkout
# import sys
# sys.path.append("/Users/caiwei/Documents/code/EC-dev/src")

# # Pin the Python interpreter (when several Python versions are installed)
# import os
# os.environ["PYSPARK_PYTHON"]="/Users/caiwei/opt/anaconda3/bin/python"
# os.environ["PYSPARK_DRIVER_PYTHON"]="/Users/caiwei/opt/anaconda3/bin/python"

# Import the required modules
import ecstorage.rdd as ec
from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import *
from pyspark.sql import SparkSession
from ecstorage.mathematics.matrix_optimization import sparse
import numpy as np

m = 3                        # number of check blocks
generator_matrix = 'vander'  # use a Vandermonde generator matrix

# Create the Spark context and session
sc = SparkContext()
spark = SparkSession(sc)

# Data
data = np.arange(1,6,1)
data = sc.parallelize(data)  # convert the data to an RDD

# Generate check blocks
check_block = ec.reedsolomon(sc,data,m,generator_matrix)

# Test (missing data blocks + missing check blocks <= m)
data = list(np.arange(1,6,1))
data[0] = None          # simulate missing data (total losses must not exceed m)
data[1] = None
# data[2] = None


# Check blocks may be missing too
check_block = check_block.collect()
check_block[0] = None
check_block = sc.parallelize(check_block)

# Recover the data
recover_data = ec.verify(sc,data,check_block,generator_matrix)
print(recover_data.collect())

dataframe format (in development)

# For testing against a local source checkout
import sys
sys.path.append("/Users/caiwei/Documents/code/EC-dev/src")

# Import the modules
import ecstorage.dataframe as ec
from pyspark import SparkContext
from pyspark.mllib.linalg.distributed import *
from pyspark.sql import SparkSession
from ecstorage.mathematics.matrix_optimization import sparse
import numpy as np
import os
os.environ["PYSPARK_PYTHON"]="/Users/caiwei/opt/anaconda3/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"]="/Users/caiwei/opt/anaconda3/bin/python"

from pyspark.sql import SQLContext
m = 3                        # number of check blocks to generate
generator_matrix = 'vander'  # use a Vandermonde generator matrix

sc = SparkContext()
sqlContext = SQLContext(sc)

dicts = [
        {'col1':'a', 'col2':1},
        {'col1':'b', 'col2':2},
        {'col1':'b', 'col2':3},
        {'col1':'b', 'col2':4},
        {'col1':'b', 'col2':5},
         ]
df = sqlContext.createDataFrame(dicts)
data = df.select('col2')
# data.show()

check_block = ec.reedsolomon(sc,data,m)
check_block.show()


# Test: rebuild the dataframe with two missing values
dicts = [
        {'col1':'a', 'col2':None},
        {'col1':'b', 'col2':None},
        {'col1':'b', 'col2':3},
        {'col1':'b', 'col2':4},
        {'col1':'b', 'col2':5},
         ]
# # data[2] = None
# check_block = check_block.collect()
# check_block[0] = None
# check_block = sc.parallelize(check_block)
data = sqlContext.createDataFrame(dicts)
data = data.select('col2')
data.show()
# Recover the data
recover_data = ec.verify(sc,data,check_block,generator_matrix)  
recover_data.show()

Download files

Source distribution: ec-storage-1.0.0.tar.gz (8.2 kB)

Built distribution: ec_storage-1.0.0-py3-none-any.whl (10.5 kB, Python 3)
