Load table files and parse them into Parquet or PyArrow format.

Project description

Code structure

ReadAndSplit_Chunk.py

Sets the input file path, the output path, the output format (optional), and the chunk size.

FormParser class
convert_to_output_format: converts the data to the output format
def processing_single_file(input_file_path : str, output_dir_path : str, output_format : str, max_size : int) -> list[dict]

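Parameters: path of the folder containing the files to convert, output path, output format (optional, default parquet), chunk size (optional, default 1 GB)
Returns: standardized sciendb info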

./huggingface

my_huggingface_descriptive_statistics

Generates a descriptive statistics file from a parquet file

DescriptiveStatisticsGenerator
generate_and_save_json()

./utils

my_pandas_type_confirming_utils

Type inference for string-typed data

MyPandasTypeConfirmingUtils
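
To make the idea of inferring a type from string data concrete, here is a minimal sketch using only the standard library. The function name `infer_string_column_type` is hypothetical and is not the API of MyPandasTypeConfirmingUtils; the returned labels are borrowed from the statics.json type list in this README, and falling back to "string_text" is an assumption.

```python
from datetime import datetime


def infer_string_column_type(values):
    """Guess a type label for a column whose cells are all strings (illustrative only)."""
    # Ignore missing and blank cells; they carry no type information.
    cells = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not cells:
        return "unknown"

    def all_parse(parser):
        # True when every cell is accepted by the given parser.
        try:
            for cell in cells:
                parser(cell)
        except ValueError:
            return False
        return True

    if all(cell.lower() in {"true", "false"} for cell in cells):
        return "bool"
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    # Assumption: anything else is treated as free text.
    return "string_text"
```

The checks run from the strictest type to the loosest, so a column of whole numbers is reported as "int" rather than "float".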

Invaild_Chars

Filters illegal characters out of file names and path names
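
A sketch of what such filtering can look like follows. The character set (the characters Windows rejects in file names, plus control characters) and the name `sanitize_file_name` are assumptions for illustration; the actual list in Invaild_Chars may differ.

```python
import re

# Assumed set: characters Windows rejects in file names, plus the path separators.
INVALID_FILENAME_CHARS = '<>:"/\\|?*'
_INVALID_PATTERN = re.compile("[" + re.escape(INVALID_FILENAME_CHARS) + r"\x00-\x1f]")


def sanitize_file_name(name, replacement="_"):
    """Replace illegal characters and trim leading/trailing dots and spaces."""
    cleaned = _INVALID_PATTERN.sub(replacement, name)
    cleaned = cleaned.strip(" .")
    # Never return an empty name.
    return cleaned or "unnamed"
```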

split_file_name_format_checker

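DuckDB index creation follows this rule: when a large file has been split into chunks, only part00 is processed and the remaining parts are skipped; a file that was not split gets its DuckDB index created directly. This tool identifies which files need a DuckDB index.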

FormatChecker.check_index_generate_format()
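
The rule above can be sketched as a small predicate. The `_partNN` file name suffix and the function name `needs_duckdb_index` are assumptions made for this example; the naming scheme that FormatChecker actually recognizes is not documented here.

```python
import re
from pathlib import Path

# Assumed naming scheme for chunked output: "<stem>_part00.parquet", "<stem>_part01.parquet", ...
_PART_SUFFIX = re.compile(r"_part(\d+)$")


def needs_duckdb_index(file_path):
    """Return True for unsplit files and for the first chunk (part00) of a split file."""
    stem = Path(file_path).stem
    match = _PART_SUFFIX.search(stem)
    if match is None:
        return True  # the file was not split, so index it directly
    return int(match.group(1)) == 0  # only part00 of a split file is indexed
```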

header_detector.py

Utility class that detects header column names in CSV and similar files
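
One simple heuristic for header detection is sketched below: the first row is treated as a header when all of its cells are non-empty, unique, and non-numeric. This is an illustration only; the function name `detect_header` is hypothetical and header_detector.py may use a different strategy.

```python
import csv
import io


def _is_number(text):
    try:
        float(text)
    except ValueError:
        return False
    return True


def detect_header(csv_text, delimiter=","):
    """Return the first row as column names if it looks like a header, else None."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if not rows:
        return None
    first = [cell.strip() for cell in rows[0]]
    looks_like_names = (
        bool(first)
        and all(first)                       # no empty cells
        and len(set(first)) == len(first)    # names are unique
        and not any(_is_number(cell) for cell in first)
    )
    return first if looks_like_names else None
```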

csv_chunk_read_util

Utility class for reading CSV files in chunks; used to read large files and split them into chunks
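
Chunked reading keeps memory use bounded because only one batch of rows is held at a time. A minimal standard-library sketch, with the hypothetical name `iter_csv_chunks`, might look like this; with pandas, `read_csv(..., chunksize=n)` offers comparable iteration.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(handle, chunk_rows):
    """Yield (header, rows) pairs with at most chunk_rows data rows each."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    while True:
        rows = list(islice(reader, chunk_rows))
        if not rows:
            break
        yield header, rows


# Usage with an in-memory file; a real call would pass open(path, newline="").
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
chunks = list(iter_csv_chunks(sample, chunk_rows=2))
```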

./dataLoader

base_file_loader.py

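Base class for file reading; it declares the abstract method read_and_convert, which subclasses must implement

get_invalid_chars # returns the characters that are illegal in file names
_split_dataframe_by_size # splits a DataFrame by size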
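
The size-based split can be illustrated without pandas by working on plain rows. In this sketch the size of a row is estimated from the UTF-8 length of its cells, which is an assumption; the real _split_dataframe_by_size operates on a DataFrame and may measure size differently. The name `split_rows_by_size` is hypothetical.

```python
def split_rows_by_size(rows, max_bytes):
    """Group rows into chunks whose estimated size stays within max_bytes."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    chunks, current, current_size = [], [], 0
    for row in rows:
        # Rough size estimate: UTF-8 length of every cell rendered as text.
        row_size = sum(len(str(cell).encode("utf-8")) for cell in row)
        if current and current_size + row_size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        # A single oversized row still gets a chunk of its own.
        current.append(row)
        current_size += row_size
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk could then be written under a numbered name such as `f"{stem}_part{index:02d}"`, which would produce the part00 suffix mentioned earlier; that naming is likewise an assumption.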

file_loader_factory

Factory for the file loader classes; it creates the matching data loader for the file type to be read
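
The relationship between the abstract base class and the factory can be sketched as follows. All class names, the `create` method, and the Excel and HDF5 extension lists are assumptions for illustration; only the single-table extension list and the read_and_convert method name come from this README.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseFileLoader(ABC):
    extensions = ()

    @abstractmethod
    def read_and_convert(self, path):
        """Read the file at path and convert it; subclasses must implement this."""


class SingleTableLoader(BaseFileLoader):
    extensions = (".csv", ".sav", ".tsv", ".ods", ".parquet", ".tab")

    def read_and_convert(self, path):
        return f"single-table:{path}"  # placeholder for real reading logic


class ExcelLoader(BaseFileLoader):
    extensions = (".xls", ".xlsx")  # assumed extension list

    def read_and_convert(self, path):
        return f"excel:{path}"  # placeholder


class Hdf5Loader(BaseFileLoader):
    extensions = (".h5", ".hdf5")  # assumed extension list

    def read_and_convert(self, path):
        return f"hdf5:{path}"  # placeholder


class FileLoaderFactory:
    _loaders = (SingleTableLoader, ExcelLoader, Hdf5Loader)

    @classmethod
    def create(cls, path):
        """Pick a loader by file extension (case-insensitive)."""
        suffix = Path(path).suffix.lower()
        for loader_cls in cls._loaders:
            if suffix in loader_cls.extensions:
                return loader_cls()
        raise ValueError(f"unsupported file type: {suffix!r}")
```

Keeping the extension list on each loader class means a new file type only needs a new subclass and one entry in `_loaders`.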

excel_loader

Reading and conversion logic for Excel files

hdf5_loader

Reading and conversion of HDF5 files

single_table_loader

Conversion of '.csv', '.sav', '.tsv', '.ods', '.parquet', and '.tab' files

./dataWriter

base_writer_utils

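Base class for writing output files

It contains the shared methods

_analyze_all_schemas # analyzes the schemas of the chunks
auto_convert_column # safer type conversion logic
clean_dataframe # cleans a DataFrame
write_dataframe_in_chunks # writes the file in chunks
save_as_output_format_with_chunks # writes the output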

parquet_writer.py

Parquet output

pyarrow_writer.py

PyArrow output

./despatch

./duckDBUtils

my_parquet_fts_indexer_upgrade

Reads a parquet file, identifies the columns that hold string data, creates a DuckDB full-text index on them, and writes a DuckDB file

DuckDBFTSIndexer.create_fts_index()

In every DuckDB file the data table is always named data, and the generated full-text index is always named fts_main_data
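
The fixed names follow from how DuckDB's fts extension works: indexing a table named data in the main schema creates a schema named fts_main_data. The sketch below shows one way such a file could be built; it is not the implementation of DuckDBFTSIndexer, and the function names and the `row_id` column are assumptions. The duckdb package is imported lazily so the SQL builder can be inspected without it.

```python
def build_fts_sql(parquet_path, text_columns, id_column="row_id"):
    """Return the SQL statements that load a parquet file and index its text columns."""
    if not text_columns:
        raise ValueError("at least one text column is required")
    quoted_path = parquet_path.replace("'", "''")  # escape single quotes for SQL
    column_args = ", ".join(f"'{column}'" for column in text_columns)
    return [
        "INSTALL fts",
        "LOAD fts",
        # The table name is fixed to "data"; a synthetic row id serves as the document key.
        f"CREATE TABLE data AS SELECT row_number() OVER () AS {id_column}, * "
        f"FROM read_parquet('{quoted_path}')",
        # For table "data" in schema "main", this creates the schema fts_main_data.
        f"PRAGMA create_fts_index('data', '{id_column}', {column_args})",
    ]


def create_fts_database(db_path, parquet_path, text_columns):
    """Execute the statements against a DuckDB file (requires the duckdb package)."""
    import duckdb  # third-party dependency, imported only when actually used

    con = duckdb.connect(db_path)
    try:
        for statement in build_fts_sql(parquet_path, text_columns):
            con.execute(statement)
    finally:
        con.close()
```

Once the index exists, rows can be ranked with a query of the form `SELECT *, fts_main_data.match_bm25(row_id, 'search terms') AS score FROM data`, keeping the rows where score is not NULL.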

Types used in the statistics file statics.json

    "class_label",
    "float",
    "int",  
    "string_label",
    "string_text",
    "bool",
    "list",
    "datetime",
    "unknown"

Converting text files to Markdown

ProcessText2MD.py

Simply call

process_text_file(input_file_path, output_path)

directly. The currently supported file types are:

pdf
pptx
ppt
docx
doc
html
epub
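
Since only the listed types are supported, a caller may want to check the extension before converting. The helper names below are hypothetical, and `converter` stands in for process_text_file, whose import path is not given in this README.

```python
from pathlib import Path

# Extensions taken from the supported-types list above.
SUPPORTED_TEXT_EXTENSIONS = {".pdf", ".pptx", ".ppt", ".docx", ".doc", ".html", ".epub"}


def is_supported_text_file(path):
    """Case-insensitive check against the supported extensions."""
    return Path(path).suffix.lower() in SUPPORTED_TEXT_EXTENSIONS


def convert_if_supported(input_file_path, output_path, converter):
    """Call converter(input_file_path, output_path) only for supported files."""
    if not is_supported_text_file(input_file_path):
        return False
    converter(input_file_path, output_path)
    return True
```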

Dependencies that must be installed for marker:

conda install weasyprint -c conda-forge
conda install Pango -c conda-forge
conda install fontTools -c conda-forge
conda install -c conda-forge fontconfig
conda install -c conda-forge pillow
conda install -c conda-forge freetype libpng
conda install transformers -c conda-forge
conda install tensorflow -c conda-forge
# Core dependency
pip install marker-pdf[full]


table_file_parer-0.1.3-py3-none-any.whl (24.7 kB)


Hashes for table_file_parer-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 bd90b071a07539ad068bcef441fade107e6c2d2fb919a1d9db5506295b68c467
MD5 f2979c3284ece6ec9be7ff35a49ac271
BLAKE2b-256 6d5c35b02abcc285ccc7ed4ab4986a9bb6a1e18860d895bef1a0e1727b51f388
