Load table files and parse them into Parquet or PyArrow format.

Code structure

ReadAndSplit_Chunk.py

Sets the input file path, the output path, the output format (optional), and the chunk size.

FormParser class
convert_to_output_format: converts the input to the output format
def processing_single_file(input_file_path : str, output_dir_path : str, output_format : str, max_size : int) -> list[dict]

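Parameters: path of the folder containing the files to convert, output path, output format (optional, default parquet), chunk size (optional, default 1 GB)
Returns: standardized sciendb info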

./huggingface

my_huggingface_descriptive_statistics

Generates a descriptive statistics file from a parquet file.

DescriptiveStatisticsGenerator
generate_and_save_json()

./utils

my_pandas_type_confirming_utils

Type inference for string-typed data.

MyPandasTypeConfirmingUtils
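
To illustrate what inferring a type from string data can look like, here is a minimal sketch using only the standard library. It is not the code of MyPandasTypeConfirmingUtils: the function name `infer_string_type`, the order of the checks, and the accepted boolean spellings are all assumptions. The returned labels are borrowed from the statics.json type list in this document.

```python
from datetime import datetime

# Hypothetical boolean spellings; the real utility may accept others.
BOOL_WORDS = {"true", "false"}


def _all_parse(values, parser):
    """Return True if parser accepts every value without raising ValueError."""
    try:
        for v in values:
            parser(v)
    except ValueError:
        return False
    return True


def infer_string_type(values):
    """Guess a type label for a list of strings (illustrative only)."""
    # Drop missing and blank entries before testing the rest.
    cleaned = [v.strip() for v in values if v is not None and v.strip()]
    if not cleaned:
        return "unknown"
    if all(v.lower() in BOOL_WORDS for v in cleaned):
        return "bool"
    if _all_parse(cleaned, int):
        return "int"
    if _all_parse(cleaned, float):
        return "float"
    if _all_parse(cleaned, datetime.fromisoformat):
        return "datetime"
    # Assumption: everything else is treated as free text.
    return "string_text"
```

The checks run from the strictest parser to the loosest, so a column of integers is not reported as float. Distinguishing string_label from string_text is left out because the source does not say how the package draws that line.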

Invaild_Chars

Filters illegal characters out of file names and path names.
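
A minimal sketch of this kind of filtering, assuming the illegal set is the Windows reserved characters plus ASCII control characters. The package's own character list is not shown in this README, and `sanitize_name` is a hypothetical helper name.

```python
import re

# Assumption: characters rejected by common file systems (Windows reserved
# characters plus ASCII control characters). The package's own list may differ.
ILLEGAL_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_name(name, replacement="_"):
    """Replace every illegal character in a single file or directory name."""
    cleaned = ILLEGAL_CHARS.sub(replacement, name)
    # Names ending in a dot or space are also rejected on Windows.
    return cleaned.rstrip(". ") or replacement
```

The function treats its argument as one path component, so a slash is replaced rather than kept as a separator.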

split_file_name_format_checker

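DuckDB index creation follows this rule: when a large file has been split into chunks, only part00 is processed and the remaining parts are skipped; a file that was not split gets its DuckDB index created directly. This tool identifies which files need a DuckDB index.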

FormatChecker.check_index_generate_format()
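
The rule above can be sketched as a small predicate. This is not the body of FormatChecker.check_index_generate_format(): the `_partNN` suffix convention and the function name `needs_duckdb_index` are assumptions made for illustration.

```python
import re

# Assumption: chunked outputs end in "_partNN" before the extension,
# e.g. "sales_part00.parquet". The real naming scheme may differ.
PART_SUFFIX = re.compile(r"_part(\d+)$")


def needs_duckdb_index(file_name):
    """Return True if an index should be created for this file."""
    stem = file_name.rsplit(".", 1)[0]
    match = PART_SUFFIX.search(stem)
    if match is None:
        return True  # file was not split: index it directly
    return int(match.group(1)) == 0  # split file: only part00 is indexed
```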

header_detector.py

Utility class that detects header column names in CSV and similar files.
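
One simple way to detect a header row is sketched below. The heuristic (the first row is all non-numeric text and the second row contains at least one number) is an assumption chosen for brevity; header_detector.py may use a different strategy, and `detect_header` is a hypothetical name.

```python
import csv
import io


def _is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def detect_header(sample, delimiter=","):
    """Return the column names if the first row looks like a header, else None."""
    rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
    if len(rows) < 2:
        return None  # not enough rows to compare
    first, second = rows[0], rows[1]
    first_is_text = bool(first) and all(
        cell.strip() and not _is_number(cell) for cell in first
    )
    second_has_number = any(_is_number(cell) for cell in second)
    return first if first_is_text and second_has_number else None
```

This heuristic returns None for a table whose columns are all text, so a production detector would need additional signals.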

csv_chunk_read_util

Utility class for reading CSV in chunks, used to read and split large files.
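
The idea of chunked reading can be sketched with the standard library alone. This is a stand-in, not the csv_chunk_read_util implementation: it chunks by row count rather than by size in bytes, and `read_csv_in_chunks` is a hypothetical name.

```python
import csv
import io
from itertools import islice


def read_csv_in_chunks(handle, chunk_rows):
    """Yield (header, rows) pairs with at most chunk_rows data rows each."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    while True:
        # islice pulls only the next chunk_rows rows, so the whole file
        # is never held in memory at once.
        rows = list(islice(reader, chunk_rows))
        if not rows:
            break
        yield header, rows


# Usage with an in-memory file; a real call would pass open(path, newline="").
sample = io.StringIO("id,value\n1,a\n2,b\n3,c\n")
chunks = list(read_csv_in_chunks(sample, chunk_rows=2))
```

Each chunk carries the header so it can be written out as a standalone file.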

./dataLoader

base_file_loader.py

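Base class for file readers. It declares the abstract method read_and_convert, which subclasses must implement.

get_invalid_chars # returns the characters that are illegal in file names
_split_dataframe_by_size # splits a DataFrame by size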

file_loader_factory

Factory for the file reader classes: creates the data loader that matches the requested file type.
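
The relationship between the base class and the factory can be sketched as follows. Only the method name read_and_convert comes from this README; the class names, the extension mapping, and the `create_loader` function are hypothetical, and the loaders return placeholder strings instead of parsed data.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseFileLoader(ABC):
    """Hypothetical stand-in for the base class in base_file_loader.py."""

    @abstractmethod
    def read_and_convert(self, path):
        """Read the file at path and return its converted content."""


class CsvLoader(BaseFileLoader):
    def read_and_convert(self, path):
        return f"csv:{path}"  # placeholder for real parsing


class ExcelLoader(BaseFileLoader):
    def read_and_convert(self, path):
        return f"excel:{path}"  # placeholder for real parsing


# Assumed mapping from file extension to loader class.
LOADERS = {".csv": CsvLoader, ".tsv": CsvLoader, ".xlsx": ExcelLoader}


def create_loader(path):
    """Return a loader instance chosen by the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADERS[suffix]()
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
```

With this layout, supporting a new format means adding one subclass and one entry in the mapping; callers keep using `create_loader(path).read_and_convert(path)`.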

excel_loader

Read and convert logic for Excel files.

hdf5_loader

Read and convert logic for HDF5 files.

single_table_loader

Conversion for the '.csv', '.sav', '.tsv', '.ods', '.parquet', and '.tab' file types.

./dataWriter

base_writer_utils

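Base class for writing output files.

It contains the shared methods:

_analyze_all_schemas # analyzes the schema of each chunk
auto_convert_column # safer type conversion logic
clean_dataframe # cleans the DataFrame
write_dataframe_in_chunks # writes the file in chunks
save_as_output_format_with_chunks # writes the output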

parquet_writer.py

Parquet output.

pyarrow_writer.py

PyArrow output.

./despatch

./duckDBUtils

my_parquet_fts_indexer_upgrade

Reads a parquet file, identifies the string-typed columns, creates a DuckDB full-text index on them, and writes a DuckDB file.

DuckDBFTSIndexer.create_fts_index()

In every DuckDB file the data table is always named data, and the generated full-text index is always named fts_main_data.

Types used in the statistics file statics.json

    "class_label",
    "float",
    "int",  
    "string_label",
    "string_text",
    "bool",
    "list",
    "datetime",
    "unknown"

Converting text documents to Markdown

ProcessText2MD.py

Call the function directly:

process_text_file(input_file_path, output_path)

Currently supported file types:

pdf
pptx
ppt
docx
doc
html
epub

Dependencies to install for marker, which this module uses:

conda install weasyprint -c conda-forge
conda install Pango -c conda-forge
conda install fontTools -c conda-forge
conda install -c conda-forge fontconfig
conda install -c conda-forge pillow
conda install -c conda-forge freetype libpng
conda install transformers -c conda-forge
conda install tensorflow -c conda-forge
# core dependency
pip install "marker-pdf[full]"

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

table_file_parser-0.1.5-py3-none-any.whl (63.9 kB)

Hashes for table_file_parser-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 7138a86260c7019b045cc2f95dd0a46eb290f763e4905f35b01c9ffce9da2b7a
MD5 e9324ab09392fc1f7626a7f0bcb43dcc
BLAKE2b-256 1f0c936c33a359a84c0214c206d466c284ee1ff1c0c0b51b558802a9dc7afa07
