Load table files and parse them into parquet or pyarrow format.

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Code structure

ReadAndSplit_Chunk.py

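Sets the input file path, the output path, the output format (optional), and the chunk size.

FormParser class
convert_to_output_format: converts the data to the output format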

def processing_single_file(input_file_path : str, output_dir_path : str, output_format : str, max_size : int) -> list[dict]

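Parameters: path of the folder containing the files to convert, output path, output format (optional, defaults to parquet), chunk size (optional, defaults to 1 GB)
Returns: standardized sciendb info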

./huggingface

my_huggingface_descriptive_statistics

Generates a descriptive statistics file from a parquet file.

DescriptiveStatisticsGenerator
generate_and_save_json()

./utils

my_pandas_type_confirming_utils

Type inference for string-typed data.

MyPandasTypeConfirmingUtils
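
As a rough illustration of inferring a type from string-typed data, the sketch below tries progressively looser parsers on every value of a column. The name infer_string_column_type and the order of the checks are assumptions made for this example; they are not taken from MyPandasTypeConfirmingUtils.

```python
from datetime import datetime


def _all_parse(values, parser):
    """Return True if every value can be parsed by `parser`."""
    try:
        for value in values:
            parser(value)
    except ValueError:
        return False
    return True


def infer_string_column_type(values):
    """Guess a type name for a list of strings (hypothetical helper)."""
    # Ignore missing and blank entries before testing the parsers.
    cleaned = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not cleaned:
        return "unknown"
    if all(v.lower() in ("true", "false") for v in cleaned):
        return "bool"
    if _all_parse(cleaned, int):
        return "int"
    if _all_parse(cleaned, float):
        return "float"
    if _all_parse(cleaned, datetime.fromisoformat):
        return "datetime"
    return "string"
```

A column is only promoted to a narrower type when every non-empty value parses, so a single stray value keeps the column as a string.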

Invaild_Chars

Filters illegal characters out of file names and path names.
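
A minimal sketch of this kind of filtering is shown below. The character set (the characters Windows rejects in file names, plus control characters) and the name sanitize_file_name are assumptions for illustration; the actual list kept in Invaild_Chars may differ.

```python
import re

# Assumption: characters Windows rejects in file names, plus control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_file_name(name, replacement="_"):
    """Replace illegal characters and trim trailing dots and spaces."""
    cleaned = INVALID_CHARS.sub(replacement, name)
    cleaned = cleaned.rstrip(". ")
    # Never return an empty name.
    return cleaned or replacement
```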

split_file_name_format_checker

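DuckDB index creation follows this rule: when a large file has been split into chunks, only part00 is processed and the remaining parts are skipped; a file that was not split gets its DuckDB index created directly. This tool identifies which files need a DuckDB index.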

FormatChecker.check_index_generate_format()
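
The rule above can be sketched as a small predicate. The file naming pattern (a stem ending in "part" followed by digits, such as sales_part00.parquet) and the name needs_duckdb_index are assumptions; the source only states that part00 is the chunk that gets indexed.

```python
import re
from pathlib import Path

# Assumption: chunked outputs end in "part" plus a number, e.g. "sales_part00.parquet".
PART_SUFFIX = re.compile(r"part(\d+)$")


def needs_duckdb_index(file_name):
    """Return True for unsplit files and for the first chunk (part00) only."""
    stem = Path(file_name).stem
    match = PART_SUFFIX.search(stem)
    if match is None:
        return True  # the file was not split: index it directly
    return int(match.group(1)) == 0  # split file: only part00 is indexed
```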

header_detector.py

Utility class for detecting the header column names of CSV and similar files.
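
One way to approach header detection, shown here only as a sketch, is the heuristic in the standard library's csv.Sniffer. This is not necessarily what header_detector.py does; the function name detect_header and the generated col_N fallback names are assumptions.

```python
import csv
import io


def detect_header(sample_text):
    """Return (has_header, column_names) for a CSV text sample."""
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample_text)          # guess delimiter and quoting
    has_header = sniffer.has_header(sample_text)  # heuristic, not a guarantee
    first_row = next(csv.reader(io.StringIO(sample_text), dialect))
    if has_header:
        return True, first_row
    # No header found: fall back to generated names (naming is an assumption).
    return False, [f"col_{i}" for i in range(len(first_row))]
```

The heuristic compares the first row against the types and lengths of the rows below it, so it works best on samples that contain at least one numeric column.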

csv_chunk_read_util

Utility class for reading CSV files in chunks; used to read and split large files.
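
The idea of chunked reading can be illustrated with the standard library alone. The sketch below is not the package's implementation, and the name read_csv_in_chunks is hypothetical; pandas offers a comparable mechanism through the chunksize parameter of read_csv.

```python
import csv
import io
from itertools import islice


def read_csv_in_chunks(text_stream, chunk_rows):
    """Yield lists of row dicts, each list holding at most `chunk_rows` rows."""
    reader = csv.DictReader(text_stream)
    while True:
        # islice pulls only the next `chunk_rows` rows, so memory stays bounded.
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk


# Usage: three data rows read two at a time produce chunks of 2 and 1 rows.
example = io.StringIO("id,v\n1,a\n2,b\n3,c\n")
chunk_sizes = [len(chunk) for chunk in read_csv_in_chunks(example, 2)]
```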

./dataLoader

base_file_loader.py

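Base class for file loading. It declares the abstract method read_and_convert, which subclasses must implement.

get_invalid_chars # returns the characters that are illegal in file names
_split_dataframe_by_size # splits a DataFrame by size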
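
The abstract-method contract described above can be sketched with Python's abc module. The class names BaseFileLoader and TsvLoader, the constructor argument, and the return value are assumptions for illustration; only the method name read_and_convert comes from the source.

```python
from abc import ABC, abstractmethod


class BaseFileLoader(ABC):
    """Hypothetical base class: subclasses must implement read_and_convert."""

    def __init__(self, input_path):
        self.input_path = input_path

    @abstractmethod
    def read_and_convert(self):
        """Read the input file and return its rows in a common structure."""


class TsvLoader(BaseFileLoader):
    """Example subclass that splits tab-separated lines into lists."""

    def read_and_convert(self):
        with open(self.input_path, encoding="utf-8") as handle:
            return [line.rstrip("\n").split("\t") for line in handle]
```

Instantiating BaseFileLoader directly raises TypeError, which is how the base class forces every loader to provide its own read_and_convert.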

file_loader_factory

Factory for the file loader classes; it creates the data loader that matches the file type to be read.
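
A factory of this kind is often a lookup table keyed by file extension. The sketch below assumes that design; the names LOADERS and create_loader and the two placeholder classes are hypothetical and do not reflect the real file_loader_factory API.

```python
from pathlib import Path


class CsvLoader:
    """Placeholder loader used only to illustrate the factory."""


class ExcelLoader:
    """Placeholder loader used only to illustrate the factory."""


# Assumption: loaders are selected by file extension.
LOADERS = {
    ".csv": CsvLoader,
    ".tsv": CsvLoader,
    ".xlsx": ExcelLoader,
    ".xls": ExcelLoader,
}


def create_loader(file_path):
    """Return a loader instance matching the file extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return LOADERS[suffix]()
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
```

Keeping the mapping in one dictionary means a new file type only needs a new loader class and one extra entry.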

excel_loader

Read and convert logic for Excel files.

hdf5_loader

Read and convert logic for HDF5 files.

single_table_loader

Conversion of '.csv', '.sav', '.tsv', '.ods', '.parquet', and '.tab' files.

./dataWriter

base_writer_utils

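Base class for writing output files.

It contains the common methods:

_analyze_all_schemas # analyzes the schemas of the chunks
auto_convert_column # safer type conversion logic
clean_dataframe # cleans a DataFrame
write_dataframe_in_chunks # writes the file in chunks
save_as_output_format_with_chunks # output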
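
When a file is written in chunks, the chunks have to agree on one schema. The sketch below shows one possible reconciliation strategy and is not taken from _analyze_all_schemas: the name merge_chunk_schemas, the plain type-name strings, and the widening rules (int with float becomes float, any other conflict becomes string) are all assumptions.

```python
# Assumption: each chunk reports its schema as {column name: type name}.
def merge_chunk_schemas(schemas):
    """Combine per-chunk schemas into one schema every chunk can be cast to."""
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            current = merged.get(column)
            if current is None or current == dtype:
                merged[column] = dtype
            elif {current, dtype} == {"int", "float"}:
                merged[column] = "float"   # widen int to float
            else:
                merged[column] = "string"  # incompatible types fall back to string
    return merged
```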

parquet_writer.py

Parquet output.

pyarrow_writer.py

PyArrow output.

./despatch

./duckDBUtils

my_parquet_fts_indexer_upgrade

Reads a parquet file, identifies the string-typed columns, creates a DuckDB full-text index on them, and writes a DuckDB file.

DuckDBFTSIndexer.create_fts_index()

In every DuckDB file the data table is always named data, and the generated full-text index is always named fts_main_data.
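
The fixed names follow from how DuckDB's fts extension works: PRAGMA create_fts_index on a table named data stores the index in a schema named fts_main_data. The sketch below only builds the SQL text, so it runs without DuckDB installed. The helper names, the id column, and the text column names are hypothetical, and the values are interpolated without validation, so they must come from a trusted source.

```python
def build_fts_statements(parquet_path, id_column, text_columns):
    """Return the SQL statements that load a parquet file and index it."""
    columns = ", ".join(f"'{name}'" for name in text_columns)
    return [
        "INSTALL fts",
        "LOAD fts",
        f"CREATE TABLE data AS SELECT * FROM read_parquet('{parquet_path}')",
        # The index for table "data" is created in the schema fts_main_data.
        f"PRAGMA create_fts_index('data', '{id_column}', {columns})",
    ]


def build_search_query(id_column, search_text):
    """Return a BM25 search query over the indexed table."""
    escaped = search_text.replace("'", "''")  # escape single quotes for SQL
    return (
        "SELECT * FROM ("
        f"SELECT *, fts_main_data.match_bm25({id_column}, '{escaped}') AS score "
        "FROM data) AS scored "
        "WHERE score IS NOT NULL ORDER BY score DESC"
    )
```

Each statement could then be passed to execute on a connection opened with duckdb.connect on the output file. The index needs a column that uniquely identifies each row; whether DuckDBFTSIndexer adds one itself is not stated in the source.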

Types used in the statistics file statics.json

    "class_label",
    "float",
    "int",  
    "string_label",
    "string_text",
    "bool",
    "list",
    "datetime",
    "unknown"
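
To make the type names more concrete, the sketch below maps a column of Python values to one of them. It is an illustration only: the name classify_column, the threshold of 30 distinct values separating string_label from string_text, and the handling of mixed columns are assumptions. class_label is left out because the source does not say how it differs from string_label.

```python
from datetime import datetime

# Assumption: a string column with at most this many distinct values is a label.
MAX_LABEL_VALUES = 30


def classify_column(values, max_label_values=MAX_LABEL_VALUES):
    """Map a list of Python values to one of the statics.json type names."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    kinds = {type(v) for v in present}
    if kinds == {bool}:
        return "bool"
    if kinds == {int}:
        return "int"
    if kinds <= {int, float}:
        return "float"
    if kinds == {datetime}:
        return "datetime"
    if kinds == {list}:
        return "list"
    if kinds == {str}:
        distinct = len(set(present))
        return "string_label" if distinct <= max_label_values else "string_text"
    return "unknown"  # mixed or unsupported value types
```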

Converting text documents to Markdown

ProcessText2MD.py

Simply call

process_text_file(input_file_path, output_path)

Currently supported file types:

pdf
pptx
ppt
docx
doc
html
epub

Dependencies required by marker, which is used for the conversion:

conda install weasyprint -c conda-forge
conda install Pango -c conda-forge
conda install fontTools -c conda-forge
conda install -c conda-forge fontconfig
conda install -c conda-forge pillow
conda install -c conda-forge freetype libpng
conda install transformers -c conda-forge
conda install tensorflow -c conda-forge
# Core dependency
pip install marker-pdf[full]

Release history

This version

0.1.5

Sep 11, 2025

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

table_file_parser-0.1.5-py3-none-any.whl (63.9 kB)

Uploaded Sep 11, 2025 Python 3

File details

Details for the file table_file_parser-0.1.5-py3-none-any.whl.

File metadata

Download URL: table_file_parser-0.1.5-py3-none-any.whl
Upload date: Sep 11, 2025
Size: 63.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.18

File hashes

Hashes for table_file_parser-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7138a86260c7019b045cc2f95dd0a46eb290f763e4905f35b01c9ffce9da2b7a`
MD5	`e9324ab09392fc1f7626a7f0bcb43dcc`
BLAKE2b-256	`1f0c936c33a359a84c0214c206d466c284ee1ff1c0c0b51b558802a9dc7afa07`

table-file-parser 0.1.5
