Load table files and convert them to parquet or pyarrow format
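Code structure
ReadAndSplit_Chunk.py
Sets the input file path, the output path, the output format (optional), and the chunk size
FormParser class
convert_to_output_format: converts to the output format
def processing_single_file(input_file_path : str, output_dir_path : str, output_format : str, max_size : int) -> list[dict]
Parameters: path of the folder containing the files to convert, output path, output format (optional, default parquet), chunk size (optional, default 1 GB)
Returns: standardized sciendb info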
./huggingface
my_huggingface_descriptive_statistics
Generates a descriptive statistics file from a parquet file
DescriptiveStatisticsGenerator
generate_and_save_json()
./utils
my_pandas_type_confirming_utils
Type inference for string data
MyPandasTypeConfirmingUtils
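As an illustration of what type inference on string data can look like, here is a minimal sketch using only the standard library. The function name `infer_string_column_type`, the order of the checks, and the accepted boolean spellings are assumptions made for this example; the rules inside MyPandasTypeConfirmingUtils may differ.

```python
from datetime import datetime


def infer_string_column_type(values):
    """Guess the narrowest type that every non-empty string in `values` parses as.

    Hypothetical helper, not the package's API.
    """
    # Ignore missing and blank cells so they do not block inference.
    cleaned = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not cleaned:
        return "unknown"

    def all_parse(parser):
        # True only when the parser accepts every remaining value.
        for value in cleaned:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    def parse_bool(value):
        # Assumption: only "true"/"false" (any case) count as booleans.
        if value.lower() not in {"true", "false"}:
            raise ValueError(value)

    # Try the strictest interpretation first, then fall back step by step.
    if all_parse(parse_bool):
        return "bool"
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    return "string"
```

Checking the stricter types first keeps a column such as "1", "2" from being reported as float, while a single non-numeric cell sends the whole column back to string.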
Invaild_Chars
Filters illegal characters from file names and path names
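A filter of this kind can be sketched as follows. The character set (the characters Windows rejects plus ASCII control characters), the replacement character, and the name `sanitize_file_name` are assumptions for illustration; the list kept in Invaild_Chars may be different.

```python
import re

# Assumed set of illegal characters: < > : " / \ | ? * and ASCII control codes.
INVALID_CHARS_PATTERN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_file_name(name, replacement="_"):
    """Replace illegal characters and trim leading/trailing dots and spaces.

    Hypothetical helper, not the package's API.
    """
    cleaned = INVALID_CHARS_PATTERN.sub(replacement, name)
    cleaned = cleaned.strip(" .")
    # Never return an empty name.
    return cleaned or "unnamed"
```

For example, `sanitize_file_name("a/b:c?.csv")` yields `a_b_c_.csv`.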
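split_file_name_format_checker
DuckDB index creation follows this rule: when a large file has been split into chunks, only part00 is indexed and the remaining parts are skipped; a file that was not split is indexed directly. This tool identifies which files need a DuckDB index.
FormatChecker.check_index_generate_format()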
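The rule above can be sketched in a few lines. The chunk naming pattern `<name>_partNN.<ext>` and the function name `should_create_index` are assumptions; the source only states that the first chunk is called part00.

```python
import re
from pathlib import Path

# Assumed naming scheme for chunked output: "<name>_part00", "<name>_part01", ...
PART_SUFFIX = re.compile(r"_part(\d+)$")


def should_create_index(file_path):
    """Return True when a DuckDB index should be built for this file.

    Hypothetical helper, not FormatChecker itself.
    """
    stem = Path(file_path).stem
    match = PART_SUFFIX.search(stem)
    if match is None:
        # The file was not split, so it is indexed directly.
        return True
    # The file is one chunk of a split file: only part00 is indexed.
    return int(match.group(1)) == 0
```

With this naming assumption, `data.parquet` and `data_part00.parquet` are indexed, while `data_part01.parquet` is skipped.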
header_detector.py
Utility class for detecting header column names in csv and similar files
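One simple way to detect a header row is to compare the first row with the rows below it. The heuristic below (a text-only first row above at least one numeric cell) and the name `looks_like_header` are assumptions for this sketch, not a description of how header_detector.py works.

```python
import csv
import io


def _is_number(text):
    # Treat anything float() accepts as numeric.
    try:
        float(text)
        return True
    except ValueError:
        return False


def looks_like_header(csv_text, sample_rows=20):
    """Heuristically decide whether the first row of `csv_text` is a header.

    Hypothetical helper, not the package's API.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if len(rows) < 2:
        # Nothing to compare the first row against.
        return False
    first, body = rows[0], rows[1:sample_rows + 1]
    # A header cell is expected to be non-empty, non-numeric text.
    if any(cell.strip() == "" or _is_number(cell) for cell in first):
        return False
    # Text-only first row above numeric data suggests column names.
    return any(_is_number(cell) for row in body for cell in row)
```

The heuristic is deliberately conservative: a file made only of text columns is reported as having no detectable header, because the first row cannot be told apart from the data.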
csv_chunk_read_util
Utility class for reading csv in chunks; used to read large files and split them into chunks
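Chunked reading keeps memory use bounded by holding only a fixed number of rows at a time. The generator below is a standard-library sketch of that idea; `iter_csv_chunks` and its parameters are hypothetical names, and the package's utility may work differently.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(handle, chunk_rows):
    """Yield (header, rows) pairs with at most `chunk_rows` data rows each.

    Hypothetical helper, not the package's API.
    """
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        # Empty file: nothing to yield.
        return
    while True:
        # islice pulls the next batch without reading the whole file.
        rows = list(islice(reader, chunk_rows))
        if not rows:
            break
        yield header, rows


# Usage: three data rows read two at a time give two chunks.
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
chunks = list(iter_csv_chunks(sample, 2))
```

With pandas, the same pattern is available through the `chunksize` parameter of `read_csv`.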
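./dataLoader
base_file_loader.py
Base class for file reading. Declares the abstract method read_and_convert (read and convert), which subclasses must implement
get_invalid_chars # returns the characters that are illegal in file names
_split_dataframe_by_size # splits a DataFrame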
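Splitting by size usually comes down to estimating the bytes per row and deriving how many rows fit in one chunk. The sketch below computes the row ranges only; `plan_row_ranges` is a hypothetical name, and whether _split_dataframe_by_size uses the same estimate is an assumption.

```python
import math


def plan_row_ranges(total_rows, total_bytes, max_bytes):
    """Return (start, stop) row ranges so each chunk stays near `max_bytes`.

    Hypothetical helper, not the package's API.
    """
    if total_rows == 0:
        return []
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    bytes_per_row = total_bytes / total_rows
    if bytes_per_row == 0:
        # Degenerate case: everything fits in one chunk.
        rows_per_chunk = total_rows
    else:
        # Always keep at least one row per chunk, even for very wide rows.
        rows_per_chunk = max(1, math.floor(max_bytes / bytes_per_row))
    return [
        (start, min(start + rows_per_chunk, total_rows))
        for start in range(0, total_rows, rows_per_chunk)
    ]
```

Each range can then be applied to a DataFrame with `df.iloc[start:stop]`, using for instance `df.memory_usage(deep=True).sum()` as the byte estimate.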
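file_loader_factory
Factory for the file reading classes; creates the matching data loader for the file type passed in
excel_loader
Reading and conversion logic for excel files
hdf5_loader
Reading and conversion of hdf5 files
single_table_loader
Conversion of '.csv', '.sav', '.tsv', '.ods', '.parquet', '.tab' files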
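The factory and loader layout described above can be sketched as a mapping from file extension to loader class. All class and function names here are stand-ins, and the excel and hdf5 extensions are assumptions; only the single-table extensions come from the list above.

```python
from pathlib import Path


class BaseFileLoader:
    """Stand-in base class; subclasses must implement read_and_convert."""

    def read_and_convert(self, path):
        raise NotImplementedError


class SingleTableLoader(BaseFileLoader):
    def read_and_convert(self, path):
        return f"single-table read of {path}"


class ExcelLoader(BaseFileLoader):
    def read_and_convert(self, path):
        return f"excel read of {path}"


class Hdf5Loader(BaseFileLoader):
    def read_and_convert(self, path):
        return f"hdf5 read of {path}"


# Extension-to-loader table. The excel and hdf5 extensions are assumed.
_LOADERS = {
    **dict.fromkeys(
        [".csv", ".sav", ".tsv", ".ods", ".parquet", ".tab"], SingleTableLoader
    ),
    **dict.fromkeys([".xlsx", ".xls"], ExcelLoader),
    **dict.fromkeys([".h5", ".hdf5"], Hdf5Loader),
}


def create_loader(path):
    """Return a loader instance for the file type of `path`."""
    suffix = Path(path).suffix.lower()
    try:
        return _LOADERS[suffix]()
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
```

Keeping the mapping in one table means a new file type only needs a new loader class and one extra entry, with no change to calling code.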
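./dataWriter
base_writer_utils
Base class for writing output files
Contains the shared methods
_analyze_all_schemas # analyzes the schemas of the chunks
auto_convert_column # safer type conversion logic
clean_dataframe # cleans a DataFrame
write_dataframe_in_chunks # writes a file in chunks
save_as_output_format_with_chunks # writes the output
parquet_writer.py
parquet output
pyarrow_writer.py
pyarrow output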
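./despatch
./duckDBUtils
my_parquet_fts_indexer_upgrade
Reads a parquet file, identifies the string-typed columns, creates a DuckDB full-text index on them, and writes a duckdb file
DuckDBFTSIndexer.create_fts_index()
In every duckdb file the table name is fixed as: data, and the generated full-text index name is fixed as: fts_main_data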
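The fixed names follow from DuckDB's full-text search extension, which stores the index for a table `data` in a schema called `fts_main_data`. The sketch below only builds the SQL text; the helper names and the `row_id` identifier column are assumptions (the extension requires some unique id column, but the source does not name it).

```python
def build_fts_statements(text_columns, id_column="row_id"):
    """Return the SQL statements that create a full-text index on table `data`.

    Hypothetical helper. Column names are assumed to contain no single quotes.
    """
    if not text_columns:
        raise ValueError("at least one text column is required")
    quoted = ", ".join(f"'{name}'" for name in [id_column, *text_columns])
    return [
        "INSTALL fts",
        "LOAD fts",
        # First argument is the table, second the id column, then the text columns.
        f"PRAGMA create_fts_index('data', {quoted})",
    ]


def build_search_query(term, id_column="row_id"):
    """Return a BM25-ranked search query against the fts_main_data index."""
    escaped = term.replace("'", "''")  # escape single quotes for SQL
    return (
        "SELECT * FROM ("
        f"SELECT *, fts_main_data.match_bm25({id_column}, '{escaped}') AS score FROM data"
        ") AS ranked WHERE score IS NOT NULL ORDER BY score DESC"
    )
```

Each statement can then be run against the generated file, for example with `duckdb.connect(path).execute(statement)`.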
Types in the statistics file statics.json
"class_label",
"float",
"int",
"string_label",
"string_text",
"bool",
"list",
"datetime",
"unknown"
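To show how a column might be assigned one of these labels, here is a rough sketch that works on plain Python values. The function name, the cutoff of 30 unique values separating string_label from string_text, and the handling of mixed columns are all assumptions; class_label is never returned because it cannot be recognized from raw values alone.

```python
from datetime import date, datetime

# Assumed cutoff: string columns with at most this many unique values are labels.
MAX_STRING_LABELS = 30


def statistics_type(values):
    """Map a column of Python values to one of the statistics type labels.

    Hypothetical helper, not the package's API.
    """
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    kinds = {type(v) for v in present}
    if kinds <= {bool}:
        return "bool"
    if kinds <= {int}:
        return "int"
    if kinds <= {int, float}:
        return "float"
    if kinds <= {date, datetime}:
        return "datetime"
    if kinds <= {list}:
        return "list"
    if kinds <= {str}:
        unique_count = len(set(present))
        return "string_label" if unique_count <= MAX_STRING_LABELS else "string_text"
    # Mixed or unsupported value types.
    return "unknown"
```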
Converting text files to Markdown
ProcessText2MD.py
Call directly:
process_text_file(input_file_path, output_path)
Currently supported file types:
pdf
pptx
ppt
docx
doc
html
epub
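Before calling the converter it can be useful to check that a file is one of the supported types listed above. The sketch below does that from the file extension; `is_supported_text_file` is a hypothetical helper and is not part of ProcessText2MD.py.

```python
from pathlib import Path

# File types taken from the supported list above.
SUPPORTED_TEXT_TYPES = {"pdf", "pptx", "ppt", "docx", "doc", "html", "epub"}


def is_supported_text_file(path):
    """Return True when the extension of `path` is a supported text type."""
    extension = Path(path).suffix.lower().lstrip(".")
    return extension in SUPPORTED_TEXT_TYPES
```

A caller could skip or log unsupported files instead of passing them to `process_text_file`.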
Dependencies that must be installed for marker:
conda install weasyprint -c conda-forge
conda install Pango -c conda-forge
conda install fontTools -c conda-forge
conda install -c conda-forge fontconfig
conda install -c conda-forge pillow
conda install -c conda-forge freetype libpng
conda install transformers -c conda-forge
conda install tensorflow -c conda-forge
# Core dependency
pip install marker-pdf[full]