Integrated toolkit for the data analysis workflow

angels

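An integrated toolkit for the data analysis workflow, with five core modules: data acquisition, data cleaning, logic processing, analysis algorithms, and visualization.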

Installation

pip install angels

Packaging notes

This project is packaged with a modern pyproject.toml, which replaces the traditional setup.py.

Core modules

1. Data acquisition (data_acquisition)

File loading

load_csv(file_path): Load a CSV file and return a DataFrame
load_excel(file_path, sheet_name=0): Load an Excel file and return a DataFrame
load_json(file_path): Load a JSON file and return a DataFrame

API/Web data

load_from_api(url, params=None): Fetch data from an API and return a DataFrame
load_from_web(url, table_selector): Fetch data from a web page table and return a DataFrame

Database operations

get_db_config(): Get the database configuration
fetch_dataframe(sql, db_name='rpt'): Run a SQL query against the database and return a DataFrame
fetch_data_from_db(tbl_name, order_sql, db_name, batch_size=20000, batch_no=0): Fetch data from the database in batches and save it as Parquet files
concat_tbl_data(tbl_name): Merge multiple Parquet files into one file
save_df_to_database(df, db_name, table_name, if_exists='replace', ...): Save a DataFrame to the database

Date utilities

get_last_day_of_previous_month(): Get the last day of the previous month
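
As a rough illustration of the date helper above, the sketch below computes the last day of the previous month with the standard library only. The name `last_day_of_previous_month` and its optional `today` parameter are assumptions made for this example; the package's own get_last_day_of_previous_month() takes no arguments, and its return type is not documented here.

```python
from datetime import date, timedelta
from typing import Optional


def last_day_of_previous_month(today: Optional[date] = None) -> date:
    """Return the last calendar day of the month before `today` (hypothetical helper)."""
    today = today or date.today()
    # Day 1 of the current month minus one day is the last day of the previous month.
    return today.replace(day=1) - timedelta(days=1)


print(last_day_of_previous_month(date(2026, 4, 3)))  # 2026-03-31
```

Stepping back from the first of the month avoids any per-month day table and handles leap years and the January-to-December rollover without special cases.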

Messaging and encryption

send_dingtalk_message(status, user_name, content, ding_url=None): Send a DingTalk message
encrypt_dict_aes(key, data): Encrypt dictionary data with AES
decrypt_dict_aes(key, encrypted_data): Decrypt dictionary data with AES

LLM interface

call_llm_api(url, model, messages, headers=None, temperature=0.7, max_tokens=1000): Call an LLM API

2. Data cleaning (data_cleaning)

Basic cleaning

remove_duplicates(df): Remove duplicate rows
handle_missing_values(df, strategy='drop', fill_value=None): Handle missing values
- strategy: 'drop' (delete), 'fill' (fill with a value), 'mean' (mean), 'median' (median)
convert_data_types(df, dtypes=None): Convert data types
remove_outliers(df, columns=None, method='iqr', threshold=1.5): Remove outliers
- method: 'iqr' (interquartile range method), 'zscore' (Z-score method)
standardize_columns(df): Standardize column names
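
The four `strategy` values listed for handle_missing_values can be pictured with plain pandas calls. The sketch below is an assumption about what each strategy does, not the package's implementation; `fill_missing` is a hypothetical stand-in name.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame, strategy: str = "drop", fill_value=None) -> pd.DataFrame:
    """Hypothetical stand-in illustrating the four strategies; not the package's code."""
    if strategy == "drop":
        # Drop every row that contains at least one missing value.
        return df.dropna()
    if strategy == "fill":
        # Replace every missing value with the same constant.
        return df.fillna(fill_value)
    if strategy in ("mean", "median"):
        numeric = df.select_dtypes("number")
        stats = numeric.mean() if strategy == "mean" else numeric.median()
        # Filling with a Series applies each statistic to its own column;
        # non-numeric columns are left untouched.
        return df.fillna(stats)
    raise ValueError(f"unknown strategy: {strategy!r}")


df = pd.DataFrame({"value": [1.0, None, 3.0], "label": ["a", "b", None]})
print(fill_missing(df, "mean"))
```

Whether the real function restricts 'mean' and 'median' to numeric columns, as this sketch does, is not stated in the description above.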

Normalization

max_min_scaler(col): Min-max normalization
fill_missing_with_decay(df, col, year_col='Year', month_col='Month', decay_factor=0.8): Fill missing values using a decay factor
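
Min-max normalization, as named in max_min_scaler above, rescales values to the range [0, 1] with (x - min) / (max - min). The following is a minimal standard-library sketch of that formula; the function name `min_max_scale` and the choice to return zeros for a constant column are assumptions, not documented behavior of the package.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1] via (x - min) / (max - min) (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has zero range; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


scaled = min_max_scale([10, 20, 40])
print(scaled)
```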

Text cleaning

clean_chinese_class_text(text): Clean Chinese category text
clean_english_text(text, stopwords=None, remove_punctuation=True, remove_digits=False): Clean English text
detect_language(text, min_length=5): Detect the language of a text

Outlier handling

remove_outliers_iqr(series, multiplier=3): Remove outliers using the IQR method
remove_leading_nans(group, col): Remove leading NaN values
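
The IQR rule behind remove_outliers_iqr keeps values inside [Q1 - multiplier * IQR, Q3 + multiplier * IQR]. The sketch below shows that rule on a plain list using only the standard library. Because the usage example later in this page unpacks two return values, the sketch returns both the kept values and a keep-mask; that return contract, the name `iqr_filter`, and the quantile interpolation method are all assumptions.

```python
from statistics import quantiles
from typing import List, Sequence, Tuple


def iqr_filter(values: Sequence[float], multiplier: float = 3.0) -> Tuple[List[float], List[bool]]:
    """Return (kept_values, keep_mask) using the IQR fence rule (hypothetical helper)."""
    # method="inclusive" treats the data as the full population and
    # interpolates linearly between the sample minimum and maximum.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    mask = [lower <= v <= upper for v in values]
    kept = [v for v, keep in zip(values, mask) if keep]
    return kept, mask


kept, mask = iqr_filter([1, 2, 3, 4, 5, 100])
print(kept)  # [1, 2, 3, 4, 5]
```

A multiplier of 3 is a looser fence than the common 1.5, so only extreme points such as the 100 above are dropped.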

Statistical utilities

calculate_mode(series): Calculate the mode
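
A mode calculation has to decide what to do with ties and missing values. The sketch below shows one possible convention with the standard library: ignore `None` and return every most-frequent value. How calculate_mode itself treats ties is not documented above, so this behavior and the name `mode_with_ties` are assumptions.

```python
from statistics import multimode
from typing import Any, Iterable, List


def mode_with_ties(values: Iterable[Any]) -> List[Any]:
    """Return all most-frequent non-missing values, in first-seen order (hypothetical helper)."""
    present = [v for v in values if v is not None]
    # multimode returns an empty list for empty input instead of raising.
    return multimode(present)


print(mode_with_ties(["a", "b", "b", None, "a", "c"]))  # ['a', 'b']
```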

3. Logic processing (logic_processing)

Grouping and aggregation

group_by_aggregate(df, group_by, aggregations): Group by columns and aggregate
pivot_table(df, index, columns, values, aggfunc='mean'): Create a pivot table
calculate_group_percentage(df, group_cols, value_col): Calculate within-group percentages
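
A within-group percentage divides each row's value by the total of its group. The sketch below shows one way to express this in pandas; the helper name `add_group_percentage` and the output column name `pct_in_group` are hypothetical, and the package's calculate_group_percentage may name or scale its output differently.

```python
import pandas as pd


def add_group_percentage(df: pd.DataFrame, group_cols, value_col: str) -> pd.DataFrame:
    """Append each row's share of its group total, in percent (hypothetical helper)."""
    out = df.copy()
    # transform("sum") broadcasts each group's total back onto that group's rows.
    group_total = out.groupby(group_cols)[value_col].transform("sum")
    out["pct_in_group"] = out[value_col] / group_total * 100
    return out


df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "sales": [30.0, 70.0, 10.0, 20.0, 70.0],
})
result = add_group_percentage(df, ["region"], "sales")
print(result)
```

Using `transform` rather than `agg` keeps the original row count, so the percentage sits next to the raw value.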

Rolling statistics

calculate_rolling_stats(df, column, window, stats=['mean', 'std']): Calculate rolling statistics
calculate_diff(df, column, periods=1): Calculate differences
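
Rolling statistics and differences map onto the pandas `rolling` and `diff` methods. The sketch below is a guess at the shape of such helpers; `add_rolling_stats`, `add_diff`, and the generated column names are hypothetical. It uses a tuple for the default `stats` argument, since a list default is shared between calls in Python.

```python
import pandas as pd


def add_rolling_stats(df, column, window, stats=("mean", "std")):
    """Append one rolling-statistic column per entry in `stats` (hypothetical helper)."""
    out = df.copy()
    roll = out[column].rolling(window)
    for stat in stats:
        # Look up the Rolling method by name, e.g. roll.mean() or roll.std().
        out[f"{column}_rolling_{stat}"] = getattr(roll, stat)()
    return out


def add_diff(df, column, periods=1):
    """Append the difference between each row and the row `periods` steps earlier."""
    out = df.copy()
    out[f"{column}_diff"] = out[column].diff(periods)
    return out


df = pd.DataFrame({"v": [1.0, 2.0, 4.0, 7.0]})
res = add_diff(add_rolling_stats(df, "v", window=2, stats=("mean",)), "v")
print(res)
```

The first `window - 1` rolling values and the first `periods` differences are NaN because there is not enough history to compute them.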

Merging data

merge_dataframes(df1, df2, on=None, how='inner'): Merge two DataFrames
merge_multiple_dataframes(df_list, on_cols, how='outer'): Merge multiple DataFrames
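
Merging a list of DataFrames on shared key columns is commonly written as a fold over `pd.merge`. The sketch below shows that pattern; `merge_all` is a hypothetical name, and whether merge_multiple_dataframes works this way internally is an assumption.

```python
from functools import reduce

import pandas as pd


def merge_all(frames, on, how="outer"):
    """Merge a list of DataFrames pairwise, left to right (hypothetical helper)."""
    return reduce(lambda left, right: pd.merge(left, right, on=on, how=how), frames)


a = pd.DataFrame({"id": [1, 2], "x": [10, 20]})
b = pd.DataFrame({"id": [2, 3], "y": [0.5, 0.7]})
c = pd.DataFrame({"id": [1, 3], "z": ["p", "q"]})
merged = merge_all([a, b, c], on=["id"])
print(merged)
```

With `how="outer"`, every key seen in any input survives, and cells with no matching row are filled with NaN.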

Feature engineering

create_features(df): Create features (log transform, square)

Time periods

generate_half_year_periods(start_year=2020, end_year=2025): Generate a list of half-year reporting periods
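
To make the idea of half-year periods concrete, the sketch below enumerates two periods per year. The "YYYYH1"/"YYYYH2" label format, the inclusive `end_year`, and the name `half_year_periods` are all assumptions; the format returned by generate_half_year_periods is not documented above.

```python
from typing import List


def half_year_periods(start_year: int = 2020, end_year: int = 2025) -> List[str]:
    """List half-year labels from start_year through end_year, inclusive (hypothetical helper)."""
    # The "H1"/"H2" suffix is an illustrative label format, not the package's.
    return [
        f"{year}H{half}"
        for year in range(start_year, end_year + 1)
        for half in (1, 2)
    ]


print(half_year_periods(2024, 2025))  # ['2024H1', '2024H2', '2025H1', '2025H2']
```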

Business metric calculations

calculate_completion_rate_3years(df, date_col, finish_col, group_cols, ...): Calculate the 3-year completion rate
calculate_once_confirm_rate(df, group_cols, design_qty_col, confirm_col): Calculate the first-pass confirmation rate
calculate_doctor_repay_rate(df, group_cols, date_col, doc_level_col, doc_level_days=None): Calculate the doctor repay rate

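Data processing utilities

timer_decorator(func): Timing decorator
calculate_position_distance(dict1, dict2): Calculate the distance between positions
find_duplicate_patients(df, group_cols, time_col, threshold_days): Find duplicate patients
linear_programming_optimize(objective_coeffs, constraints, bounds=None, maximize=True): Solve a linear program (using the PuLP library)
- Source projects: linear_program, work-order assignment (工单分配)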
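
The timer_decorator entry in the list above describes a common pattern: wrap a function, measure wall-clock time, and report it. The following is a minimal sketch of such a decorator; the name `timer` and the printed message format are assumptions, not the package's output.

```python
import functools
import time


def timer(func):
    """Print how long each call to `func` takes (hypothetical helper)."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failures are timed too.
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.4f}s")
    return wrapper


@timer
def slow_sum(n):
    return sum(range(n))


total = slow_sum(1_000_000)
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and intended for measuring short intervals.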

4. Analysis algorithms (analysis_algorithms)

Descriptive statistics

descriptive_statistics(df): Calculate descriptive statistics
correlation_analysis(df, method='pearson'): Calculate the correlation matrix

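Cluster analysis

kmeans_clustering(df, n_clusters=3, random_state=42): K-means clustering

Regression analysis

linear_regression(X, y): Linear regression

Time series

time_series_analysis(df, time_column, value_column): Time series analysis (moving average, year-over-year growth)

Hypothesis testing

hypothesis_testing(sample1, sample2): Hypothesis test (t-test)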

Normality testing

check_normality(data, sample_size_threshold=5000, method='auto', alpha=0.05, plot=True, ...): Comprehensive normality test
- method: 'auto' (automatic selection), 'shapiro' (Shapiro-Wilk), 'lilliefors' (Lilliefors), 'anderson' (Anderson-Darling)
- Source project: Sissi's statistical analysis

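Group comparison

compare_groups_single_factor(df, group_col, value_col, alpha=0.05, normality_threshold=5000): Single-factor comparison between groups
- Selects the test automatically: ANOVA for normally distributed data, Kruskal-Wallis for non-normal data
- Source projects: Sissi's statistical analysis, integrated case stage feedback (一体病例阶段反馈)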
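
The automatic test selection described above can be sketched as a small decision rule over per-group normality p-values (for example from a Shapiro-Wilk test). The rule below, which requires every group to pass before choosing ANOVA, is an assumption; the package may instead test pooled data or residuals. `choose_group_test` and its return labels are hypothetical.

```python
from typing import Dict


def choose_group_test(normality_pvalues: Dict[str, float], alpha: float = 0.05) -> str:
    """Pick ANOVA only if no group rejects normality at level alpha (hypothetical helper)."""
    # p > alpha means the normality test did not reject the null for that group.
    all_normal = all(p > alpha for p in normality_pvalues.values())
    return "anova" if all_normal else "kruskal-wallis"


print(choose_group_test({"A": 0.40, "B": 0.21}))   # anova
print(choose_group_test({"A": 0.40, "B": 0.003}))  # kruskal-wallis
```

Failing to reject normality is not proof of normality, so with small groups this kind of rule leans toward ANOVA more often than the data may justify.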

Report generation

generate_pdf_report(df, group_col, value_col, output_file='report.pdf', ...): Generate a PDF statistical report
- Source project: Sissi's statistical analysis

Machine learning - classification

train_xgboost_classifier(X, y, params=None, save_model_path=None): Train an XGBoost classifier
- Source projects: doc_type_model, designer_profile_weights

Machine learning - regression

train_random_forest_regressor(X, y, n_estimators=5, max_depth=4, random_state=42): Train a random forest regressor
- Source project: designer_profile_weights
train_decision_tree_regressor(X, y, max_leaf_nodes=4, min_samples_leaf=0.05): Train a decision tree regressor
- Source project: designer_profile_weights

Model interpretation

shap_analysis(model, X_test, feature_names=None): SHAP value analysis and visualization
- Source project: doc_type_model
plot_feature_importance(model, feature_names, figsize=(6, 10)): Plot feature importances

Feature engineering

tree_binning(X, y, n_bins=4): Decision tree binning
- Source project: designer_profile_weights

5. Visualization (visualization)

Basic charts

plot_histogram(df, column, bins=30, title=None): Plot a histogram
plot_scatter(df, x, y, hue=None, title=None): Plot a scatter plot
plot_bar(df, x, y, title=None): Plot a bar chart
plot_box(df, x, y=None, title=None): Plot a box plot
plot_time_series(df, x, y, title=None): Plot a time series chart

Statistical visualization

plot_correlation_heatmap(df, method='pearson', figsize=(12, 10), ...): Plot a correlation heatmap
plot_clustermap(df, method='single', metric='euclidean', ...): Plot a clustered heatmap
plot_distribution(df, col, plot_type='histogram', bins=30, ...): Plot a distribution

Group comparison

plot_group_comparison(df, group_col, value_col, res, ...): Plot a group comparison chart (with statistical annotations)
- Source projects: Sissi's analysis, doc_type_model
plot_bar_by_group(df, group_col, value_col, agg_func='mean', ...): Plot a bar chart by group

Specialized charts

plot_stacked_barh(df, category_col, value_col, group_col, colors=None, ...): Plot a stacked horizontal bar chart
plot_colored_bar(values, colors, figsize=(8, 2), save_path=None): Plot a color-coded bar chart
plot_score_histogram(df, score_col, hue_col=None, bins=20, ...): Plot a score histogram
plot_with_error_band(x, y, error, figsize=(10, 6), title=None, ...): Plot a line chart with an error band

Usage example

from angels import (
    # Data acquisition
    load_csv, fetch_dataframe,
    # Data cleaning
    remove_duplicates, remove_outliers_iqr,
    # Logic processing
    group_by_aggregate, calculate_group_percentage,
    # Analysis algorithms
    check_normality, compare_groups_single_factor, train_xgboost_classifier,
    # Visualization
    plot_group_comparison, plot_correlation_heatmap
)

# 1. Data acquisition
df = load_csv('data.csv')
# Alternatively, fetch from a database instead of loading the CSV:
# df = fetch_dataframe("SELECT * FROM table", db_name='rpt')

# 2. Data cleaning
df = remove_duplicates(df)
df_clean, mask = remove_outliers_iqr(df['value'])

# 3. Normality test
result = check_normality(df['value'], plot=True)

# 4. Group comparison
comp_result = compare_groups_single_factor(df, group_col='group', value_col='value')

# 5. Visualization
plot_group_comparison(df, 'group', 'value', comp_result)

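Dependencies

Package	Purpose
pandas	Data processing
numpy	Numerical computing
matplotlib	Base plotting library
seaborn	Statistical visualization
scikit-learn	Machine learning
scipy	Scientific computing
statsmodels	Statistical modeling
pingouin	Statistical tests
scikit-posthocs	Post-hoc tests
xgboost	XGBoost models
shap	Model interpretation
requests	HTTP requests
beautifulsoup4	Web page parsing
pymysql	MySQL connections
pyarrow	Parquet file support
pycryptodome	AES encryption
pulp	Linear programming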

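Version history

v0.1.3: Added the pulp dependency (linear programming)
v0.1.2: Added the xgboost, shap, and pycryptodome dependencies; expanded function set (150+ functions)
v0.1.1: Initial version with core data processing features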

Project details

These details have not been verified by PyPI

License: OSI Approved :: MIT License
Operating System: OS Independent
Programming Language: Python :: 3

Release history

0.1.3 (this version): Apr 3, 2026
0.1.2: Mar 24, 2026
0.1.1: Mar 24, 2026
0.1.0: Mar 20, 2026

Download files

Source Distribution

angels-0.1.3.tar.gz (57.4 kB)

Uploaded Apr 3, 2026 Source

Built Distribution

angels-0.1.3-py3-none-any.whl (56.6 kB)

Uploaded Apr 3, 2026 Python 3

File details

Details for the file angels-0.1.3.tar.gz.

File metadata

Download URL: angels-0.1.3.tar.gz
Upload date: Apr 3, 2026
Size: 57.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for angels-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`290baa4db090235e04142a105ee5b4ba119e632a9b8622789a4d46811b72a2e5`
MD5	`e9466a3ce8fc14cd38f1bbcc75757568`
BLAKE2b-256	`ac541f48355054cdf51bb35091fa7ded8af9f66e3c7ff73bf5ca76edd572a607`

File details

Details for the file angels-0.1.3-py3-none-any.whl.

File metadata

Download URL: angels-0.1.3-py3-none-any.whl
Upload date: Apr 3, 2026
Size: 56.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for angels-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c05b7c76102f9e601c5d9b2668bbea0537e9cb94800f7a1bde8ba4fda1f69788`
MD5	`089ded016267bd495a09d2334b77fc2e`
BLAKE2b-256	`8d3c365b10afa9c36a0afc12a623cf98783ac464eb3897a3288a7efa2091406f`
