content tagging and index generator with maxcompute
Project description
tagging & Index Generating for firm level data with maxcompute
initialize maxcompute account
- Install Aliyun CLI: Install guide
- run the aliyun configure command to setup account
$ aliyun configure
Configuring profile 'default' ...
Aliyun Access Key ID [None]: <Your AccessKey ID>
Aliyun Access Key Secret [None]: <Your AccessKey Secret>
Default Region Id [None]: cn-zhangjiakou
Default output format [json]: json
Default Language [zh]: zh
define Tags
add new configs
- by csv, folder should include 3 files: - tag_list.csv - prefix.csv - suffix.csv
from tagging_index.tag_processor import TagProcessor
import os
processor = TagProcessor()
tag_config_folder = os.path.join(os.getcwd(), "tag_config")
processor.append_new_config_csv(tag_config_folder)
- by json
from tagging_index.tag_processor import TagProcessor
import os
processor = TagProcessor()
tag_config_file = os.path.join(os.getcwd(), "tag_config.json")
processor.append_new_config_json(tag_config_file)
add current config
load existing config from maxcompute or local json file to compare with new config
# load the lastest version from maxcompute
processor.load_current_config()
# load the certain version from maxcompute
processor.load_current_config("202401010111")
validate config
validate and print tag tree
validate_result = processor.validate()
pprint(validate_result)
processor.show_tree(root_tag="tag_value",levels=1)
save config
processor.save_to_json(os.path.join(os.getcwd(), "new_config.json"))
# create and save to a new version in maxcompute
processor.save_to_version()
update tag config for udf resource
from tagging_index.maxcompute.udf_release import UdfRelease
local_result = udf.test_local(content, 0)
udf = UdfRelease()
# release udf only when _udf module updated.
udf.release_udf()
# default to use lastest version
udf.update_dim_resource(version="")
test = udf.test_udf(content, 0)
print(test)
index generation
please notice you need to update the tagging result in maxcompute before generate index
- define index, refer to [index_tag_schema.json]
from tagging_index.index_generator import DemandIndexGenerator,TalentIndexGenerator
DemandIndexGenerator.get_index_schema()
-
generate index
- use predefine generator
demand_index = DemandIndexGenerator("index_tag_definition.json")
talent_index = TalentIndexGenerator("index_tag_definition.json")
# list index code with index type suffix
print(talent_index.index_codes)
# set index range
talent_index.start_year = 2018
talent_index.end_year = 2019
# datasource version
talent_index.tag_udf_version="20240604110353.8@6@6"
# check sql script
print(demand_index.generate_sql(['IT_total']).get('IT_total'))
# generate index data and return dataframe
# talent_index.get_index_data('IT_total')
# generate index data and save in maxcompute, ignore index_codes param to generate all
talent_index.generate_index()
# generate the firm level total count in the datasource
talent_index.generate_index_ttl()
- use general generator
from tagging_index.index_generator import (
GeneralIndexGenerator,
SourceTable,
TagSourceTable,
TagIndexConfig,
)
source_table = SourceTable(
"dwd_china_listed_company_roadshow_qna",
"full_stock_id",
"lst_com",
"qna_id",
"to_char(show_date,'YYYY')",
"to_char(show_date,'YYYY')",
"pt='20240301'",
)
tag_source_table = TagSourceTable(
"dwd_tag_lst_roadshow_china",
"full_stock_id",
"lst_com",
"qna_id",
"show_year",
"show_year",
)
tag_idx_cfg: TagIndexConfig = TagIndexConfig(
"earnings-call", "E", tag_source_table, source_table
)
tag_desc_file = (
f"{pathlib.Path(os.getcwd())}/index_definition/root_kw0604.json"
)
gen = GeneralIndexGenerator(tag_desc_file,tag_idx_cfg)
# gen.generate_index_ttl()
print(list(gen.generate_sql(["AI_CNpatent_Babina_Alekseeva_0604"]).values())[0])
generate panel data from index data and maxtrix varibles
from tagging_index.data_generator import PanelDataGenerator, VariableMapOther
from tagging_index.index_generator import DemandIndexGenerator
panel_data = PanelDataGenerator()
# set empty array for all comps
panel_data.comp_ids = ['603893.SH', '300158.SZ', "000001.SZ"]
panel_data.start_year=2019
panel_data.end_year=2020
panel_data.index_version = "{your_index_version}"
# add index
panel_data.add_index('IT_total_T')
panel_data.add_index('IT_total_D')
# add source base index (total count)
panel_data.add_source_base("demand",'demand_total')
# add performance matrix
panel_data.add_matrix(code='Y0601b',column_name='emp_no')
panel_data.add_matrix('F100801A', 'mkt_value')
# add additional variable from ods table
basic_info ="(select * from ods_csmar_ipo_cobasic where pt=max_pt('ods_csmar_ipo_cobasic'))"
panel_data.add_other_var(VariableMapOther(
basic_info
,'estbdt'
,dim_comp_id='stock_id'
,col_comp_id='stkcd'))
print(panel_data.get_panel_sql())
panel_data.get_result_df().tail(500)
panel_data.save_to_csv("panel_data.csv")
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
tagging_index-0.0.5.tar.gz
(29.2 kB
view details)
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file tagging_index-0.0.5.tar.gz.
File metadata
- Download URL: tagging_index-0.0.5.tar.gz
- Upload date:
- Size: 29.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8d3e5b8e78076d4cc5d35ef8a56e82e8e2509c23e58e4ed33d38e41dcac57cd9
|
|
| MD5 |
b27c2584242640d9e01088fcd115b13a
|
|
| BLAKE2b-256 |
156d5bd4938ee29bf6ef3c6dce1a2849962c1221b5f5310fc8529b045d50f76a
|
File details
Details for the file tagging_index-0.0.5-py3-none-any.whl.
File metadata
- Download URL: tagging_index-0.0.5-py3-none-any.whl
- Upload date:
- Size: 34.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2a888f8767ad40bf9f1bcabd410ba760bb237f9b581bcf2a494a21e22a903a5c
|
|
| MD5 |
459ce88454dbca0866ed24ffc9a06e12
|
|
| BLAKE2b-256 |
6d2930d9f7bd981d9a070afacca96fac278915dd4c5f3768be799eb884d8e37a
|