Content tagging and index generator with MaxCompute
Tagging and index generation for firm-level data with MaxCompute
initialize MaxCompute account
- Install the Aliyun CLI: Install guide
- Run the aliyun configure command to set up your account:
$ aliyun configure
Configuring profile 'default' ...
Aliyun Access Key ID [None]: <Your AccessKey ID>
Aliyun Access Key Secret [None]: <Your AccessKey Secret>
Default Region Id [None]: cn-zhangjiakou
Default output format [json]: json
Default Language [zh]: zh
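After running the command above, it can be useful to confirm that the profile was actually written. The following is a minimal sketch, assuming the Aliyun CLI stores its profiles in ~/.aliyun/config.json as a JSON object with a "profiles" list whose entries carry "name" and "region_id" keys. The helper names find_profile and load_cli_config are hypothetical and are not part of tagging_index.

```python
import json
from pathlib import Path

# Assumption: the Aliyun CLI keeps its profiles in ~/.aliyun/config.json
DEFAULT_CONFIG = Path.home() / ".aliyun" / "config.json"


def find_profile(config: dict, name: str = "default"):
    """Return the profile dict with the given name, or None if absent."""
    for profile in config.get("profiles", []):
        if profile.get("name") == name:
            return profile
    return None


def load_cli_config(path: Path = DEFAULT_CONFIG) -> dict:
    """Read the CLI config file and return it as a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    if DEFAULT_CONFIG.exists():
        profile = find_profile(load_cli_config())
        # Print only the region, never the access keys
        print("region:", profile.get("region_id") if profile else "profile not found")
    else:
        print("no CLI config found at", DEFAULT_CONFIG)
```

The script prints only the region id, so it is safe to run in a shared terminal.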
define Tags
add new configs
- by csv; the folder should include 3 files:
  - tag_list.csv
  - prefix.csv
  - suffix.csv
from tagging_index.tag_processor import TagProcessor
import os
processor = TagProcessor()
tag_config_folder = os.path.join(os.getcwd(), "tag_config")
processor.append_new_config_csv(tag_config_folder)
- by json
from tagging_index.tag_processor import TagProcessor
import os
processor = TagProcessor()
tag_config_file = os.path.join(os.getcwd(), "tag_config.json")
processor.append_new_config_json(tag_config_file)
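For the CSV route, the folder is expected to hold tag_list.csv, prefix.csv and suffix.csv. A small pre-flight check, sketched below with the standard library only, can report missing files before the folder is handed to append_new_config_csv. The helper missing_config_files is hypothetical and not part of tagging_index.

```python
from pathlib import Path

# File names taken from the CSV config description above
REQUIRED_FILES = ("tag_list.csv", "prefix.csv", "suffix.csv")


def missing_config_files(folder):
    """Return the required CSV file names that are absent from `folder`."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

A possible use is to call missing_config_files(tag_config_folder) first and only load the config when the returned list is empty.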
add current config
Load the existing config from MaxCompute or a local JSON file to compare it with the new config.
# load the latest version from MaxCompute
processor.load_current_config()
# load a specific version from MaxCompute
processor.load_current_config("202401010111")
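The version string in the example looks like a minute-resolution timestamp (yyyymmddHHMM). That reading is an assumption, not something the package documents here. Under that assumption, a list of known version strings could be ordered as sketched below. Both helpers are hypothetical.

```python
from datetime import datetime


def parse_version(version: str) -> datetime:
    """Parse a version id, assuming the yyyymmddHHMM layout."""
    return datetime.strptime(version, "%Y%m%d%H%M")


def latest_version(versions):
    """Return the most recent version id from an iterable of ids."""
    return max(versions, key=parse_version)
```

Parsing rather than comparing raw strings also rejects malformed ids early, since strptime raises ValueError on a string that does not fit the layout.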
validate config
Validate the config and print the tag tree.
from pprint import pprint
validate_result = processor.validate()
pprint(validate_result)
processor.show_tree(root_tag="tag_value", levels=1)
save config
# save to a local JSON file
processor.save_to_json(os.path.join(os.getcwd(), "new_config.json"))
# create a new version in MaxCompute and save the config to it
processor.save_to_version()
update tag config for udf resource
from tagging_index.maxcompute.udf_release import UdfRelease
udf = UdfRelease()
# release the UDF only when the _udf module has been updated
udf.release_udf()
# defaults to the latest version
udf.update_dim_resource(version="")
index generation
Note: update the tagging result in MaxCompute before generating the index.
- define the index; refer to [index_tag_schema.json] for the schema
from tagging_index.index_generator import DemandIndexGenerator,TalentIndexGenerator
DemandIndexGenerator.get_index_schema()
- generate index
demand_index = DemandIndexGenerator("index_tag_definition.json")
talent_index = TalentIndexGenerator("index_tag_definition.json")
# list index codes with their index-type suffix
print(talent_index.index_codes)
# set the index year range
talent_index.start_year = 2018
talent_index.end_year = 2019
# datasource version
talent_index.tag_udf_version = "20240604110353.8@6@6"
# check the SQL script
print(demand_index.generate_sql(['IT_total']).get('IT_total'))
# generate index data and return a dataframe
# talent_index.get_index_data('IT_total')
# generate index data and save it in MaxCompute; omit the index_codes param to generate all
talent_index.generate_index()
# generate the firm-level total count in the datasource
talent_index.generate_index_ttl()
- generate panel data from index data and matrix variables
from tagging_index.data_generator import PanelDataGenerator, VariableMapOther
panel_data = PanelDataGenerator()
panel_data.index_version = "index_version"
panel_data.add_index('IT_total_T')
panel_data.add_index('IT_total_D')
panel_data.add_matrix(code='Y0601b',column_name='emp_no')
panel_data.add_matrix('F100801A', 'mkt_value')
# basic_info is not defined in this snippet; define it before this call
panel_data.add_other_var(VariableMapOther(
    basic_info,
    'estbdt',
    dim_comp_id='stock_id',
    col_comp_id='stkcd'))
sql = panel_data.get_panel_sql()
print(sql)
panel_data.get_result_df().tail(500)
panel_data.save_to_csv("panel_data.csv")
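The panel step refers to index codes such as 'IT_total_T' and 'IT_total_D', while the generators are called with the base code 'IT_total'. The sketch below builds the suffixed codes from a base code, assuming that "_T" marks the talent index and "_D" the demand index. That mapping is inferred from the generator names, and the helpers are hypothetical.

```python
# Assumption: "_T" marks talent indexes and "_D" marks demand indexes
INDEX_TYPE_SUFFIX = {"talent": "T", "demand": "D"}


def suffixed_code(base_code: str, index_type: str) -> str:
    """Append the index-type suffix to a base index code."""
    return f"{base_code}_{INDEX_TYPE_SUFFIX[index_type]}"


def panel_index_codes(base_codes, index_types=("talent", "demand")):
    """Expand base codes into suffixed codes, one per index type."""
    return [suffixed_code(code, kind) for code in base_codes for kind in index_types]


print(panel_index_codes(["IT_total"]))  # ['IT_total_T', 'IT_total_D']
```

Each returned code could then be passed to panel_data.add_index in a loop instead of being written out by hand.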
File details
Details for the file tagging_index-0.0.2b0.tar.gz.
File metadata
- Download URL: tagging_index-0.0.2b0.tar.gz
- Upload date:
- Size: 31.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.9.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 470650d0bc028cf09a5eec7e460934bfe7d36cf28744900d8ac9117d0bd39e58 |
| MD5 | 38b9e2bf64f937c9e3c5fb477427013c |
| BLAKE2b-256 | 96889730f7fb474bc0a0717ec6b4a8e09c6b969517931d2c5137b62fd8cd5116 |
File details
Details for the file tagging_index-0.0.2b0-py3-none-any.whl.
File metadata
- Download URL: tagging_index-0.0.2b0-py3-none-any.whl
- Upload date:
- Size: 38.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.9.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 774e00eb1f846a76307c5d775ced866ce91dc1b4b826ca15a5097653c693d5c2 |
| MD5 | 2285d748c0519af533226fc88ee02291 |
| BLAKE2b-256 | 72b04322c911c150cedf774247c718347e727eda943afd9d8b018f211536c0bc |