Skip to main content

content tagging and index generator with maxcompute

Project description

tagging & Index Generating for firm level data with maxcompute

initialize maxcompute account

  • Install Aliyun CLI: Install guide
  • run the aliyun configure command to setup account
$ aliyun configure
Configuring profile 'default' ...
Aliyun Access Key ID [None]: <Your AccessKey ID>
Aliyun Access Key Secret [None]: <Your AccessKey Secret>
Default Region Id [None]: cn-zhangjiakou
Default output format [json]: json
Default Language [zh]: zh

define Tags

add new configs

  • by csv, folder should include 3 files: - tag_list.csv - prefix.csv - suffix.csv
from tagging_index.tag_processor import TagProcessor
import os
processor = TagProcessor()
tag_config_folder = os.path.join(os.getcwd(), "tag_config")
processor.append_new_config_csv(tag_config_folder)
  • by json
from tagging_index.tag_processor import TagProcessor
import os
processor = TagProcessor()
tag_config_file = os.path.join(os.getcwd(), "tag_config.json")
processor.append_new_config_json(tag_config_file)

add current config

load existing config from maxcompute or local json file to compare with new config

# load the lastest version from maxcompute
processor.load_current_config()
# load the certain version from maxcompute
processor.load_current_config("202401010111")

validate config

validate and print tag tree

validate_result = processor.validate()
pprint(validate_result)
processor.show_tree(root_tag="tag_value",levels=1)

save config

processor.save_to_json(os.path.join(os.getcwd(), "new_config.json"))
# create and save to a new version in maxcompute
processor.save_to_version()

update tag config for udf resource

from tagging_index.maxcompute.udf_release import UdfRelease

udf = UdfRelease()
# release udf only when _udf module updated.
udf.release_udf()
# default to use lastest version
udf.update_dim_resource(version="")

index generation

please notice you need to update the tagging result in maxcompute before generate index

  • define index, refer to [index_tag_schema.json]
from tagging_index.index_generator import DemandIndexGenerator,TalentIndexGenerator
DemandIndexGenerator.get_index_schema()
  • generate index
demand_index = DemandIndexGenerator("index_tag_definition.json")
talent_index = TalentIndexGenerator("index_tag_definition.json")
# list index code with index type suffix
print(talent_index.index_codes)
# set index range
talent_index.start_year = 2018
talent_index.end_year = 2019
# datasource version
talent_index.tag_udf_version="20240604110353.8@6@6"
# check sql script
print(demand_index.generate_sql(['IT_total']).get('IT_total'))
# generate index data and return dataframe
# talent_index.get_index_data('IT_total')
# generate index data and save in maxcompute, ignore index_codes param to generate all
talent_index.generate_index()
# generate the firm level total count in the datasource
talent_index.generate_index_ttl()
  • generate panel data from index data and maxtrix varibles
from tagging_index.data_generator import PanelDataGenerator, VariableMapOther
from tagging_index.index_generator import DemandIndexGenerator

panel_data = PanelDataGenerator()
# set empty array for all comps
panel_data.comp_ids = ['603893.SH', '300158.SZ', "000001.SZ"]
panel_data.start_year=2019
panel_data.end_year=2020
panel_data.index_version = "<<index_version>>"
# add index
panel_data.add_index('IT_total_T')
panel_data.add_index('IT_total_D')
# add source base index (total count)
panel_data.add_source_base(DemandIndexGenerator,'demand_total')
# add performance matrix
panel_data.add_matrix(code='Y0601b',column_name='emp_no')
panel_data.add_matrix('F100801A', 'mkt_value')
# add additional variable from ods table
basic_info ="(select * from ods_csmar_ipo_cobasic where pt=max_pt('ods_csmar_ipo_cobasic'))"
panel_data.add_other_var(VariableMapOther(
    basic_info
    ,'estbdt'
    ,dim_comp_id='stock_id'
    ,col_comp_id='stkcd'))
print(panel_data.get_panel_sql())
panel_data.get_result_df().tail(500)
panel_data.save_to_csv("panel_data.csv")

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tagging_index-0.0.4b5.tar.gz (26.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tagging_index-0.0.4b5-py3-none-any.whl (33.1 kB view details)

Uploaded Python 3

File details

Details for the file tagging_index-0.0.4b5.tar.gz.

File metadata

  • Download URL: tagging_index-0.0.4b5.tar.gz
  • Upload date:
  • Size: 26.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for tagging_index-0.0.4b5.tar.gz
Algorithm Hash digest
SHA256 c0f93ae27cb461448b42f299a8809c3f75f753885037924cec6f8999ffd2633e
MD5 cf6b745fd491e4219d2474a83adefee4
BLAKE2b-256 81995ee207d68b0a199ae8b1ed4858cfe9ad1e0f0be574e76d89b86cf32720d7

See more details on using hashes here.

File details

Details for the file tagging_index-0.0.4b5-py3-none-any.whl.

File metadata

File hashes

Hashes for tagging_index-0.0.4b5-py3-none-any.whl
Algorithm Hash digest
SHA256 9ba1f5979ba67bf5d52f9891915c4101e17d28e74897130bbefc439a16a4f21d
MD5 21e97f58487233e0b043839e5e3f7b19
BLAKE2b-256 d530a49b23bb8141d99e214ed8ffa3e5b7122b795a870a9820b8b7ab5ac4c0f8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page