
Content tagging and index generator with MaxCompute

Tagging and index generation for firm-level data with MaxCompute.

initialize MaxCompute account

  • Install Aliyun CLI: Install guide
  • run the aliyun configure command to set up the account
$ aliyun configure
Configuring profile 'default' ...
Aliyun Access Key ID [None]: <Your AccessKey ID>
Aliyun Access Key Secret [None]: <Your AccessKey Secret>
Default Region Id [None]: cn-zhangjiakou
Default output format [json]: json
Default Language [zh]: zh
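
Once aliyun configure has finished, it can help to confirm that the profile is complete before running anything against MaxCompute. The sketch below is one hedged way to do that. It assumes the CLI stores profiles in ~/.aliyun/config.json under a "profiles" list with the fields "name", "access_key_id", "access_key_secret" and "region_id"; those names are assumptions about the CLI's file format, and find_profile, missing_fields and check_config_file are hypothetical helpers that are not part of tagging_index.

```python
import json
from pathlib import Path

# Assumed location of the Aliyun CLI config file.
DEFAULT_CONFIG_PATH = Path.home() / ".aliyun" / "config.json"

# Assumed field names for an AccessKey-based profile.
REQUIRED_FIELDS = ("access_key_id", "access_key_secret", "region_id")


def find_profile(config, name="default"):
    """Return the named profile dict, or None if it is absent."""
    for profile in config.get("profiles", []):
        if profile.get("name") == name:
            return profile
    return None


def missing_fields(profile):
    """List required fields that are absent or empty in a profile."""
    return [field for field in REQUIRED_FIELDS if not profile.get(field)]


def check_config_file(path=DEFAULT_CONFIG_PATH, name="default"):
    """Load the config file and report missing fields for one profile."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    profile = find_profile(config, name)
    if profile is None:
        raise KeyError(f"profile {name!r} not found in {path}")
    return missing_fields(profile)
```

Calling check_config_file() with no arguments inspects the default profile; an empty list means no required field is missing.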

define Tags

add new configs

  • by csv; the folder should include 3 files: tag_list.csv, prefix.csv and suffix.csv
from tagging_index.tag_processor import TagProcessor
import os
processor = TagProcessor()
tag_config_folder = os.path.join(os.getcwd(), "tag_config")
processor.append_new_config_csv(tag_config_folder)
  • by json
from tagging_index.tag_processor import TagProcessor
import os
processor = TagProcessor()
tag_config_file = os.path.join(os.getcwd(), "tag_config.json")
processor.append_new_config_json(tag_config_file)
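
Because the csv route expects three specific files, a quick pre-flight check can catch a missing file before the processor is called. The following is a minimal sketch using only the standard library; missing_config_files is a hypothetical helper, not a tagging_index function, and it checks only that the files exist, not their contents.

```python
import os

# File names taken from the csv option described above.
REQUIRED_CSV_FILES = ("tag_list.csv", "prefix.csv", "suffix.csv")


def missing_config_files(folder):
    """Return the required CSV file names that are not present in folder."""
    return [
        name
        for name in REQUIRED_CSV_FILES
        if not os.path.isfile(os.path.join(folder, name))
    ]


if __name__ == "__main__":
    folder = os.path.join(os.getcwd(), "tag_config")
    missing = missing_config_files(folder)
    if missing:
        print("missing config files:", ", ".join(missing))
    else:
        print("all config files found")
```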

add current config

Load the existing config from MaxCompute or a local JSON file to compare it with the new config.

# load the latest version from MaxCompute
processor.load_current_config()
# load a specific version from MaxCompute
processor.load_current_config("202401010111")
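
The point of loading the current config is to compare it with the new one. To illustrate what such a comparison can look like outside the library, the sketch below diffs two configs that were saved as JSON. It is not the library's own logic: diff_configs and diff_config_files are hypothetical helpers, and treating each config as a flat JSON object compared by top-level key is an assumption about the file layout.

```python
import json


def diff_configs(current, new):
    """Compare two dict configs by top-level key."""
    added = sorted(k for k in new if k not in current)
    removed = sorted(k for k in current if k not in new)
    changed = sorted(k for k in new if k in current and new[k] != current[k])
    return {"added": added, "removed": removed, "changed": changed}


def diff_config_files(current_path, new_path):
    """Load two JSON files and diff their top-level keys."""
    with open(current_path, encoding="utf-8") as f:
        current = json.load(f)
    with open(new_path, encoding="utf-8") as f:
        new = json.load(f)
    return diff_configs(current, new)
```

For example, diff_config_files("current_config.json", "new_config.json") would report which top-level entries were added, removed or changed; both file names here are placeholders.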

validate config

validate and print tag tree

from pprint import pprint

validate_result = processor.validate()
pprint(validate_result)
processor.show_tree(root_tag="tag_value",levels=1)
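
For readers who want a feel for what validating and printing a tag tree involves, here is a toy sketch that is independent of the library. It assumes tags are given as a mapping from each tag to its parent (None for a root), and that levels means how many levels below the root to print; both are assumptions, and find_unknown_parents and render_tree are hypothetical names rather than TagProcessor methods.

```python
from collections import defaultdict


def find_unknown_parents(parent_of):
    """Return tags whose parent is not itself a defined tag."""
    return sorted(
        tag
        for tag, parent in parent_of.items()
        if parent is not None and parent not in parent_of
    )


def render_tree(parent_of, root_tag, levels=1):
    """Return indented lines for root_tag and up to `levels` levels below it."""
    children = defaultdict(list)
    for tag, parent in parent_of.items():
        children[parent].append(tag)

    lines = []

    def walk(tag, depth):
        lines.append("  " * depth + tag)
        # Stop descending once the requested depth is reached.
        if depth < levels:
            for child in sorted(children[tag]):
                walk(child, depth + 1)

    walk(root_tag, 0)
    return lines


if __name__ == "__main__":
    tags = {"IT": None, "software": "IT", "hardware": "IT", "cloud": "software"}
    print(find_unknown_parents(tags))
    print("\n".join(render_tree(tags, "IT", levels=2)))
```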

save config

processor.save_to_json(os.path.join(os.getcwd(), "new_config.json"))
# create a new version and save the config to it in MaxCompute
processor.save_to_version()

update tag config for udf resource

from tagging_index.maxcompute.udf_release import UdfRelease

udf = UdfRelease()
# release the UDF only when the _udf module has been updated
udf.release_udf()
# an empty version string defaults to the latest version
udf.update_dim_resource(version="")

index generation

Note: update the tagging result in MaxCompute before generating an index.

  • define the index; refer to [index_tag_schema.json] for the schema
from tagging_index.index_generator import DemandIndexGenerator,TalentIndexGenerator
DemandIndexGenerator.get_index_schema()
  • generate index
demand_index = DemandIndexGenerator("index_tag_definition.json")
talent_index = TalentIndexGenerator("index_tag_definition.json")
# list index codes, each with its index-type suffix
print(talent_index.index_codes)
# set the year range of the index
talent_index.start_year = 2018
talent_index.end_year = 2019
# datasource version
talent_index.tag_udf_version="20240604110353.8@6@6"
# check sql script
print(demand_index.generate_sql(['IT_total']).get('IT_total'))
# generate index data and return dataframe
# talent_index.get_index_data('IT_total')
# generate index data and save it in MaxCompute; omit the index_codes param to generate all
talent_index.generate_index()
# generate the firm level total count in the datasource
talent_index.generate_index_ttl()
  • generate panel data from index data and matrix variables
from tagging_index.data_generator import PanelDataGenerator, VariableMapOther

panel_data = PanelDataGenerator()
panel_data.index_version = "index_version"
panel_data.add_index('IT_total_T')
panel_data.add_index('IT_total_D')

panel_data.add_matrix(code='Y0601b',column_name='emp_no')
panel_data.add_matrix('F100801A', 'mkt_value')
panel_data.add_other_var(VariableMapOther(
    basic_info,  # basic_info must be defined before this call
    'estbdt',
    dim_comp_id='stock_id',
    col_comp_id='stkcd',
))
sql=panel_data.get_panel_sql()
print(sql)
panel_data.get_result_df().tail(500)
panel_data.save_to_csv("panel_data.csv")
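
Conceptually, a panel dataset holds one row per firm and year, with the index values and the other variables as columns. The sketch below illustrates that shape in plain Python. It is not how PanelDataGenerator builds its SQL: build_panel is a hypothetical helper, the firm ids and values are made up for the example, and keeping every (firm, year) pair with None for missing observations is an assumption about the join.

```python
def build_panel(columns):
    """Merge {column_name: {(firm_id, year): value}} into sorted panel rows.

    Every (firm_id, year) pair seen in any column gets a row; values that
    are missing for a pair are filled with None.
    """
    keys = sorted({key for series in columns.values() for key in series})
    rows = []
    for firm_id, year in keys:
        row = {"firm_id": firm_id, "year": year}
        for name, series in columns.items():
            row[name] = series.get((firm_id, year))
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Made-up values; column names mirror the example above.
    columns = {
        "IT_total_T": {("000001", 2018): 10, ("000001", 2019): 12},
        "emp_no": {("000001", 2018): 500, ("000002", 2018): 80},
    }
    for row in build_panel(columns):
        print(row)
```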
