Experiment toolkits

Introduction to Cofutils

Cofutils provides several useful tools for experiments, such as cofrun, coftimer, cofmem, and cofwriter.

[Figure: overview of Cofutils]

Install

From PyPI

pip install cofutils

From Source

git clone https://gitee.com/haiqwa/cofutils.git
pip install .

Usage

Cof Writer

Cof Logger

The Cof logger prints user messages according to the configured print level. In your *.py:

from cofutils import coflogger
coflogger.debug("this is debug")
coflogger.info("this is info")
coflogger.warn("this is warn")
coflogger.error("this is error")

The print level is determined by the COF_DEBUG environment variable:

COF_DEBUG=WARN python main.py

The default print level is INFO. Note that in a distributed environment, only rank 0 emits log output.
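In a distributed run, the rank is typically available through the RANK environment variable set by torchrun-style launchers; a generic sketch of such rank-0 gating (not cofutils' actual code):

import os

# Generic rank-0 gating as set up by torchrun-style launchers
# (a sketch; cofutils' internal check may differ).
rank = int(os.environ.get("RANK", "0"))
if rank == 0:
    print("only rank 0 prints this")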

Cof CSV

Dump data in CSV format.

  • Get a unique CSV writer by calling cofcsv
  • Write data as a dict; you can append data anywhere, at any time
  • Save data as [name].csv under root_dir; by default, cofcsv clears its buffered data after saving
from cofutils import cofcsv

data = {'a':1, 'b':2, 'c':3}
test_csv = cofcsv('test')
test_csv.write(data)
data = {'a':4, 'b':5, 'c':6}
test_csv.write(data)

# remember to save data by calling cofcsv.save
cofcsv.save(root_dir='csv_output')
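Assuming the dict keys become the CSV header, the saved csv_output/test.csv would then contain one row per write:

a,b,c
1,2,3
4,5,6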

Cof Tb

Write data to TensorBoard.

from cofutils import coftb
coftb('test')
coftb.write({'a': 10})
coftb.write({'a': 20})
coftb.write({'a': 30})
coftb.close()

By default, the events.out.tfevents.xxx files are dumped to the coftb directory.

tensorboard --logdir coftb/

Cof Timer

The Cof timer is similar to the Timer in Megatron-LM. By default, it measures the duration of operations on the host side. To profile a CUDA program, set cuda_timer=True, which obtains execution time via CUDA events.
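For context, CUDA-event timing generally looks like the following (a generic PyTorch sketch, not cofutils' internals):

import torch

# Time a GPU operation with CUDA events (generic sketch).
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
# ... launch CUDA kernels here ...
end.record()
torch.cuda.synchronize()              # wait for the recorded events to complete
elapsed_ms = start.elapsed_time(end)  # elapsed time in milliseconds

Host-side timing, by contrast, misses work still queued on the GPU, which is why cuda_timer=True matters for CUDA code.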

It supports two log modes, selected by the timedict keyword:

  • timedict=False: format the result as a string and print it to STDOUT, which is easy for users to read
  • timedict=True: return the timing table directly as a dict

Users can also customize the time log writer by setting writer. Currently, the Cof timer supports csv, tb, info, debug, warn, and error as writer functions.

Note: calling .log to print the time resets the timer automatically.

from cofutils import coftimer, coflogger, coftb, cofcsv
import time
import torch
coftimer.set_writer(writer="warn,csv,tb", name="loop_sleep")
test_1 = coftimer('test1')
test_2 = coftimer('test2')
test_3 = coftimer('test3', cuda_timer=True)


for _ in range(3):
    test_1.start()
    time.sleep(1)
    test_1.stop()

coftimer.log(normalizer=3, timedict=False)

with test_2:
    for _ in range(3):
        time.sleep(1)

coftimer.log(normalizer=3, timedict=False)

m1 = torch.randn(1024,1024,16,device="cuda:0")
m2 = torch.randn(1024,1024,16,device="cuda:0")
with test_3:
    for _ in range(3):
        m1 = m1+m2
        m1.div_(20)
        m2.div_(10)
time_dict = coftimer.log(normalizer=3, timedict=True)
coflogger.info(time_dict)
cofcsv.save()
[2023-11-21 23:08:33.670]  [Cof INFO]: time (ms) | test1: 1001.15 | test2: 0.00
[2023-11-21 23:08:36.674]  [Cof WARNING]: time (ms) | test1: 1001.11 | test2: 0.00
[2023-11-21 23:08:39.678]  [Cof INFO]: {'test1': 0.0, 'test2': 1001.1359850565592}

Cof Memory Report

Print GPU memory states via the PyTorch CUDA API. Besides printing to the terminal, it supports dumping memory states to TensorBoard or CSV.

  • MA: memory currently allocated
  • MM: max memory allocated
  • MR: memory reserved by PyTorch
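These metrics correspond to the standard PyTorch CUDA memory queries; a minimal sketch of how such values can be obtained (cofmem's internals may differ):

import torch

def gpu_memory_report_gb(device="cuda:0"):
    # Sketch of the three metrics cofmem reports, in GB
    # (illustrative only; not cofutils' actual code).
    gb = 1024 ** 3
    return {
        "MA": torch.cuda.memory_allocated(device) / gb,      # currently allocated
        "MM": torch.cuda.max_memory_allocated(device) / gb,  # peak allocated
        "MR": torch.cuda.memory_reserved(device) / gb,       # reserved by the caching allocator
    }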

cofmem is a costly API; remember to remove it when profiling the performance of your program. As with the timer, you can set a writer for cofmem. If the writer is set to None, cofmem does nothing.

The latency of cofmem:

writer        latency
-----------   -------
None          0 ms
logger.info   0.8 ms
tensorboard   2.8 ms
csv           0.5 ms

from cofutils import cofmem, cofcsv, coftimer
import torch
cofmem.set_writer('tb,csv', name="test-1")
coftimer.set_writer('tb,csv', name="test-1")
timer = coftimer(name='test-1')
cofmem("Before Init Random Tensor")
tensor1 = torch.rand((1024, 1024, 128), dtype=torch.float32, device='cuda:0')
tensor2 = torch.rand((1024, 1024, 128), dtype=torch.float32, device='cuda:0')


with timer:
    cofmem("After Init Random Tensor")
    add_result = tensor1 + tensor2
    cofmem("After Addition")

    subtract_result = tensor1 - tensor2
    cofmem("After Subtraction")

    multiply_result = tensor1 * tensor2
    cofmem("After Multiplication")

    divide_result = tensor1 / tensor2
    cofmem("After Division")

coftimer.log()
cofcsv.save()

Note that cofmem returns a dict containing the memory report.

(deepspeed) haiqwa@gpu9:~/documents/cofutils$ python ~/test.py 
[2023-11-11 15:32:46.873]  [Cof INFO]: before xxx GPU Memory Report (GB): MA = 0.00 | MM = 0.00 | MR = 0.00
[2023-11-11 15:32:46.873]  [Cof INFO]: after xxx GPU Memory Report (GB): MA = 0.00 | MM = 0.00 | MR = 0.00

Cofrun is all you need!

Users can easily launch distributed tasks with cofrun. All you need to provide is a template bash file and a configuration JSON file.

You can see the examples in example/
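The exact config schema is defined by the files in example/; purely as a hypothetical illustration, the config could supply values that cofrun substitutes into placeholders in the template:

demo_config.json (hypothetical contents):

{
    "NNODES": 2,
    "GPUS_PER_NODE": 8,
    "BATCH_SIZE": 32
}

demo_template.sh (hypothetical contents):

#!/bin/bash
torchrun --nnodes $NNODES --nproc_per_node $GPUS_PER_NODE train.py --batch-size $BATCH_SIZE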

usage: cofrun [-h] [--file FILE] [--input INPUT] [--template TEMPLATE] [--output OUTPUT] [--test] [--nsys] [--list] [--range RANGE]

optional arguments:
  -h, --help            show this help message and exit
  --file FILE, -f FILE  config file path, default is ./config-template.json
  --input INPUT, -i INPUT
                        run experiments in batch mode. all config files are placed in input directory
  --template TEMPLATE, -T TEMPLATE
                        provide the path of template .sh file
  --output OUTPUT, -o OUTPUT
                        write execution output to specific path
  --test, -t            use cof run in test mode -> just generate bash script
  --nsys, -n            use nsys to profile your cuda programme
  --list, -l            list id of all input files, only available when input dir is provided
  --range RANGE, -r RANGE
                        support 3 formats: [int | int,int,int... | int-int], and int value must be > 0
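For example, batch mode with --range could be invoked as follows (illustrative commands based on the help text above; the configs/ directory is hypothetical):

cofrun -i configs/ -l          # list the ids of all config files
cofrun -i configs/ -r 2        # run config 2 only
cofrun -i configs/ -r 1,3,5    # run configs 1, 3 and 5
cofrun -i configs/ -r 2-4      # run configs 2 through 4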

Let's run the example:

cofrun -f demo_config.json -T demo_template.sh

The execution history of cofrun is written to history.cof.
