
torchhandy is a handy package that implements some commonly used modules and functions.

Project description

TorchHandy

Installation

    pip install torchhandy

Introduction

This is a handy implementation of some useful PyTorch modules and functions, which can be helpful for both students and researchers, especially in simplifying repetitive coding procedures.

It's worth mentioning that most of the wrapped modules are written in a way the author is (currently) familiar with, which means that some modules may be hard to use for some users, for which the author apologizes in advance.

Config

Some modules require a "config", which is actually a Python class that has the corresponding attributes. This README will be continuously updated with details about these attributes.

The simplest way to use a config is to add a Python class called Config, instantiate it into an object named "config", and then pass this object as an argument to the module - and everything will be fine. (Probably)

An example of how to write "config":

    class Config(object):
        '''
            Put the attributes you'd like to specify here
        '''
        dropout = 0.1
    
    config = Config()
    module = Module(config) # Module is a module in torchhandy that requires a "config" as an argument.

Error Checking and Debugging

Please note that, for the author's convenience, almost no error checking is performed and there are only a few failure messages, which means that if you make a wrong configuration, you may run into some really weird errors and have to read the f**king source code to solve them. Again, the author apologizes for this, and it will be improved (sooner or later).

But there's not that much to worry about - the code is so simple that you can easily understand what the author is doing. So do not hesitate to read or even modify the code for your convenience. The author sincerely believes that most users (if there are any besides the author himself) have better coding skills than the author.

Parallel

PyTorch offers simple interfaces for parallel training, and one of the most popular is distributed data parallel (DDP). However, it's still not easy for learners to use, and it's a boring task even for experts (because they have to write the same boilerplate over and over again).

So the author came up with the Parallel_Trainer. It's an extremely useful helper that simplifies DDP training. The most amazing part is that it unifies the training procedure on one GPU and on multiple GPUs: you can write the same training code and ignore the details of the GPU setup, which will be handled by the Parallel_Trainer.
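
For readers who haven't used DDP before, the sketch below shows the kind of plain-PyTorch boilerplate such a helper is meant to hide. This is standard PyTorch, not torchhandy code, and MyModel / my_dataset are placeholders:

    # Standard PyTorch DDP setup (roughly what a helper like Parallel_Trainer wraps).
    # MyModel and my_dataset are placeholders, not torchhandy objects.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler

    dist.init_process_group(backend='nccl')         # one process per GPU
    local_rank = int(os.environ['LOCAL_RANK'])      # set by the launcher (PyTorch >= 2.0)
    torch.cuda.set_device(local_rank)

    model = MyModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])     # gradients are synchronized across GPUs

    sampler = DistributedSampler(my_dataset)        # each process sees a different data shard
    loader = DataLoader(my_dataset, batch_size=32, sampler=sampler)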

Initialization

The Parallel_Trainer accepts two arguments - a boolean value called "synch" and a config. If "synch" is set to True, multiple GPUs will be used; otherwise only a single GPU will be used. For your convenience, I recommend passing it as a command-line argument, which can be done as follows:

    '''
        In this example, if you run your code with "... --synch ..." 
        (or "... -s ..."), "synch" will be set to True (otherwise False).
        Thus you can easily choose whether to use DDP when running your code.
    '''
    import argparse
    parser = argparse.ArgumentParser(description = 'yourdescription')
    parser.add_argument('--synch',
                        '-s',
                        action = 'store_true')
    args, unknown = parser.parse_known_args()
    synch = args.synch

For the "config", 2 attributes shoule be specified : n_gpus (which is useful in DDP training, indicating how many gpus you'd like to use) and device (which is useful in single device training, indicating the certain device you'd like to put your model and data on). The author suggest setting them both which will release your effort in changing from one training type to another.

Therefore, a typical initialization of a parallel trainer can be seen as follows:

    from torchhandy.Parallel import Parallel_Trainer
    import argparse

    class Config(object):
        n_gpus = 2
        device = 'cuda'

    if __name__ == '__main__':
        parser = argparse.ArgumentParser(description = 'yourdescription')
        parser.add_argument('--synch',
                            '-s',
                            action = 'store_true')
        args, unknown = parser.parse_known_args()
        synch = args.synch
        config = Config()
        trainer = Parallel_Trainer(synch, config)

GPU settings and program starting

How do you start DDP training? Suppose your main code's filename is "main.py" and your n_gpus = n; then you can start DDP training with the command:

    python3 -m torch.distributed.launch --nproc_per_node=n --master_port="29501" --use_env main.py --synch 

If you run into weird bugs related to the communication port, you can change 29501 to another port such as 29500 or 29502. If you do not want to use DDP, you can start your code like:

    python3 main.py

And if you use the Parallel_Trainer properly, no change will be needed in your main code!

You may wonder which GPUs this trainer will use. Typically it uses the first n_gpus GPUs for training. So if cuda:0 (normally the first GPU) cannot be used, is this trainer useless?

The answer is absolutely no! We recommend restricting the visible GPUs to the ones you can actually use. For example, if cuda:0 and cuda:2 are busy for some reason, you should make only the other GPUs visible to your code. There are many ways to do so; two of the most useful are setting it inside your code or on the command line. The first way is tricky and sometimes leads to weird bugs (not restricted to the Parallel_Trainer - I've met this bug everywhere, so I'm strongly against this method).
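
For completeness, the in-code approach looks roughly like the snippet below. It only takes effect if the environment variable is set before CUDA is initialized (ideally before importing torch), which is part of why it's so easy to get wrong:

    # Discouraged: must run before CUDA is initialized, otherwise it is silently ignored.
    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '1,3'   # example device ids; adjust to your machine

    import torch
    print(torch.cuda.device_count())             # now reports 2; they appear as cuda:0 and cuda:1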

The second one is also simple and clear. You can specify CUDA_VISIBLE_DEVICES=... before you run your code. Then your code can only see the GPUs you specified, renumbered as cuda:0, cuda:1, ...

For example, you can run the command:

    CUDA_VISIBLE_DEVICES=2,4,6,8 python3 -m torch.distributed.launch --nproc_per_node=2 --master_port="29501" --use_env main.py --synch

(suppose you have a good server with many GPUs on it).

And if you run nvidia-smi, you may find that cuda:2 and cuda:4 are busy, while in your code you should still call them cuda:0 and cuda:1! (This is important: if you use cuda:8 in your code, it will fail to find the correct GPU in the example above, because cuda:8 has already been renumbered to cuda:3!)
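
You can verify this renumbering from inside your code with a small sanity check (plain PyTorch, not part of torchhandy):

    # Run with CUDA_VISIBLE_DEVICES=2,4,6,8: PyTorch only sees 4 devices,
    # renumbered as cuda:0 .. cuda:3.
    import torch
    print(torch.cuda.device_count())      # 4
    x = torch.zeros(1, device='cuda:1')   # actually lives on physical GPU 4
    print(x.device)                       # cuda:1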

Of course, it's still tiring to set the GPUs every time you start the training, so you can use:

    export CUDA_VISIBLE_DEVICES=...

And all the code running in this terminal will only see the GPUs you specified above, while code running in other terminals is not affected.

However, in case of network failures or other problems, we'd like our code to keep running after we've logged out, until we explicitly stop it or it finishes on its own. The most common way to do so is using nohup:

    nohup python3 main.py

And all the output of the code will be redirected to a file called "nohup.out". However, this doesn't work when you start a DDP training. For DDP training, the simplest way is to use "tmux", which can be viewed as a separate terminal. You can use tmux as follows:

    tmux new -s [session_name]

This command creates a new tmux session, which is similar to a terminal, and everything happening in this session will not be interrupted by common failures such as closing the terminal. If you want to reconnect to this session from a normal terminal, you can use:

    tmux attach -t [session_name]

If you forget your session_name, you can call:

    tmux list-sessions

In tmux, do not scroll your mouse directly, because scrolling has a different meaning there. If you'd like to view earlier input and output, first press "ctrl+b" and then "[". Then you can scroll naturally to view the history. To quit this mode, press "ctrl+c" (please be careful: if you press ctrl+c too many times you may accidentally stop the currently running program, which is another sad story). To detach from the tmux session, press "ctrl+b" and then "d".

One thing you should notice is that in PyTorch < 2.0 the "local_rank" (the GPU index corresponding to each process) is passed on the command line, while in PyTorch >= 2.0 it's set through an environment variable. This code is written for PyTorch >= 2.0, so if you have PyTorch < 2.0, you may have to modify this part of the code before using it. (BTW, this is the only part of this package that requires PyTorch >= 2.0; other parts are safe to use with PyTorch < 2.0, probably...)
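
Concretely, the two conventions look roughly like this (standard PyTorch launcher behaviour, sketched here for illustration rather than taken from torchhandy's internals):

    import os
    import argparse

    # PyTorch >= 2.0 (and torch.distributed.launch with --use_env):
    # the launcher exports LOCAL_RANK as an environment variable.
    local_rank = int(os.environ.get('LOCAL_RANK', 0))

    # PyTorch < 2.0 without --use_env: the launcher instead passes it as a
    # command-line argument, so you would have to parse it yourself.
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=0)
    args, unknown = parser.parse_known_args()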
