[DevLog] Profiler
Overview
When preparing to launch a service built on ML models, every component of the system has to be optimized. In other words, cutting unnecessary resource usage is important, and the resources an ML model typically consumes are server time and memory. Deep-learning models in particular spend a great deal of compute on an enormous number of floating-point operations, which is why GPUs are used, and GPUs are the focus of this post.
IT-based industries already use a variety of management approaches, and on the server side there are profiling tools built for exactly this purpose. Our service, however, relies on GPUs, which repeatedly perform huge amounts of floating-point computation. So we need to track how much memory and time a single operation consumes, and also how many internal calls a single computation makes, including value copies and GPU memory allocations.
Approaches
Python Decorator
There have been several attempts to keep track of this resource usage. From a native point of view, we can manually log memory usage and runtime ourselves. Code that measures how long a user-defined operation takes to run might look like this:
import time

def check_time(func):
    # Decorator that prints how long the wrapped function takes to run.
    def wrap(*args, **kwargs):
        start = time.time()
        res = func(*args, **kwargs)
        print(f"Runtime : {time.time() - start: .9f}s")
        return res
    new_func = wrap
    new_func.__name__ = func.__name__   # keep the original function's name
    return new_func

@check_time
def forward():
    ...
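As a quick illustration, here is a minimal sketch of applying the decorator to a toy nn.Module (TinyNet is a hypothetical model, not from this post); the same pattern works for any plain Python function:

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 10)

    @check_time                 # times every forward() call
    def forward(self, x):
        return self.fc(x)

model = TinyNet()
model(torch.randn(4, 128))      # prints "Runtime : ...s"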
As the sketch above shows, this kind of decorator works on a torch-based model or even a plain user-defined function. It is certainly useful, but we need more: we also want to know how much memory the decorated function allocates and deallocates. When it comes to GPU usage, most of us are familiar with
!nvidia-smi
But we would like to run this command whenever we want, even while the rest of the code is still executing. Python's subprocess module lets us do exactly that. Here is an example.
import os
import subprocess
...

def get_used_memory():
    # Ask nvidia-smi for the used memory of each GPU (one line per GPU, in MiB).
    used = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE,
    ).stdout.decode().split('\n')
    return used[:gpu_info['gpu_nums']]  # gpu_info is assumed to hold the GPU count
To use this approach, we need to know the exact command we want to run (here, nvidia-smi) and how to handle its output: subprocess returns the output as a string, so the parsing step can be adapted to whatever line and format you are interested in.
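For example, a minimal parsing sketch (the helper name get_used_memory_mib and the num_gpus argument are my own additions) that turns the returned strings into integers:

def get_used_memory_mib(num_gpus=1):
    # Same nvidia-smi query as above, parsed into one integer (MiB) per GPU.
    out = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE,
    ).stdout.decode()
    return [int(line) for line in out.splitlines()[:num_gpus]]

print(get_used_memory_mib())   # e.g. [1234]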
Now we can measure how long a specific function takes and how much GPU memory is in use once it has run. But we still cannot see the internal function calls that PyTorch makes; to do that by hand, we would have to instrument every function, including the ones defined inside the package itself!
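Before turning to PyTorch's own profiler, here is a hedged sketch that wraps both measurements into one decorator (check_time_and_gpu is my own name; it reuses the get_used_memory_mib sketch above):

def check_time_and_gpu(func):
    # Report runtime and the per-GPU change in used memory around a single call.
    def wrap(*args, **kwargs):
        before = get_used_memory_mib()
        start = time.time()
        res = func(*args, **kwargs)
        delta = [after - b for b, after in zip(before, get_used_memory_mib())]
        print(f"Runtime : {time.time() - start: .9f}s, GPU delta (MiB): {delta}")
        return res
    wrap.__name__ = func.__name__
    return wrap

Note that nvidia-smi reports memory held by PyTorch's CUDA caching allocator, so repeated calls may show a delta of zero; torch.cuda.memory_allocated() is a finer-grained alternative if you only care about tensor memory.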
torch.profiler
Good news: PyTorch ships with its own profiler! It can even count how many times every internal function is called for each operation. You define the code block to measure, and time and memory usage are reported separately. Here is an example.
from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True, profile_memory=True, with_stack=True) as prof:
    with record_function("model_inference"):
        model(INPUTS)

print(prof.key_averages().table(sort_by="cuda_memory_usage"))
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::matmul 0.01% 21.016ms 45.66% 74.798s 84.137ms 0.000us 0.00% 28.089s 31.596ms 0 b 0 b 1871.44 Gb -462.38 Gb 889
aten::add 0.05% 86.562ms 37.14% 60.850s 50.708ms 4.352s 9.89% 4.352s 3.626ms 0 b 0 b 1721.71 Gb 1721.71 Gb 1200
aten::mm 0.04% 70.258ms 34.43% 56.401s 93.845ms 23.016s 52.30% 23.016s 38.296ms 0 b 0 b 1387.12 Gb 1387.12 Gb 601
aten::sub 0.03% 50.381ms 8.67% 14.197s 14.913ms 2.524s 5.74% 2.524s 2.652ms 0 b 0 b 998.07 Gb 998.07 Gb 952
aten::mul 0.02% 36.983ms 0.04% 58.026ms 63.416us 2.526s 5.74% 2.526s 2.760ms 0 b 0 b 998.07 Gb 998.07 Gb 915
aten::clone 0.05% 77.992ms 0.59% 973.185ms 75.593us 0.000us 0.00% 1.753s
...
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 163.833s
Self CUDA time total: 44.006s
As the sample above shows, the profiler provides a wealth of information. We can now use it to understand a model's architecture, to see how a newer model improves on outdated code and reduces memory usage, and even to detect unnecessary copies that particular lines of code trigger over and over.
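For instance, to trace heavy copies back to the lines of code that trigger them, the same prof object can be queried again; this is a hedged sketch, and the file name trace.json is arbitrary. key_averages(group_by_stack_n=...) uses the Python stacks recorded thanks to with_stack=True, and export_chrome_trace writes a timeline viewable in chrome://tracing.

# Group stats by the top 5 frames of the recorded call stack, so each row
# shows which source lines triggered ops such as aten::clone or aten::copy_.
print(prof.key_averages(group_by_stack_n=5)
          .table(sort_by="self_cuda_memory_usage", row_limit=10))

# Export a timeline that can be opened in chrome://tracing (or Perfetto).
prof.export_chrome_trace("trace.json")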