Profiling Theano functionΒΆ
Note
This method replace the old ProfileMode. Do not use ProfileMode anymore.
Besides checking for errors, another important task is to profile your
code. For this, you can use Theano flags and/or parameters which are
to be passed as an argument to theano.function
.
The simplest way to profile Theano functions is to use the Theano flags described below. When the process exits, they will cause the information to be printed on stdout.
Using the ProfileMode is a three-step process.
Enabling the profiler is pretty easy. Just use the Theano flag
config.profile
.
To enable the memory profiler use the Theano flag:
config.profile_memory
in addition to config.profile
.
To enable the profiling of Theano optimization phase, use the Theano
flag: config.profile_optimizer
in addition to
config.profile
.
You can use the Theano flags profiling.n_apply
,
profiling.n_ops
and profiling.min_memory_size
to
modify the quantify of information printed.
The profiler will output one profile per Theano function and profile that is the sum of the printed profile. Each profile contains 4 sections: global info, class info, Ops info and Apply node info.
In the global section, the “Message” is the name of the Theano
function. theano.function() has an optional parameter name
that
defaults to None. Change it to something else to help you profile many
Theano functions. In that section, we also see the number of time the
function was called (1) and the total time spent in all those
calls. The time spent in Function.fn.__call__ and in thunks is useful
to help understand Theano overhead.
Also, we see the time spent in the two parts of the compilation process: optimization(modify the graph to make it more stable/faster) and the linking (compile c code and make the Python callable returned by function).
The class, Ops and Apply nodes sections are the same information: information about the Apply node that ran. The Ops section takes the information from the Apply section and merge the Apply nodes that have exactly the same op. If two Apply nodes in the graph have two Ops that compare equal, they will be merged. Some Ops like Elemwise, will not compare equal, if their parameters differ (the scalar being executed). So the class section will merge more Apply nodes then the Ops section.
Here is an example output when we disable some Theano optimizations to give you a better idea of the difference between sections. With all optimizations enabled, there would be only one op left in the graph.
Note
To profile the peak memory usage on the GPU you need to do:
* In the file theano/sandbox/cuda/cuda_ndarray.cu, set the macro
COMPUTE_GPU_MEM_USED to 1.
* Then call theano.sandbox.cuda.theano_allocated()
It return a tuple with two ints. The first is the current GPU
memory allocated by Theano. The second is the peak GPU memory
that was allocated by Theano.
Do not always enable this, as this slowdown memory allocation and free. As this slowdown the computation, this will affect speed profiling. So don’t use both at the same time.
to run the example:
THEANO_FLAGS=optimizer_excluding=fusion:inplace,profile=True python doc/tutorial/profiling_example.py
The output:
Function profiling
==================
Message: None
Time in 1 calls to Function.__call__: 5.698204e-05s
Time in Function.fn.__call__: 1.192093e-05s (20.921%)
Time in thunks: 6.198883e-06s (10.879%)
Total compile time: 3.642474e+00s
Theano Optimizer time: 7.326508e-02s
Theano validate time: 3.712177e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 9.584920e-01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 0.000s 2.07e-06s C 3 3 <class 'theano.tensor.elemwise.Elemwise'>
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
65.4% 65.4% 0.000s 2.03e-06s C 2 2 Elemwise{add,no_inplace}
34.6% 100.0% 0.000s 2.15e-06s C 1 1 Elemwise{mul,no_inplace}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
50.0% 50.0% 0.000s 3.10e-06s 1 0 Elemwise{add,no_inplace}(x, y)
34.6% 84.6% 0.000s 2.15e-06s 1 2 Elemwise{mul,no_inplace}(TensorConstant{(1,) of 2.0}, Elemwise{add,no_inplace}.0)
15.4% 100.0% 0.000s 9.54e-07s 1 1 Elemwise{add,no_inplace}(Elemwise{add,no_inplace}.0, z)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)