Extending Theano with a GPU Op¶
Note
This covers the gpuarray back-end for the GPU.
This tutorial covers how to extend Theano with an op that offers a GPU implementation. It assumes you are familiar with how to write new Theano ops. If that is not the case you should probably follow the Creating a new Op: Python implementation and Extending Theano with a C Op sections before continuing on.
Writing a new GPU op can be done in Python for some simple tasks, but will usually done in C to access the complete API and avoid paying the overhead of a Python function call.
Dealing With the Context¶
One of the major differences with GPU ops is that they require a context (a.k.a. device) to execute. Most of the time you can infer the context to run on from your inputs. There is a way for the user to transfer things between contexts and to tag certain variables for transfer. It might also be the case that your inputs are not all from the same context and you would have to choose which one to run on.
In order to support all of those options and have a consistent
interface, theano.gpuarray.basic_ops.infer_context_name()
was
written. An example usage is below:
def make_node(self, a, b, c):
ctx = infer_context_name(a, b, c)
a = as_gpuarray_variable(a, ctx)
b = as_gpuarray_variable(b, ctx)
c = as_gpuarray_variable(c, ctx)
return Apply(self, [a, b, c], [a.type()])
In this example the Op takes three inputs, all on the GPU. In case
one or more of your inputs is not supposed to be on the GPU, you
should not pass it to infer_context_name()
or call
as_gpuarray_variable()
on it.
Also note that theano.gpuarray.basic_ops.as_gpuarray_variable()
takes context_name
as a mandatory parameter. This is because it’s
not enough to know you want the value to be on the GPU, you also want
to know which GPU to put it on. In almost all cases, you can pass in
the return value of infer_context_name()
there.
If you also need the context during runtime (for example to allocate the output), you can use the context of one of your inputs to know which one to use. Here is another example:
def perform(self, node, inputs, output_storage):
A, B = inputs
C, = output_storage
C[0] = pygpu.empty([A.shape[0], B.shape[1]], dtype=A.dtype, A.context)
pygpu.blas.gemm(1, A, B, 0, C, overwrite_c=True)
Finally if you require the context before perform, such as during make_thunk() to initialize kernels and such, you can access the context of your inputs through the type of the variables:
def make_thunk(self, node, storage_map, compute_map, no_recycling):
ctx = node.inputs[0].type.context
Note that GpuArrayType
objects also have a context_name
attribute which is the symbolic equivalent of context
. It can’t
be used for calls to pygpu or libgpuarray, but it should be used for
theano operations and variables.
The last place where you might need the context is in the C
initialization code. For that you will have to use the params. The params type should be
theano.gpuarray.type.gpu_context_type
and the params object
should be a context object from one of your input variables:
def get_params(self, node):
return node.inputs[0].type.context
If you don’t have any input variables on the GPU you can follow the
the example of GpuFromHost
or GpuEye
. This is not a case that you
should encounter often, so it will not be covered further.
Defining New Kernels¶
If your op needs to do some transformation on the data, chances are
that you will need to write a new kernel. The best way to do this is
to leverage GpuKernelBase
(or CGpuKernelBase
if you want to use the
COp
functionality).
For plain GpuKernelBase
, you have to define a
method called gpu_kernels
which returns a list of Kernel
objects. You can define as many
kernels as you want for a single op. An example would look like
this:
def gpu_kernels(self, node, name):
code = """
KERNEL void k(GLOBAL_MEM ga_double *a, ga_size n, ga_size m) {
ga_size nb = n < m ? n : m;
for (ga_size i = LID_0; i < nb; i += LDIM_0) {
a[i*m + i] = 1;
}
}"""
return [Kernel(
code=code, name="k",
params=[gpuarray.GpuArray, gpuarray.SIZE, gpuarray.SIZE],
flags=Kernel.get_flags('float64'))]
If you want to use COp
, then you should use CGpuKernelBase
instead. It adds a new section to the parsed files whose tag is
kernels
. Inside that section you can define some kernels with
#kernel name:params:flags
.
Here name
is the name of the kernel function in the following
code, params
is a comma-separeted list of numpy typecode names.
There are three exceptions for size_t
which should be noted as
size
, ssize_t
which should be noted as ssize
and a pointer
which should be noted as *
.
flags
is a |
-separated list of C kernel flag values (can be
empty). The same kernel definition as above would look like this with
CGpuKernelBase
:
#section kernels
#kernel k : *, size, size : GA_USE_DOUBLE
KERNEL void k(GLOBAL_MEM ga_double *a, ga_size n, ga_size m) {
ga_size nb = n < m ? n : m;
for (ga_size i = LID_0; i < nb; i += LDIM_0) {
a[i*m + i] = 1;
}
}
The second method is to handle the kernel compilation and cache on your own. This is not recommended because there are lots of details to pay attention to that can cripple your performance if not done right, which GpuKernelBase handles for you. But if you really want to go this way, then you can look up the C API for kernels in libgpuarray.
In any case you will need to call your compiled kernel with some data,
in most cases in your c_code()
method. This is done by using
the provided wrapper function. An example calling the above kernel
would be:
size_t ls, gs;
size_t dims[2];
// ...
ls = 1;
gs = 256;
err = k_call(1, &gs, &ls, 0, input->ga.data, dims[0], dims[1]);
// ...
The name of the wrapper function depends on the name you passed to
Kernel()
when you declared it (or the name in your #kernel
statement). It defaults to name + ‘_call’.
For other operations in the C code you should refer to the libgpuarray documentation.
A Complete Example¶
This is a complete example using both approches for a implementation of the Eye operation.
GpuKernelBase¶
Python File¶
class GpuEye(GpuKernelBase, Op):
"""
Eye for GPU.
"""
__props__ = ('dtype', 'context_name')
_f16_ok = True
def __init__(self, dtype=None, context_name=None):
if dtype is None:
dtype = config.floatX
self.dtype = dtype
self.context_name = context_name
def get_params(self, node):
return get_context(self.context_name)
def make_node(self, n, m, k):
n = tensor.as_tensor_variable(n)
m = tensor.as_tensor_variable(m)
k = tensor.as_tensor_variable(k)
assert n.ndim == 0
assert m.ndim == 0
assert k.ndim == 0
otype = GpuArrayType(dtype=self.dtype,
broadcastable=(False, False),
context_name=self.context_name)
# k != 0 isn't implemented on the GPU yet.
assert tensor.get_scalar_constant_value(k) == 0
return Apply(self, [n, m], [otype()])
def infer_shape(self, node, in_shapes):
out_shape = [node.inputs[0], node.inputs[1]]
return [out_shape]
def grad(self, inp, grads):
return [grad_undefined(self, i, inp[i])
for i in xrange(3)]
def gpu_kernels(self, node, name):
code = """
KERNEL void eye(GLOBAL_MEM %(ctype)s *a, ga_size n, ga_size m) {
ga_size nb = n < m ? n : m;
for (ga_size i = LID_0; i < nb; i += LDIM_0) {
a[i*m + i] = %(write_a)s(1);
}
}""" % dict(ctype=pygpu.gpuarray.dtype_to_ctype(self.dtype),
name=name, write_a=write_w(self.dtype))
return [Kernel(
code=code, name="eye",
params=[gpuarray.GpuArray, gpuarray.SIZE, gpuarray.SIZE],
flags=Kernel.get_flags(self.dtype),
objvar='k_eye_' + name)]
def c_code(self, node, name, inp, out, sub):
n, m = inp
z, = out
fail = sub['fail']
ctx = sub['params']
typecode = pygpu.gpuarray.dtype_to_typecode(self.dtype)
sync = bool(config.gpuarray.sync)
kname = self.gpu_kernels(node, name)[0].objvar
s = """
size_t dims[2] = {0, 0};
size_t ls, gs;
int err;
dims[0] = ((dtype_%(n)s*)PyArray_DATA(%(n)s))[0];
dims[1] = ((dtype_%(m)s*)PyArray_DATA(%(m)s))[0];
Py_CLEAR(%(z)s);
%(z)s = pygpu_zeros(2, dims,
%(typecode)s,
GA_C_ORDER,
%(ctx)s, Py_None);
if (%(z)s == NULL) {
%(fail)s
}
ls = 1;
gs = 256;
err = eye_call(1, &gs, &ls, 0, %(z)s->ga.data, dims[0], dims[1]);
if (err != GA_NO_ERROR) {
PyErr_Format(PyExc_RuntimeError,
"gpuarray error: kEye: %%s. n%%lu, m=%%lu.",
GpuKernel_error(&%(kname)s, err),
(unsigned long)dims[0], (unsigned long)dims[1]);
%(fail)s;
}
if(%(sync)d)
GpuArray_sync(&%(z)s->ga);
""" % locals()
return s
def c_code_cache_version(self):
return (6,)
CGpuKernelBase¶
Python File¶
class GpuEye(CGpuKernelBase, Op):
"""
Eye for GPU.
"""
__props__ = ('dtype', 'context_name')
_f16_ok = True
def __init__(self, dtype=None, context_name=None):
if dtype is None:
dtype = config.floatX
self.dtype = dtype
self.context_name = context_name
CGpuKernelBase.__init__(self, ['tstgpueye.c'],
'APPLY_SPECIFIC(tstgpueye)')
def get_params(self, node):
return get_context(self.context_name)
def c_headers(self):
return ['<gpuarray/types.h>', '<gpuarray/kernel.h>']
def make_node(self, n, m):
n = tensor.as_tensor_variable(n)
m = tensor.as_tensor_variable(m)
assert n.ndim == 0
assert m.ndim == 0
otype = GpuArrayType(dtype=self.dtype,
broadcastable=(False, False),
context_name=self.context_name)
return Apply(self, [n, m], [otype()])
def infer_shape(self, node, in_shapes):
out_shape = [node.inputs[0], node.inputs[1]]
return [out_shape]
def grad(self, inp, grads):
return [grad_undefined(self, i, inp[i])
for i in xrange(2)]
def get_op_params(self):
return [('TYPECODE', str(dtype_to_typecode(self.dtype)))]
tstgpueye.c
¶
#section kernels
#kernel eye : *, size, size :
/* The eye name will be used to generate supporting objects. The only
you probably need to care about is the kernel object which will be
named 'k_' + <the name above> (k_eye in this case). This name also
has to match the kernel function name below.
*/
KERNEL void eye(GLOBAL_MEM DTYPE_o0 *a, ga_size n, ga_size m) {
ga_size nb = n < m ? n : m;
for (ga_size i = LID_0; i < nb; i += LDIM_0) {
a[i*m + i] = 1;
}
}
#section support_code_struct
int APPLY_SPECIFIC(tstgpueye)(PyArrayObject *n, PyArrayObject *m,
PyGpuArrayObject **z, PyGpuContextObject *ctx) {
size_t dims[2] = {0, 0};
size_t ls, gs;
void *args[3];
int err;
dims[0] = ((DTYPE_INPUT_0 *)PyArray_DATA(n))[0];
dims[1] = ((DTYPE_INPUT_1 *)PyArray_DATA(m))[0];
Py_XDECREF(*z);
*z = pygpu_zeros(2, dims,
TYPECODE,
GA_C_ORDER,
ctx, Py_None);
if (*z == NULL)
return -1;
ls = 1;
gs = 256;
/* The eye_call name comes from the kernel declaration above. */
err = eye_call(1, &gs, &ls, 0, (*z)->ga.data, dims[0], dims[1]);
if (err != GA_NO_ERROR) {
PyErr_Format(PyExc_RuntimeError,
"gpuarray error: kEye: %s. n%lu, m=%lu.",
GpuKernel_error(&k_eye, err),
(unsigned long)dims[0], (unsigned long)dims[1]);
return -1;
}
return 0;
}
Wrapping Exisiting Libraries¶
PyCUDA¶
For things in PyCUDA (or things wrapped with PyCUDA), we usually need to create a PyCUDA context. This can be done with the following code:
with gpuarray_cuda_context:
pycuda_context = pycuda.driver.Context.attach()
If you don’t need to create a context, because the library doesn’t require it, you can also just use the pygpu context and a with statement like above for all your code which will make the context the current context on the cuda stack.
GpuArray objects are compatible with PyCUDA and will expose the necessary interface so that they can be used in most things. One notable exception is PyCUDA kernels which require native objects. If you need to convert a pygpu GpuArray to a PyCUDA GPUArray, this code should do the trick:
assert pygpu_array.flags['IS_C_CONTIGUOUS']
pycuda_array = pycuda.gpuarray.GPUArray(pygpu_array.shape,
pygpu_array.dtype,
base=pygpu_array,
gpudata=(pygpu_array.gpudata +
pygpu_array.offset))
As long as the computations happen on the NULL stream there are no special considerations to watch for with regards to synchronization. Otherwise, you will have to make sure that you synchronize the pygpu objects by calling the .sync() method before scheduling any work and synchronize with the work that happends in the library after all the work is scheduled.