---
execution:
  timeout: 30
jupytext:
  cell_metadata_json: true
  encoding: '# -*- coding: utf-8 -*-'
  formats: md:myst,ipynb
  notebook_metadata_filter: execution
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.4
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

(sec:GPUAcceleration)=
GPU Acceleration
================

If you have access to an [Nvidia][] GPU, then you can take advantage of GPU acceleration
in your code.  The basic idea here is to use a `State` that stores its data on the GPU.
To do this, use the code in {mod}`gpe.gpu` which should contain minimal modifications
needed to use the corresponding code on a GPU.  This is often as simple inheriting from
either {mod}`gpe.bec.StateBase` for CPU states and {mod}`gpe.gpu.bec.StateBase` for GPU
states.  We do this by putting the common code into a mixin class `StateMixin` and then
mixing this with the appropriate base class.  We will demonstrate this below.

The idea is that code in the core evolution loop (triggered by `compute_dy_dt` for ABM
evolvers or `apply_V` and `apply_K` for SplitOperator evolvers) will work with GPU
arrays: code called by this chain must be careful not to inadvertently get data from the
GPU, which is expensive and will kill performance gains.

:::{margin}
This used to be called "sundering" and `get_Vext_GPU` was spelled `_get_Vext_`.
:::
To keep the code useful for users who like to plot, etc. we define mirroring methods
that will fetch data from the GPU.  Thus, expect to see mirrored methods like
`get_Vext_GPU()` and `get_Vext()`.  The former does the calculation on the GPU, while
the latter is a simple wrapper that fetches this data after the calculation is done.
This can be done automatically by decorating your class with
{func}`gpe.utils._GPU.add_non_GPU_methods` or manually with the
{func}`gpe.utils._GPU.from_GPU` decorator.


```{code-cell}
import numpy as np
import matplotlib.pyplot as plt

import gpe.bec, gpe.gpu.bec, gpe.minimize

from gpe.utils import ExperimentBase, get_good_N, _GPU

u = gpe.bec.u

@_GPU.add_non_GPU_methods
class StateMixin:
    healing_length_micron = 1.0 # Healing length in microns
    dx_healing_length = 0.5     # Maximum lattice spacing in units of the healing length
    Lx_micron = 20.0            # Box length in units of microns
    V0_mu = -1.0                # Depth of the potential in units of the chemical potential.
    sigma_healing_length = 1.0  # Width of the potential in healing lengths
    n0_micron = 1.0             # Background density
    
    def __init__(self, **kw):
        for key in list(kw):
            if hasattr(self, key):
                setattr(self, key, kw.pop(key))

        m = u.m_Rb87
        hbar = u.hbar
        self.healing_length = self.healing_length_micron * u.micron
        Lx = self.Lx_micron * u.micron
        dx = self.dx_healing_length * self.healing_length
        n0 = self.n0_micron / u.micron
        
        # Select a good Nx that will perform well - small prime factors.
        Nx = get_good_N(Lx / dx)

        # Calculate the chemical potential and g
        mu = hbar**2 / 2 / m / self.healing_length
        g = mu / n0
        kw.update(hbar=hbar, m=m, Lxyz=(Lx,), Nxyz=(Nx,), mu=mu, g=g)
        super().__init__(**kw)

    def init(self):
        """Perform any initializations before the simulation begins."""
        # Units
        self.V0 = self.V0_mu * self.mu
        self.sigma = self.sigma_healing_length * self.healing_length
        super().init()

    def get_Vext_GPU(self):
        x, = self.get_xyz_GPU()
        if (self.initializing or self.t < 0):
            V0 = self.V0
        else:
            V0 = 0.0

        V_ext = V0 * np.exp(-(x/self.sigma)**2/2)
        
        if (self.initializing or self.t < 0) and getattr(self, "mu", None):
            # If we are initializing, we need to subtract mu to get the initial state
            V_ext -= self.mu
        return V_ext
    
    def minimize(self):
        m = gpe.minimize.MinimizeState(self)
        s = m.minimize()
        self.set_psi(s.get_psi())


class StateCPU(StateMixin, gpe.bec.StateBase):
    pass


class StateGPU(StateMixin, gpe.gpu.bec.StateBase):
    pass


s = StateCPU()
s.minimize()
s.plot()
```

```{code-cell}
:tags: [hide-cell]
# Should put this in a file

from pytimeode.evolvers import EvolverABM
from mmfutils.contexts import FPS

def plot_states(states):
    # Always plot dimensionless quantities...
    t_unit = u.ms
    x_unit = u.micron
    n_unit = 1 / x_unit

    plot_Nlr = hasattr(states[0], 'get_Nlr')
    
    s = states[-1]
    ns = np.array([s.get_density() for s in states])
    Es = np.array([s.get_energy() for s in states])
    Ns = np.array([s.get_N() for s in states])
    if plot_Nlr:
        Nlr = np.transpose([s.get_Nlr() for s in states])
    ts = np.array([s.t for s in states])
    xs = np.ravel(s.xyz[0])

    kw = dict(sharex=True,  gridspec_kw=dict(hspace=0))
    if plot_Nlr:
        fig, axs = plt.subplots(3, 1, height_ratios=(0.2, 0.5, 1), **kw)
    else:
        fig, axs = plt.subplots(2, 1, height_ratios=(0.2, 1), **kw)

    ax = axs[0]
    E, N = Es[0], Ns[0]
    ax.plot(ts / t_unit, Es / E - 1, label="$E$")
    ax.plot(ts / t_unit, Ns / N - 1, label="$N$")
    ax.set(ylabel="Rel. change", title=f"{E=:.4g}, {N=:.4g}")
    ax.legend()

    if plot_Nlr:
        ax = axs[1]
        Nl, Nr = Nlr
        ax.plot(ts / t_unit, Nl/Nl[0] - 1, label="$N_l$")
        ax.plot(ts / t_unit, Nr/Nl[0] - 1, label="$N_r$")
        ax.set(ylabel="Rel. change in $N_{lr}$")
        ax.legend()

    ax = axs[-1]
    mesh = ax.pcolormesh(ts / t_unit, xs / x_unit, ns.T / n_unit)

    # Label your axes...
    ax.set(ylabel="$x$ [μm]", xlabel="$t$ [ms]")

    # ... and your colorbars.
    fig.colorbar(mesh, ax=axs.ravel().tolist(), label="$n$ [1/μm]")


def evolve(s, T_ms=20, Nt=200, dt_t_scale=0.4, timeout_s=60):
    """Evolve the state s to time T and plot.
    
    Arguments
    ---------
    s : State
        Initial state.
    T_ms : float
        Final time in ms.
    Nt : int
        Number of frames to plot.
    dt_t_scale : float
        Step size for evolve in units of t_scale = hbar/maximum resolvable energy.
    timeout_s : float
         Timeout (in seconds).
    """
    T = T_ms * u.ms
    dT = T / Nt

    # Note: From experience, if you have chosen dx correctly, you should
    # get highly converged results if you use a time-step dt between 0.2 and 0.5 times
    # s.t_scale.  If you can go much larger, you are probably wasting effort.  If need
    # to go much smaller, you probably have an incorrect lattice spacing.
    dt = dt_t_scale*s.t_scale
    
    # Make sure we can take integer steps...
    steps = max(int(np.ceil(dT / dt)), 2)
    dt = dT / steps
    
    ev = EvolverABM(s, dt=dt)
    states = [ev.get_y()]
    for frame in FPS(frames=Nt, timeout=timeout_s):
        ev.evolve(steps)
        states.append(ev.get_y())


    plot_states(states)
    return states
```

```{code-cell}
s = StateCPU()
%time s.minimize()
%time evolve(s);
```

:::{warning}
The following GPU code will only run if you build this documentation on a gpu-enabled computer!
:::

```{code-cell}
if gpe.gpu.cupy is None:
    print("No GPU support! Skipping")
else:
    s = StateGPU()
    %time s.minimize()
    %time evolve(s);
```

In this particular code, the GPU code is about 4 times slower than the CPU code: 1D
problems do not speed up very well.

## Details

In order to promote code reuse, we try not to write different code for CPU and GPU
methods.  To this end, special methods used in the core computation loop are carefully
constructed to work on both CPU and GPU targets.  With tools like [cupy][], this works
quite well, but there are still some subtle points:

1. One needs to use the appropriate library -- `np=numpy` for CPU code and `cp=cupy` for
   GPU code.  To do this, we provide a reference called `self.xp` that will point to the
   appropriate version.  This needs to be initialized properly.
2. GPU methods should generally return GPU arrays, but may be called by the user for
   plotting and analysis.  In the latter case, the method should return a CPU array.
   Some mechanism needs to be used to indicate if the result should be left on the GPU
   or fetched to the CPU.
3. Some code inevitably must be specialized.  Again, some mechanism is needed to signal
   which code is which.


The new strategy is to decorate GPU methods.  These methods will have a boolean argument
`accel` allowing the method to customize the behaviour (if needed), and the decorator will
effect the necessary conversions if a GPU method is called from CPU code.

:::{admonition} Limitations
The main limitation with this new strategy is that there is no easy way to dispatch on
attribute access: `xyz` must be replaced by `get_xyz(accel=True)`.

Another issue is that now all core functions are decorated: this has two downsides:
1. Debugging becomes a real pain as we must now traverse through the decorators.
2. There is now a potential performance hit since we are using decorators.

:::

A couple of comments:

1. Methods in the core evolution loop should always try to use accelerated code if
   possible.  Thus, they should likely call `self.get_xyz(accel=True)` rather than
   `self.get_xyz(accel=accel)`.  The latter would trigger multiple attempted conversions
   of the data, which is probably not a problem, but if the data naturally sits on the
   GPU, then the internal calculations should be done there.  The final result will be
   converted by the decorator.
2. This provides some sort of check: if `get_xyz()` is implemented without an `accel`
   argument, then one will get an error.
3. Private internal methods need not be decorated: they can just return the appropriate
   data in the appropriate place.


```{code-cell}
import functools


class GPUArray(np.ndarray):
    """Mock GPU array."""
    def __new__(cls, array):
        return super().__new__(
            cls, shape=array.shape, dtype=array.dtype, buffer=array)
        
    gpu = True
    
    def get(self):
        """Mock method to get the data from the GPU to the CPU."""
        return np.array(self)
    

def accel(f):
    functools.wraps(f)
    def wrapped_f(*v, **kw):
        accel = kw.get('accel', False)
        kw.update(accel=accel)
        res = f(*v, **kw)
        if not accel:
            res = res.get()
        return res
    return wrapped_f
    
    
class A:
    def __init__(self):
        self._xyz = GPUArray(np.linspace(0, 1))
    
    @accel
    def get_xyz(self, accel):
        return self._xyz
        
    @accel
    def get_V(self, accel):
        xyz = self.get_xyz(accel=True)
        x = xyz[0]
        return x**2


a = A()
print(a.get_xyz())
print(a.get_xyz(accel=True))
```

## Old Strategy

Our original strategy was to define "sundered" methods: these start and end with a
single underscore -- e.g. `_get_Vext_` -- indicating that the method is needed in
the core computation loop, and should work on the GPU if available.  These methods use
and return GPU arrays if used in that context.

For each of these methods, a matching unsundered method -- e.g. `get_Vext` -- is
defined which appropriately fetches and converts arrays from the GPU to the CPU.

:::{margin}
The general design decisions here are to optimize performance for GPU access, allowing
slightly degraded performance for the CPU access through unsundered methods, which are
generally not needed in core loops.
:::
The developer should define or modify only the sundered method (`_get_Vext_`): the
corresponding unsundered methods are defined on the fly by
{class}`gpe.utils.AsNumpyMixin` which do the appropriate conversions.  This class also
performs some checks, for example, raising exceptions if both sundered and unsundered
methods are defined (generally an error).

This sundering strategy allowed us to use attributes like `_xyz_` (and the corresponding
CPU-based `xyz`).

[Nvidia]: <https://www.nvidia.com>
[cupy]: <https://cupy.dev/>
