Skip to content

Benchmarking ALSSM ND Backends: numpy, lfilter, jit, and GPU [code910.1]

Measures the throughput (in mega-samples per second, MS/s) of the RLSAlssm filter using an NDCompositeCost for the numpy, lfilter, and (if available) jit and cupy backends.

An NDCompositeCost over a 2-D signal (image) is built using the same separable polynomial basis as in example-ex801.0-Text-Recognition.py.

Each backend processes a (K1 x K2) image and results are reported in mega-samples per second (MS/s = K1 * K2 * n_exe / proc_time * 1e-6).

Notes on fair GPU timing

  • CuPy kernel launches are asynchronous, so the device is synchronised around each timed call (cp.cuda.Device().synchronize()).
  • The first GPU call pays one-off allocation/compilation costs; a warm-up call is issued before timing starts.

Plot

Plot

Hardware (locally generated figures)

These figures were generated locally on the hardware listed below and cannot be reproduced on the CI build server.

Component Specification
GPU NVIDIA GeForce RTX A2000
CPU Intel Core i7-12700H
RAM 32 GB
OS Ubuntu 22.04 LTS
Date 2026-06-23

Code

"""
Benchmarking ALSSM ND Backends: numpy, lfilter, jit, and GPU [code910.1]
=========================================================================

Measures the throughput (in mega-samples per second, MS/s) of the
[`RLSAlssm`][lmlib.statespace.rls.RLSAlssm] filter using an
[`NDCompositeCost`][lmlib.statespace.cost.NDCompositeCost] for the ``numpy``,
``lfilter``, and (if available) ``jit`` and ``cupy`` backends.

An [`NDCompositeCost`][lmlib.statespace.cost.NDCompositeCost] over a 2-D signal (image) is built using
the same separable polynomial basis as in example-ex801.0-Text-Recognition.py.

Each backend processes a (K1 x K2) image and results are reported in
mega-samples per second (MS/s = K1 * K2 * n_exe / proc_time * 1e-6).

Notes on fair GPU timing
------------------------
* CuPy kernel launches are asynchronous, so the device is synchronised around
  each timed call (``cp.cuda.Device().synchronize()``).
* The first GPU call pays one-off allocation/compilation costs; a warm-up call
  is issued before timing starts.
"""
import timeit
import numpy as np
import matplotlib.pyplot as plt
import lmlib as lm

lm.WARNING_NOT_STEADY_STATE = False

# ---- Signal & model setup --------------------------------------------------
K1 = 200   # image rows
K2 = 200   # image columns
Y  = np.random.randn(K1, K2)
n_exe = 2  # number of filter executions per timing measurement

g          = 100
l_side     = 35
poly_degree = 2

alssm_l = lm.AlssmPolyLegendre(poly_degree=poly_degree, a_seg=-l_side, b_seg=-1)
alssm_r = lm.AlssmPolyLegendre(poly_degree=poly_degree, a_seg=0, b_seg=l_side)
seg_l   = lm.Segment(a=-l_side, b=-1, direction=lm.FW, g=g)
seg_r   = lm.Segment(a=0, b=l_side, direction=lm.BW, g=g)
F       = [[1, 0], [0, 1]]

cost_d1 = lm.CompositeCost([alssm_l, alssm_r], [seg_l, seg_r], F)
cost_d2 = lm.CompositeCost([alssm_l, alssm_r], [seg_l, seg_r], F)
nd_cost = lm.NDCompositeCost([cost_d1, cost_d2])

# ---- Backend selection -----------------------------------------------------
backends = ['numpy', 'lfilter']
if lm.is_backend_available('jit'):
    backends.append('jit')
if lm.is_backend_available('cupy'):
    backends.append('cupy')
    import cupy as cp
    lm.set_gpu_dtype(np.float32)
else:
    print("cupy backend not available — benchmarking CPU backends only.")

color_map = {'numpy': 'b', 'lfilter': 'k', 'jit': 'r', 'cupy': 'g'}
mspsecs_dict = {}

# ---- Benchmark loop --------------------------------------------------------
for backend in backends:
    rls = lm.RLSAlssm(nd_cost, steady_state=True, backend=backend, filter_form='cascade')

    if backend == 'cupy':
        rls.filter(Y)
        cp.cuda.Device().synchronize()

        def _run():
            rls.filter(Y)
            cp.cuda.Device().synchronize()
        proc_time = timeit.timeit(_run, number=n_exe)
    else:
        rls.filter(Y)  # warm-up / JIT compile
        proc_time = timeit.timeit('rls.filter(Y)', globals=globals(), number=n_exe)

    mspsec = K1 * K2 * n_exe * 1e-6 / proc_time
    mspsecs_dict[backend] = mspsec
    print(f"  {backend:8s}  {mspsec:.3f} MS/s")

# ---- Plot ------------------------------------------------------------------
labels = (f'NDCompositeCost.filter(Y)\n{K1}×{K2} image, poly_degree={poly_degree}',)
locs   = np.arange(len(labels))
width  = 0.25

fig, ax = plt.subplots(figsize=(9, 4))

for i, backend in enumerate(backends):
    rect_ = ax.barh(locs + width * i, [mspsecs_dict[backend]], width,
                    label=backend, color=color_map.get(backend))
    ax.bar_label(rect_, fmt="%0.2f", padding=4)

max_val = max(mspsecs_dict.values())
ax.set_xlim(0, max_val * 1.25)
ax.invert_yaxis()
ax.set_xlabel('MS/s')
ax.set_title(
    'Benchmarking of ND ALSSM filters:\nnumpy vs. lfilter vs. jit vs. GPU backends',
    pad=10,
)
ax.set_yticks(locs + width * (len(backends) - 1) / 2, labels, fontsize=9)
ax.legend()

fig.tight_layout()
plt.show()