Adding a New HPX Binding to HPyX
This guide walks through adding a new HPX-backed function to HPyX — both
as a Python-callback parallel algorithm (goes in hpyx.parallel) and
as a C++-native kernel (goes in hpyx.kernels). The worked example
here is verified by tests/test_contributor_example.py — if the test
passes, every snippet in this guide is correct.
Prerequisites
- Read docs/codebase-analysis/hpx/CODEBASE_KNOWLEDGE.md sections 4.3 (parallel algorithms) and 5.1 (GIL discipline).
- Read docs/specs/2026-04-24-hpyx-pythonic-hpx-binding-design.md sections 3.3, 3.4, and 3.5 (binding patterns).
- Build HPyX with HPYX_BUILD_CONTRIBUTOR_EXAMPLE=ON (default in dev).
Decision: callback track or kernel track?
| Question | Answer track |
|---|---|
| Does your function operate per-element on a Python-callable input? | Callback track (hpyx.parallel.*) |
| Does it operate on a numpy array with pure C++ math inside? | Kernel track (hpyx.kernels.*) |
The callback track is more flexible (it accepts any Python callable) but slower: on GIL-mode Python every iteration acquires the GIL, so callbacks only run truly concurrently on 3.13t. The kernel track avoids per-element Python overhead, but it only works on numeric ndarrays of supported dtypes.
Callback-track example: sum_of_squares
We're adding hpyx.parallel.sum_of_squares(policy, iterable) — a
transform-reduce that squares each element then sums.
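As a point of comparison, the intended semantics can be sketched in plain sequential Python. The name sum_of_squares_reference is illustrative and is not part of HPyX; it only mirrors the transform (square) and reduce (sum) steps with the same 0.0 initial value.

```python
from functools import reduce
import operator


def sum_of_squares_reference(iterable):
    """Sequential reference: square each element (transform), then sum (reduce)."""
    squared = (float(x) * float(x) for x in iterable)  # transform op
    return reduce(operator.add, squared, 0.0)          # reduction op, init 0.0


print(sum_of_squares_reference([1, 2, 3, 4]))  # 30.0
```

A reference like this can serve as the expected value when testing the parallel binding.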
Step 1: Write the C++ binding
In src/_core/parallel.cpp (or for a new module, create your own file
and register it in bind.cpp):
// Policy parameters arrive as individual integers matching _Token in
// hpyx.execution (kind, task, chunk, chunk_size). The C++ layer always
// uses hpx::execution::par directly; policy dispatch is handled at the
// Python level. Parameters are accepted but intentionally ignored here.
static double sum_of_squares(
    int /*kind*/, bool /*task_flag*/, int /*chunk*/, std::size_t /*chunk_size*/,
    nb::iterable src_it)
{
    ensure_runtime();
    std::vector<double> src;
    for (auto item : src_it) {
        src.push_back(nb::cast<double>(item));
    }
    HPYX_KERNEL_NOGIL;  // release the GIL for the HPX call
    return hpx::transform_reduce(
        hpx::execution::par, src.begin(), src.end(), 0.0,
        std::plus<>{},                     // reduction op
        [](double x) { return x * x; });   // transform op
}
Key points:
- HPYX_KERNEL_NOGIL releases the GIL for the entire HPX call. Our
transform and reduce ops are pure C++ (std::plus, a lambda over
double) — they never touch nb::object, so this is safe.
- If your transform op needs Python (e.g., pred(x)), use
HPYX_CALLBACK_GIL inside the lambda. See count_if for reference.
- The policy parameters (kind, task_flag, chunk, chunk_size) are
accepted in the signature to match the calling convention Python expects,
but the C++ layer uses hpx::execution::par unconditionally. If you need
seq/par_unseq support, add a Python-level branch or a separate C++
overload.
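The Python-level branch mentioned in the last point might look roughly like the sketch below. Everything named here is an assumption for illustration: Token stands in for hpyx.execution._Token, the KIND_SEQ and KIND_PAR integer codes are invented, and parallel_impl is an injected callable standing in for _core.parallel.sum_of_squares.

```python
from collections import namedtuple

# Hypothetical stand-in for hpyx.execution._Token; field order follows this guide.
Token = namedtuple("Token", ["kind", "task", "chunk", "chunk_size"])

# Hypothetical kind codes; the real integer values are defined by HPyX.
KIND_SEQ, KIND_PAR = 0, 1


def sum_of_squares_dispatch(token, iterable, parallel_impl):
    """Branch on the policy kind in Python before reaching the C++ binding."""
    if token.kind == KIND_SEQ:
        # Sequential policy: compute in Python and never enter the parallel path.
        return float(sum(float(x) * float(x) for x in iterable))
    # Every other kind falls through to the binding, which always runs `par`.
    return parallel_impl(
        token.kind, token.task, token.chunk, token.chunk_size, iterable
    )
```

Keeping the branch in Python leaves the C++ signature unchanged, at the cost of a slower sequential path than a dedicated C++ overload would give.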
Step 2: Register the binding
In the register_bindings(nb::module_& m) function for your submodule:
m.def("sum_of_squares",
      &sum_of_squares,
      "kind"_a, "task"_a, "chunk"_a, "chunk_size"_a, "src"_a);
Step 3: Write the Python wrapper
Append to src/hpyx/parallel.py:
def sum_of_squares(policy, iterable):
    """Sum of squares of elements under the given execution policy."""
    _runtime.ensure_started()
    t = policy._token()
    return _core.parallel.sum_of_squares(t.kind, t.task, t.chunk, t.chunk_size, iterable)
Add "sum_of_squares" to __all__.
Step 4: Write the test
In tests/test_parallel.py:
def test_sum_of_squares():
    assert hpyx.parallel.sum_of_squares(par, [1, 2, 3, 4]) == 30.0
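The single assertion above covers the happy path. A few edge cases could be bundled into a reusable helper such as the sketch below; check_sum_of_squares is a hypothetical name, and the generator case assumes the binding accepts any iterable rather than only lists.

```python
import math


def check_sum_of_squares(fn):
    """Run edge-case assertions against any callable fn(iterable) -> float."""
    assert fn([]) == 0.0                          # empty input yields the init value
    assert fn([3]) == 9.0                         # single element
    assert fn(x for x in (1, 2, 3, 4)) == 30.0    # generators, not just lists
    assert math.isclose(fn([0.1, 0.2]), 0.05)     # floats need a tolerance
    assert fn([-2, 2]) == 8.0                     # sign disappears after squaring
    return True


# Demonstrated here against a pure-Python stand-in for the binding.
reference = lambda xs: float(sum(float(x) * float(x) for x in xs))
check_sum_of_squares(reference)
```

In tests/test_parallel.py the same helper could be called with a lambda that forwards to hpyx.parallel.sum_of_squares(par, xs).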
Step 5: Rebuild and run
pixi run -e test-py313t pip install --force-reinstall --no-build-isolation -ve .
pixi run -e test-py313t pytest tests/test_parallel.py::test_sum_of_squares -v
Kernel-track example: l2_norm_squared
We're adding hpyx.kernels.l2_norm_squared(a) over a numpy ndarray.
Step 1: Write the C++ binding (templated over dtype)
Append to src/_core/kernels.cpp:
template <typename T>
static double l2_norm_squared(
    nb::ndarray<nb::numpy, const T, nb::c_contig> a)
{
    ensure_runtime();
    const T* p = a.data();
    std::size_t n = a.size();
    HPYX_KERNEL_NOGIL;
    // Note: the sum is accumulated in T and only the final result is
    // converted to double, so integer dtypes can overflow on large inputs.
    return static_cast<double>(
        hpx::transform_reduce(
            hpx::execution::par, p, p + n, T{0},
            std::plus<T>{},
            [](T x) { return x * x; }));
}
Step 2: Register for each dtype
m.def("l2_norm_squared", &l2_norm_squared<float>, "a"_a);
m.def("l2_norm_squared", &l2_norm_squared<double>, "a"_a);
m.def("l2_norm_squared", &l2_norm_squared<int32_t>, "a"_a);
m.def("l2_norm_squared", &l2_norm_squared<int64_t>, "a"_a);
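One consequence of registering the integer dtypes is worth illustrating: the kernel starts from T{0} and reduces with std::plus<T>, so the running sum lives in T. The sketch below emulates 32-bit wraparound with ctypes to show how quickly squares exceed the int32 range. It is an emulation only; signed overflow in C++ is undefined behavior, so the actual result on a given platform may differ.

```python
import ctypes


def sum_of_squares_int32(values):
    """Emulate accumulating x*x in a 32-bit signed integer with wraparound."""
    acc = 0
    for x in values:
        # c_int32 keeps only the low 32 bits, mimicking typical wraparound.
        acc = ctypes.c_int32(acc + x * x).value
    return acc


values = [50_000, 50_000]
print(sum(x * x for x in values))    # 5000000000 (exact)
print(sum_of_squares_int32(values))  # 705032704 (wrapped)
```

If integer inputs with large magnitudes are expected, accumulating in double inside the kernel or documenting the limit are two possible responses.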
Step 3: Write the Python wrapper
In src/hpyx/kernels.py:
def l2_norm_squared(a):
    """L2 norm squared: sum of squared elements of a numpy array."""
    _runtime.ensure_started()
    _check(a, "l2_norm_squared")
    return _core.kernels.l2_norm_squared(a)
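This guide does not show the body of _check. As a rough idea of what such validation might involve, here is a hypothetical sketch named check_array. It relies only on the dtype.name and flags.c_contiguous attributes that numpy arrays expose, and the supported dtype list is assumed from the four registrations in Step 2; the real helper may behave differently.

```python
SUPPORTED_DTYPES = ("float32", "float64", "int32", "int64")


def check_array(a, name):
    """Hypothetical validation sketch; the real hpyx _check may differ."""
    dtype = getattr(getattr(a, "dtype", None), "name", None)
    if dtype is None:
        # Lists, tuples, and other non-array inputs have no dtype.
        raise TypeError(f"{name}: expected a numpy ndarray, got {type(a).__name__}")
    if dtype not in SUPPORTED_DTYPES:
        raise TypeError(f"{name}: unsupported dtype {dtype}")
    if not a.flags.c_contiguous:
        raise ValueError(
            f"{name}: array must be C-contiguous; pass np.ascontiguousarray(a)"
        )
```

Raising from Python gives callers an error message that names the function and the fix, instead of a generic overload-resolution failure from the binding layer.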
Step 4: Test against numpy
def test_l2_norm_squared_matches_numpy():
    a = np.random.rand(10_000).astype(np.float64)
    assert hpyx.kernels.l2_norm_squared(a) == pytest.approx(np.sum(a * a))
GIL discipline checklist
Before submitting a PR, verify:
- nb::object is never accessed from within a HPYX_KERNEL_NOGIL block.
- Every Python callback inside a C++ lambda uses HPYX_CALLBACK_GIL.
- Every kernel body over nb::ndarray uses HPYX_KERNEL_NOGIL.
- If you call fn(*args) or similar, you hold the GIL.
- If you block (e.g., wait on a future), you don't hold the GIL.
Common mistakes
- Forgetting ensure_runtime() / _runtime.ensure_started(). The C++ side will throw RuntimeError. Always call ensure_runtime() at the top of your C++ function and _runtime.ensure_started() in the Python wrapper.
- Capturing a nb::callable by reference in a task lambda. The task may outlive the Python stack frame. Always capture by value.
- Using HPYX_KERNEL_NOGIL when the transform_op is a Python callable. The callable can't run without the GIL. Either use HPYX_CALLBACK_GIL inside the transform lambda (callback track), or rewrite the op in pure C++ (kernel track).
- Forgetting the task-returning variant. If your algorithm accepts a policy and the policy might carry the task tag, add a _task-suffixed C++ function that returns HPXFuture<T> and a matching branch in your Python wrapper.
- Non-contiguous ndarray. Nanobind's nb::c_contig constraint will throw at the Python layer. Document that callers must pass np.ascontiguousarray(a) if needed, or add a Python-side check.
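For the task-returning variant, the matching branch in the Python wrapper might be shaped like the sketch below. The function name and the injected sync_impl and task_impl callables are hypothetical; in HPyX they would correspond to the plain binding and a _task-suffixed binding (for example a sum_of_squares_task function, which this guide does not define).

```python
def sum_of_squares_with_task(token, iterable, sync_impl, task_impl):
    """Route to the future-returning binding when the policy carries `task`."""
    args = (token.kind, token.task, token.chunk, token.chunk_size, iterable)
    if token.task:
        # Asynchronous path: the `_task` binding returns a future-like object.
        return task_impl(*args)
    # Synchronous path: the plain binding returns the value directly.
    return sync_impl(*args)
```

Callers then receive either a value or a future depending on the policy, so the wrapper's docstring should state both return types.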