## zero allocation continuations - part 3

16 july 2017

In part 1 and part 2 we have explored some techniques that allow us to build simple future-like computation chains without type-erasure or allocations. While our examples demonstrated the idea of nesting computations by moving *this into a parent node (resulting in "huge types"), they did not implement any operation that could be executed in parallel.

Our goal is to have a new when_all node type at the end of this article, which takes an arbitrary amount of Callable objects, invokes them in parallel, aggregates the results, and invokes an eventual continuation afterwards. We'll do this in a non-blocking manner: the thread that completes the last Callable will continue executing the rest of the computation chain without blocking or context-switching.

How can we achieve this? Is our existing design good enough?

### flaws with the current design

In our current design, the computation chain is executed bottom-to-top:

• Every node only knows about its parent, but not about its children.

• Every node expects to get a result when invoking the .call_with_parent() member function. While this is enough to model linear chains, it's not enough for a non-blocking fork/join model - it would mean that .call_with_parent() would have to block until all the parallel computations are completed so that a value can be returned.

Here's a diagram of the current design:

Note that we could make when_all fit in our current design, but it would need to block until all its computations are completed (e.g. using a latch). Here's an example of how it would look:

In order to allow when_all to have non-blocking behavior, we have to almost literally turn our design upside-down: we'll begin at the root, recursively executing the children of the current node.

• When we are at the root, we'll simply execute its child.

• When executing a node, we'll invoke the stored computation and execute its child in the same thread, with the result of the function.

• When executing a when_all, we'll invoke every stored computation in parallel - the last one that finishes will execute the next child in the same thread.

We'll have to solve two problems:

1. Finding the root, while propagating the children.

2. Going down the chain, executing everything in a non-blocking manner.

### going up

Here's a simple chain:

auto f = initiate([]{ return 1; })
.then([](int x){ return x + 1; })
.then([](int x){ return x + 1; });

Remember that:

• Every node stores and can access its parent, but has no knowledge about its children.

• f represents the final leaf of the chain.

How can we possibly find the root node while still keeping track of all the children found along the way?

The answer is, of course, recursion. We'll define a walk_up member function that, when invoked, will invoke itself recursively on the parent, passing the current node as an argument. This will build a linear chain of nodes as a variadic set of function arguments.

template <typename Scheduler, typename... Children>
void node</* ... */>::walk_up(Scheduler&& s, Children&... cs) &
{
this->as_parent().walk_up(s, *this, cs...);
}

The first node on which walk_up will be invoked is the final leaf. It will pass the scheduler and itself to its parent, which will do the same recursively until the root is found. Upon reaching the root, we'll stop going up and begin going down, having every node execute itself and its child.

Here's an example diagram of walk_up invocations:

### going down

The walk_up implementation for root will start "going down", by calling .execute on its child:

template <typename Scheduler, typename Child, typename... Children>
void root::walk_up(Scheduler&& s, Child& c, Children&... cs) &
{
c.execute(s, nothing{}, cs...);
}

The semantics of .execute are as follows:

• It always returns void.

• It uses the scheduler only when appropriate.

• For linear nodes .execute will completely avoid using the scheduler, in order to chain everything on the same thread.

• For fork/join nodes with $$N$$ computations it will execute one computation on the current thread, and schedule the remaining $$N - 1$$.

• It will call the current node's stored computation(s) with the result value received from the parent.

• If the current node is not a leaf, it will call .execute on its child.

This is the signature of .execute:

// Leaf case:
template <typename Scheduler, typename Result>
void /* ... */::execute(Scheduler&&, Result&&) &;

// General case:
template <typename Scheduler, typename Result,
typename Child, typename... Children>
void /* ... */::execute(Scheduler&&, Result&&, Child&, Children&...) &

As you can see, we always expect to get a scheduler and a result (which could be nothing) from our parent. If the current node is not a leaf, one or more children will be passed to .execute from its parent.

Here's a diagram showing a simplified example of going down a linear chain:

You can see the benefit of this approach in the following diagram. The scenario is as follows: after the root, there is a when_all node with three computations ($$A_0$$, $$A_1$$, $$A_2$$). In the particular hypothetical example run shown in the diagram, $$A_2$$ is the last node to complete execution. Since $$A_2$$ knows that its children are $$B$$ and $$C$$, it can invoke .execute on $$B$$ on the same thread, without requiring blocking/synchronization.

### implementation: linear node

Let's now take a look at how this new design is implemented in a linear chain. Do not worry - I promise that I'll show the implementation for when_all in this article. We'll begin with boilerplate:

• Every node will expose the input_type and output_type type aliases which will be used to easily compute the final result value of a computation chain.

• Nodes will derive from a child_of<Parent> base class, that reduces code repetition and simplifies CRTP usage.

struct root
{
// The root produces nothing.
using output_type = nothing;

// When we are at the root, we cannot go "up" the chain anymore.
// Therefore we being going "down".
template <typename Scheduler, typename Child, typename... Children>
void walk_up(Scheduler&& s, Child& c, Children&... cs) &
{
c.execute(s, nothing{}, cs...);
}
};
template <typename Parent>
struct child_of : Parent
{
// The input type of a node is the output of its parent.
using input_type = typename Parent::output_type;

template <typename ParentFwd>
child_of(ParentFwd&& p);

auto& as_parent() noexcept;
{
return static_cast<Parent&>(*this);
}
};

Here's node's interface:

template <typename Parent, typename F>
struct node : child_of<Parent>, F
{
using typename child_of<Parent>::input_type;
using output_type = result_of_ignoring_nothing_t<F&, input_type>;

template <typename ParentFwd, typename FFwd>
node(ParentFwd&&, FFwd&&);

template <typename FThen>
auto then(FThen&&) &&;

auto& as_f() noexcept;

template <typename Scheduler, typename Result>
void execute(Scheduler&&, Result&&) &;

template <typename Scheduler, typename Result,
typename Child, typename... Children>
void execute(Scheduler&&, Result&&, Child&, Children&...) &;

template <typename Scheduler, typename... Children>
void walk_up(Scheduler&&, Children&...) &;

template <typename Scheduler>
decltype(auto) wait_and_get(Scheduler&&) &&;
};

It's very similar to the implementation we had in the previous article, but there are a few things to note:

• output_type uses a weird result_of_ignoring_nothing_t type alias - we'll get to that in a bit. In short, it prevents passing nothing to another function to make the user interface more convenient.

• then is now ref-qualified with && - it's a good idea to enforce that, as *this will be moved during its invocation. (Thanks to Андрей Давыдов for the suggestion.)

• We have execute and walk_up.

With our new design, wait_and_get remains unchanged. The definition of node</* ... */>::then is however greatly simplified, because we don't need to propagate parents anymore:

template <typename FThen>
auto node</* ... */>::then(FThen&& f_then) &&
{
return ::node{std::move(*this), FWD(f_then)};
}

Let's now take a look at .execute (general case):

template <typename Scheduler, typename Result,
typename Child, typename... Children>
void node</* ... */>::execute(Scheduler&& s, Result&& r,
Child& c, Children&... cs) &
{
c.execute(s, as_f()(FWD(r)), cs...);
}

Firstly, it executes the stored computation with the value received from the parent node: as_f()(FWD(r)). It then invokes .execute on the next child, propagating the scheduler, the result of the computation, and the eventual rest of the children.

as_f()(FWD(r)) is a lie. The real code is actually call_ignoring_nothing(as_f(), FWD(r)). The call_ignoring_nothing(f, xs...) invokes f passing all xs... that are not nothing as arguments. Its implementation is explained in the appendix. The previously encountered result_of_ignoring_nothing_t behaves like std::invoke_result_t, but uses call_ignoring_nothing instead of std::invoke.

That's basically it for linear nodes - you can find a complete example on wandbox.org or here on GitHub.

### implementation: when_all

Finally, we're going to see some parallelism in action. The plan is as follows: we'll create a new type of node called when_all that:

• Stores $$N$$ computations.

• Schedules $$N - 1$$ computations and executes the remaining one on the current thread.

• Uses an internal std::atomic counter to keep track of pending computations. Whenever a computation is completed, the counter will be decremented and checked atomically: if it reached $$0$$, it means that we're the last computation to complete and that we need to move further down the chain.

• Stores a std::tuple of the final results that will be filled by every computation.

This is what the interface looks like:

template <typename Parent, typename... Fs>
struct when_all : child_of<Parent>, Fs...
{
using typename child_of<Parent>::input_type;
using output_type =
std::tuple<result_of_ignoring_nothing_t<Fs&, input_type>...>;

movable_atomic<std::size_t> _left{sizeof...(Fs)};
output_type _out;

template <typename ParentFwd, typename... FFwds>
when_all(ParentFwd&&, FFwds&&...);

template <typename... FThens>
auto then(FThens&&...) &&;

template <typename Scheduler, typename... Children>
void walk_up(Scheduler&&, Children&... cs) &;

template <typename Scheduler, typename Result>
void execute(Scheduler&&, Result&&) &;

template <typename Scheduler, typename Result,
typename Child, typename... Children>
void execute(Scheduler&&, Result&&, Child&, Children&...) &;

template <typename Scheduler>
decltype(auto) wait_and_get(Scheduler&&) &&;
};

template <typename ParentFwd, typename... FFwds>
when_all(ParentFwd&&, FFwds&&...)
-> when_all<std::decay_t<ParentFwd>, std::decay_t<FFwds>...>;

It is almost identical to a linear node, but has a few important differences:

• The constructor and deduction guide accept $$N$$ Callable objects. We could statically assert that $$N > 1$$.

• The struct contains the _left and _out members: these represent the shared state between all parallel computations. The _left atomic, as previously said, is used to figure out what computation finished last. The _out data member is used to store computations' result values.

• Note that these cannot be local variables in .execute, as it is completely non-blocking. We previously got away with a stateless node class because we had no scheduling and a single execution path: in that particular case we could just call .execute on the next child with the result value of the computation, executed in the same thread.

You might also have noticed that .then is now a variadic function template. It will now either produce a node or a when_all depending on the amount of passed functions:

template <typename... FThens>
auto /* ... */::then(FThens&&... f_thens) &&
{
static_assert(sizeof...(FThens) > 0);

if constexpr(sizeof...(FThens) == 1)
{
return ::node{std::move(*this), FWD(f_thens)...};
}
else
{
return ::when_all{std::move(*this), FWD(f_thens)...};
}
}

The initiate function has also become more interesting: it will instantiate a schedule, which is a new auxiliary node type which simply schedules the next child (instead of running it on the main thread). This allows more fine-grained control in the future, as the user might or might not want to schedule a computation chain depending on the context. Every invocation of initiate will either return a node or a when_all:

template <typename... Fs>
auto initiate(Fs&&... fs)
{
return schedule{root{}}.then(FWD(fs)...);
}

Before diving into when_all::execute, let's take a look at an usage example:

auto f = initiate([]{ std::puts("A0"); return 1; },
[]{ std::puts("A1"); return 2; })
.then([](auto t)
{
auto [a0, a1] = t;
return a0 + a1;
});

assert(std::move(f).wait_and_get(world_s_best_thread_pool{}) == 3);

In the above code snippet, initiate will produce a when_all node with two computations. In practice, you will see "A0" and "A1" being printed in an arbitrary order. When the last of the two computations is completed, the .then continuation will be invoked with a tuple containing $$(1, 2)$$. We then use structured bindings to unpack the tuple and return the final result, which should always be $$3$$. As always, wait_and_get guarantees to block until the whole chain is completed thanks to a latch.

Let's now look at the most interesting part: when_all::execute. We'll only analyze the general case, as the leaf case is pretty much the same.

template <typename Scheduler, typename Result,
typename Child, typename... Children>
void execute(Scheduler&& s, Result&& r, Child& c, Children&... cs) &
{
enumerate_args([&](auto i, auto t)
{
auto do_computation = [&]
{
using type = typename decltype(t)::type;

std::get<decltype(i){}>(_out) =
call_ignoring_nothing(static_cast<type&>(*this), r);

if(_left.fetch_sub(1) == 1)
{
c.execute(s, std::move(_out), cs...);
}
};

if constexpr(i == sizeof...(Fs) - 1)
{
do_computation();
}
else
{
s([&]{ do_computation(); });
}
}, type_wrapper_v<Fs>...);
}

There's a lot going on here. Firstly, let's look at enumerate_args:

enumerate_args([&](auto i, auto t)
{
// ...
}, type_wrapper_v<Fs>...)

This is a simple metaprogramming utility that, given an arbitrary amount of compile-time values, executes an user-defined action passing the index as the first argument and the value as the second argument. You can think about it as a compile-time version of Python's enumerate(...) function. Its implementation is explained in the appendix.

We're passing type_wrapper_v<Fs>... to enumerate_args. As the name suggests, type_wrapper is a simple compile-time value-like wrapper over a type - if you're familiar with Boost.Hana this should seem natural to you. Here's its implementation:

template <typename T>
struct type_wrapper { using type = T; };

template <typename T>
inline constexpr type_wrapper<T> type_wrapper_v{};

template <typename T>
using unwrap = typename std::decay_t<T>::type;

Our lambda will be invoked multiple times, like this:

• lambda(std::integral_constant<int, 0>{}, type_wrapper_v<f0>)

• lambda(std::integral_constant<int, 1>{}, type_wrapper_v<f1>)

• ...

• lambda(std::integral_constant<int, N>{}, type_wrapper_v<fN>)

Inside the lambda passed to enumerate_args we define another lambda: do_computation. It contains the instructions to execute for the current computation and will be invoked either with or without the scheduler:

auto do_computation = [&]
{
// Get the current computation's type out of the type wrapper.
using type = typename decltype(t)::type;

// Invoke the computation and assign its result to the i-th
// element of the _out tuple.
std::get<decltype(i){}>(_out) =
call_ignoring_nothing(static_cast<type&>(*this), r);

// If this was the last computation to finish, continue further
// down the computation chain.
if(_left.fetch_sub(1) == 1)
{
c.execute(s, std::move(_out), cs...);
}
};

The _left.fetch_sub(1) == 1 makes sure that the decrement and check is performed atomically. We're checking equality against 1 and not 0 as std::atomic<T>::fetch_sub returns the value of the atomic prior to the decrement.

Finally, we branch at compile-time to check whether or not we should schedule the current computation:

if constexpr(i == sizeof...(Fs) - 1)
{
// If this is the last computation, do not schedule it.
do_computation();
}
else
{
// Otherwise, schedule it.
s([&]{ do_computation(); });
}

This prevents "wasting" the thread running when_all</* ... */>::execute.

That's it! You can find a complete example on wandbox.org or here on GitHub.

Note that the current implementation produces invalid assembly with g++ due to a compiler bug. You can find more information here and it is trivial to work around the issue. Also, clang++ produces some nonsensical warnings related to deduction guides that I suppressed using -Wno-undefined-internal.

### appendix: call_ignoring_nothing

Given a FunctionObject f and a set of arguments xs..., this utility invokes f with all xs... that are not nothing. Here are some examples of what call_ignoring_nothing does:

• call_ignoring_nothing(f, 0, 1, 2) $$\to$$ f(0, 1, 2)

• call_ignoring_nothing(f, 0, nothing, 2) $$\to$$ f(0, 2)

• call_ignoring_nothing(f, nothing) $$\to$$ f()

• call_ignoring_nothing(f, 0, nothing, nothing, 1, nothing, 2, nothing) $$\to$$ f(0, 1, 2)

Its implementation is recursive. The base case is when there are no arguments: f is invoked immediately.

template <typename F>
decltype(auto) call_ignoring_nothing(F&& f)
{
}

Here's the recursive case:

template <typename F, typename T, typename... Ts>
decltype(auto) call_ignoring_nothing(F&& f, T&& x, Ts&&... xs)
{
return call_ignoring_nothing([&](auto&&... ys) -> decltype(auto)
{
if constexpr(std::is_same_v<std::decay_t<T>, nothing>)
{
return FWD(f)(FWD(ys)...);
}
else
{
return FWD(f)(FWD(x), FWD(ys)...);
}
}, FWD(xs)...);
}

Basically, it matches the first argument x and a possibly empty set of remaining arguments xs.... It recurses over call_ignoring_nothing, binding x to f if it's not nothing, otherwise completely ignoring it.

Using call_ignoring_nothing allows users to write...

initiate([]{ }).then([]{ });

initiate([](nothing){ }).then([](nothing){ });

### appendix: enumerate_args

This one is quite simple, but useful. The interface enumerate_args function invokes enumerate_args_impl, passing an std::index_sequence<0, 1, ..., N>, where N is the number of arguments:

template <typename F, typename... Ts>
void enumerate_args(F&& f, Ts&&... xs)
{
enumerate_args_impl(std::index_sequence_for<Ts...>{}, FWD(f), FWD(xs)...);
}

Inside enumerate_args_impl I use a fold expression that expands both Is... and xs... in lockstep to produce "pairs" of "index + argument":

template <typename F, typename... Ts, std::size_t... Is>
void enumerate_args_impl(std::index_sequence<Is...>, F&& f, Ts&&... xs)
{
(f(std::integral_constant<std::size_t, Is>{}, FWD(xs)), ...);
}