One of the best features of futures (or promises, depending on your language background) is the ability to compose them through asynchronous continuations. Example:
// pseudocode
auto f = when_all([]{ return http_get("cat.com/nicecat.png"); },
                  []{ return http_get("dog.com/doggo.png"); })
         .then([](auto p0, auto p1)
         {
             send_email("mail@grandma.com", combine(p0, p1));
         });

f.execute(some_scheduler);
In the above pseudocode snippet, the http_get requests can happen in parallel (depending on how the scheduler works - you can imagine it’s a thread pool). As soon as both requests are done, the final lambda is automatically invoked with both payloads.
I find this really cool because it allows developers to define directed acyclic graphs of asynchronous/parallel computations with a very clear and readable syntax. This is why the currently crippled std::future is evolving into something more powerful that supports .then and when_all: these facilities are part of N4538: “Extensions for concurrency” - Anthony Williams introduced them excellently in his ACCU 2016 “Concurrency, Parallelism and Coroutines” talk.
hidden (?) overhead
Let’s look more closely at the signature of std::experimental::future::then:
template <class F>
auto std::experimental::future<T>::then(F&& func)
-> std::experimental::future<
std::result_of_t<std::decay_t<F>(std::experimental::future<T>)>
>;
As you can see, the return type is well-defined, and it is another std::experimental::future. This means that the continuation is type-erased into a new future. This is how a possible implementation of then might look:
template <class T, class F>
auto then(std::experimental::future<T>& parent, std::launch policy, F&& f)
    -> std::experimental::future<
        std::result_of_t<F(std::experimental::future<T>&)>
    >
{
    return std::async(std::launch::async, [
        parent = std::move(parent),
        f = std::forward<F>(f)
    ]() mutable
    {
        parent.wait();
        return std::forward<F>(f)(parent);
    });
}
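For illustration, a hypothetical call site for that sketch could look like the following. Note that make_some_future is an assumed helper (not part of any real API) returning a std::experimental::future<int>; the point is that the continuation receives the parent future itself, not its value.

// Hypothetical usage of the `then` sketch above - `make_some_future`
// is an assumed helper returning a std::experimental::future<int>.
auto parent = make_some_future();
auto child = then(parent, std::launch::async,
                  [](std::experimental::future<int>& p)
                  {
                      // The continuation gets the future and unwraps it itself.
                      return p.get() + 1;
                  });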
Hmm… std::async, type erasure… this means that there might be potential allocations! Let’s attempt to roughly measure their overhead. I created five .cpp files where I used boost::async and boost::future to create a chain of .then continuations, starting with zero continuations and ending with four. Example:
// zero continuations
int main()
{
return boost::async([]{ return 123; }).get();
}
// one continuation
int main()
{
return boost::async([] { return 123; })
.then([](auto x) { return x.get() + 1; })
.get();
}
// ...
// four continuations
int main()
{
return boost::async([] { return 123; })
.then([](auto x) { return x.get() + 1; })
.then([](auto x) { return x.get() + 1; })
.then([](auto x) { return x.get() + 1; })
.then([](auto x) { return x.get() + 1; })
.get();
}
I then compiled every file with…
# g++ 7.1
g++ -std=c++1z -Ofast -DNDEBUG -S
…and counted the lines of stripped generated assembly. These are the results:
| Continuations | 0     | 1     | 2     | 3     | 4     |
|---------------|-------|-------|-------|-------|-------|
| Lines of asm  | 10111 | 10949 | 11581 | 12178 | 12843 |
[Plot: lines of generated assembly against number of continuations]
This shows a very approximate increment of roughly 600 to 850 lines of assembly per .then continuation. I know what you’re thinking…
This is a terrible benchmark that doesn’t prove anything!
…and I almost completely agree. What I wanted to show is that due to the use of boost::async, type erasure, and potential allocations, the compiler is unable to “compress” the continuations into a single chain of operations and instead keeps every future instance allocated somewhere in memory.
This might be desirable, depending on your use case - the Callable invoked by .then does indeed take a future as an argument instead of the result of .get(). But this might also be undesirable - in this situation the compiler can see the entire chain of .then calls, and could ideally allocate once for all of them.
Additionally, even if the overhead of allocations and scheduling is usually small compared to IO operations, it might add up if a program extensively uses asynchronous continuations.
an alternative design
I’d like to show an alternative design that doesn’t require any allocation whatsoever and still enables users to build up asynchronous computation chains using facilities such as when_all and .then.
What’s the secret? Avoiding type erasure and allowing the type of the Callable objects passed to the continuations to persist. This is what we’re going to implement in this article:
int main()
{
    auto f = initiate([]{ return 1; })
             .then([](int x){ return x + 1; })
             .then([](int x){ return x + 1; });

    f.execute(/* some scheduler */);
}
How many lines of generated assembly with a purely sequential and synchronous scheduler?
struct synchronous_scheduler
{
    template <typename F>
    decltype(auto) operator()(F&& f) const
    {
        return f();
    }
};
2! The compiler is able to inline everything.
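For reference, this is the kind of translation unit being measured here - a minimal sketch wiring the example chain to the synchronous scheduler:

// Minimal sketch: the whole chain runs inline on the caller's stack.
int main()
{
    auto f = initiate([]{ return 1; })
             .then([](int x){ return x + 1; })
             .then([](int x){ return x + 1; });

    return f.execute(synchronous_scheduler{});
}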
How many lines of generated assembly with a scheduler that simply calls std::async?
struct asynchronous_scheduler
{
    template <typename F>
    decltype(auto) operator()(F&& f) const
    {
        auto fut = std::async(f);
        return fut.get();
    }
};
891!
…wait, is that a good result?
It is, because it doesn’t change if more continuations are added. As an example…
auto f = initiate([]{ return 1; })
         .then([](int x){ return x + 1; })
         // ...20 identical continuations...
         .then([](int x){ return x + 1; });
…still produces 891 lines of assembly. This happens because decltype(f) is a huge type containing all the types of the continuations - the compiler is able to inline everything.
implementation
Let’s analyze the implementation, beginning with initiate:
template <typename F>
auto initiate(F&& f)
{
    return node{std::forward<F>(f)};
}
The initiate function simply returns a node storing the passed Callable. If you’re not used to C++17 features yet, you might think that node is a normal class: it actually is a class template, and its template arguments are being deduced thanks to class template argument deduction. This is its perfect-forwarding deduction guide:
template <typename FFwd>
node(FFwd&&) -> node<std::decay_t<FFwd>>;
It basically means that when a node is constructed with an object taken by FFwd&& forwarding reference, the deduced class template argument will be std::decay_t<FFwd>.
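As a quick (hypothetical) illustration:

auto l = []{ return 42; };
node n{l}; // deduces node<decltype(l)>, storing a decayed copy of `l`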
Let’s take a look at node’s definition:
template <typename F>
struct node : F
{
    template <typename FFwd>
    node(FFwd&& f) : F{FWD(f)}
    {
    }

    template <typename FThen>
    auto then(FThen&& f_then);

    template <typename Scheduler>
    decltype(auto) execute(Scheduler&& s) &
    {
        return s(*this);
    }
};
The node class is a lightweight struct that stores the Callable F via inheritance (to enable the empty base optimization). The node::execute member function simply invokes *this through the passed scheduler s (which might be something like a thread pool). It is ref-qualified with & because some schedulers might be non-blocking and we want to prevent temporaries from being scheduled and immediately dying before having a chance to finish execution.
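Here’s a small sketch of what the ref-qualifier prevents, reusing the synchronous_scheduler defined earlier:

auto chain = initiate([]{ return 1; });
chain.execute(synchronous_scheduler{}); // OK: `chain` is an lvalue

// initiate([]{ return 1; }).execute(synchronous_scheduler{});
// ^ ill-formed: `execute` is `&`-qualified and cannot be called on a temporary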
Also, FWD is just a macro that expands to:
#define FWD(...) ::std::forward<decltype(__VA_ARGS__)>(__VA_ARGS__)
The final missing piece is node::then:
template <typename FThen>
auto then(FThen&& f_then)
{
    return ::node{[
        parent = std::move(*this),
        f_then = FWD(f_then)
    ]() mutable -> decltype(auto)
    {
        return f_then(static_cast<F&>(parent)());
    }};
}
So short, yet so interesting:

- ::node is used instead of node, as class template argument deduction wouldn’t kick in otherwise (thanks T.C.!).
- A lambda expression is being passed to the new node, which:
  - Captures std::move(*this) as parent. While this looks weird, it is completely fine to move *this as long as it’s not going to be used anymore afterwards. What we’re doing here is crucial: we’re moving the current computation chain inside the newly created one without any allocation or type erasure: this means that decltype(*this) will be part of the type of the new continuation.
  - Captures f_then by using std::forward. This is OK, but might not do what you expect when f_then is an lvalue reference: you might be interested in my previous “capturing perfectly-forwarded objects in lambdas” article.
  - Is marked as mutable. This is necessary as we may move the newly-created node again in subsequent computations, and moving doesn’t play nicely with const.
  - Has an explicit decltype(auto) trailing return type, to allow references to be returned from user-defined Callable objects. A lambda expression defaults to -> auto otherwise, which always makes a copy.
  - Invokes the continuation f_then with the result of invoking the move-captured *this. This means that all the continuations will be executed on the stack and their results will be passed down the computation chain until it’s over. Note that this could be improved by using std::invoke to truly support generic Callable objects (a sketch follows this list). Additionally, a scheduler-aware version of then could be provided, which might schedule the static_cast<F&>(parent)() call asynchronously instead of executing it synchronously, in order to avoid stack overflows - a possible drawback of that design choice could be introducing “deadlocks” in thread pools with a limited number of threads (that will be waiting on each other).
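As mentioned in the last point above, here is a rough sketch (not part of the final implementation) of what the std::invoke-based variant of then could look like:

// Sketch only - assumes <functional> is included for std::invoke.
template <typename FThen>
auto then(FThen&& f_then)
{
    return ::node{[
        parent = std::move(*this),
        f_then = FWD(f_then)
    ]() mutable -> decltype(auto)
    {
        // std::invoke also handles pointers to members, unlike a plain call.
        return std::invoke(f_then, std::invoke(static_cast<F&>(parent)));
    }};
}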
So, what is the type of f? decltype(f) is huge, because there is no type erasure. Let me show you an intentionally-produced compiler error from the snippet with 22 .then continuations:

[Compiler error output: the diagnostic spells out the full nested node type]
This is the end of the first part of “zero-allocation continuations”. In the next one we’ll take a look at a possible implementation of when_all, and experiment with thread pools.
I began experimenting with this kind of continuation a few months ago - I was writing a library called orizzonte that I abandoned to focus on other projects (and because I was frustrated by the lack of C++17 compiler compliance back then). I made it public on GitHub today and might continue/rewrite it in the future - regardless, you might find this file interesting, where very generic (but slightly broken) versions of .then and when_all reside.
Thanks for reading!