Cache-oblivious hierarchical recursive divide-and-conquer usage model
The cache-oblivious recursive divide-and-conquer model is a standard way of dividing a large task/problem recursively into smaller sub-problems (tasks) until each of those sub-problems (subtasks) becomes small enough to solve them efficiently using a single thread (i.e., sequentially). The generated subtasks form a directed acyclic graph (DAC) among them and can be represented using fork-join (spawn-sync) primitives. Many practical problems such as sorting, matrix multiplication, many dynamic programming problems, stencil computations, linear algebra kernels can be solved using a cache-oblivious recursive divide-and-conquer technique. The word cache-oblivious here means an algorithm that does not use cache parameters in the algorithm (unlike a tiled algorithm which chooses a tile size to fill the cache wisely). It has been shown that for a class of problems cache-oblivious recursive divide-and-conquer technique leads to highly- (parallel, faster, cache, bandwidth, and energy) efficient solutions compared to state-of-the art methods (IPDPS 2015, PPoPP 2015, 2016, 2017). Furthermore, the resulting algorithms are also self-adaptive and robust to dynamic fluctuations in shared system resources.
Randomized work-stealing schedulers (e.g., available in Cilk and TBB) are perfect for scheduling these algorithms on modern multicores. However, problem comes when one wants to use hierarchical hybrid architectures (e.g., CPUs + coprocessors), since work-stealing in such environment is costly. Having a general purpose hierarchical heterogeneous tasking model that can do schedule such algorithms on hybrid distributed-shared memory systems will be highly beneficial.