HiHAT Meeting Minutes

__TOC__
!! Be sure to log in before you save, else changes will be lost !!
= Template Task Graph, Joseph Schuchart, Jan 16, May 21, and July 16, 2024 =
* 5/21/24 participants include Wilf, Joseph Schuchart, Alexandre Bardakoff, Andrey Alekseenko, Michael Wong, PiotrL, StephenJ, Szilard Pall, CJ
* 7/16/24 participants include Joseph Schuchart, Alexandre Bardakoff, Andrey Alekseenko, Michael Wong, PiotrL, StephenJ, Szilard Pall, Jan Ciesko, Walid Keyrouz, CJ
* Overview
** Based on graph work at ICL/UTK, including PARSEC
** distr data flow as an abstract task graph
** nodes are template tasks, no preemption
** unrolled during execution - SPMD
** concurrent execution across task graphs
* Input and output edges
** Can send data to some or all outputs
** Task can't execute until all inputs available
** Explicit naming of targets, using keys passed in as a parameter
** CJ: No separation between implementation of a function and the context of that node's use. This precludes a separation of concerns
** The intention here is designing the algo that consists of nodes and edges. No anticipation of task reuse.
** Szilard: Complex schedule, record a sequence of nodes and edges. Prebaked schedule by a single dev who has most all of the context. If there are extensions, then you might lack reuse.
** Thought about splitting work out of a task. Hard to send functors across.
* Data
** Moved into graph without making a copy
** Infra for zero copy xfers across nodes
** Haven't done this in the context of host-side NUMA systems since it wasn't an issue, but are doing this for GPUs, where automatic load balancing wasn't good enough
* Memory model - device
** TTG manages device mem, host is a backup for transparent eviction wrt oversubscription
** Buffer has logical home on host, currently always materialized there
** CJ: we as a community are moving to consider accessors vs. materialized data buffers since data sets may not be able to fit in any amount of memory
** Stephen: Do you share or partition the data? No, just replicate.
** Like Kokkos view, can be owning/non-owning. Track last update/access. Always get latest version.
** Strong consistency, since all deps are explicit.
* Execution model
** declare input and scratch data
** TTG runtime assigns device and execution stream based on inputs and device load
** co_await select()'s device, co_await kernel() awaits kernel
** Sending outputs is the last step of a task
* Next time
** Results, apps, compare/contrast discussion
** Target next month, June 18
7/16/24 continuation
* Composition granularity
** Backward looking, based on integration pts [implicit dependencies] - data str (StarPU, Legion), mem loc (OpenMP, OmpSs), futures/promises; or task-wise composition, distr data str, implicit contracts
** Forward looking [explicit deps], based on pure data flow, task graphs, algos composed thru edges
** Ex: POTRF
* Advantages of this approach
** Treat tasks like functions
** Free from objects on which deps get inferred
* Challenges of this approach
** Can't infer dep graph based on changes; dev has to grok and specify and update the whole graph
** Lack of object-based dep management introduces potential flow control problems
** Lots of versions of inputs when producer gets ahead of consumer
** Back pressure. Add control flow edges as a back channel to send an empty msg to producer task. Experimenting with buffer depth.
** CJ: Need a global view wrt resources? How does a given programming model approach help or hurt robustness? Consider a 5-stage pipeline where all resources could get consumed by early stages that prevent progress in later stages.
** CJ: Is there a separation between creating the graph via implicit inference and explicit construction and the instantiation of the graph with adequate resources?
** Stephen: But there are more dynamic allocations of resources possible once the graph is known, rather than doing everything statically. There can be sparse data that would drastically overcommit resources in the static case.
* Target apps
** Block sparse matrix algos for sparse tensor algos in TiledArray. Not operating on fixed data structures. Tiles flow thru graph without materializing, so no risk of corrupting data. Would otherwise have to put a copy of data where dep could see it, make it globally visible. Instead, runtime can see and access and transparently copy local data.
** Multi-resolution function analysis in MADNESS
* CJ: Let's put a plan together to bring this back for a deeper-dive compare/contrast analysis of perf, resource usage, robustness, debuggability, productivity. Let's bring this back and try to draw in reps from OpenMP, OmpSs, StarPU, HPX to compare/contrast.
* CJ: How about doing some of that analysis of perf and productivity at p3hpc.org's SC24 workshop? Maybe a panel?
* Piotr: Maybe a WIP session there.
= Kokkos Graphs and Conditionals, Jonathan Liflander, Jan 17, '23 =
Participants: Stephen Jones, Wilf, Jonathan Liflander, SzilardP, CJ
Simple diamond conditional, pick one saxpy or another.
* 4 nodes
* Non-conditional version does a deep copy of the conditional value to CPU, evaluates conditional on CPU and invokes a graph with one side or the other
Perf
* More iterations helps - negative for 10 or 100, positive for 1K, 10K
* Instantiating the graph takes about as much time as running the graph once
** Without conditionals: 216us, seems high
** With conditions: 2.76ms, seems very high - order of magnitude more
* Larger problem sizes wash out the impact, as expected
* If the instantiation with conditionals was similar, conditionals would have been a clear win
* Warmup didn't matter much.  Stephen: been working to reduce sensitivity for that in CUDA 12
Conclusion
* Looks promising; some gain now and we'd hope that'd be much more helpful
* Common motif among several folks at Sandia.  Conditionals are a big gating factor.  Looking forward to more conversations with Kokkos kernels team.
* Working on CG now, implementation nearly done, hoping for next week
* Benefits may be unclear, especially if instantiation costs aren't predictable
Feedback
* Stephen working to formalize the mechanism; should reduce overheads
* Jonathan to share test case with Stephen, would like to have Jonathan chat with the graphs team
* Not everything you can put in a graph can be put in a while loop
Discussion
* CJ: Are there or will there be Python wrappers for Kokkos?  JL: Not that much need.  Mostly C++ or Fortran.
* CJ: Can be difficult to recognize the processing of a list as a loop, which could have implications.
* Jonathan is also working on Darma, doing work on load balancing in Python, interacts with C++ code.  Uses output to guide distribution.  This could be relevant in that domain.
* Szilard: Python used only for high-level drivers, not in GROMACS context.  More focused on balancing high- vs. low-level APIs.  Some need for interaction with data operated on by inner loop, but the desire is for that interaction to be flexibly done with a Python API.
* Jonathan: Python used more as a higher level.  Data fed from app, in situ.
* Szilard: Similar - hot loop, capture and export data, e.g. for viz, sometimes modifying.  Can imagine some perf requirements.  Cases not so well documented, lots in early stages.
* JL: Can tank perf if not careful -> Python is pure driver code
* CJ: Expressing control flow in a way that's amenable to Python may be of increasing interest
= CUDA Graph Updates, Stephen Jones, Oct 18, '22 =
Participants: Stephen Jones, Wilf, David Fontaine, PiotrL, WalidK, SzilardP, CJ
* External dependencies via events and memops
** in/out deps not permitted with CUDA Graphs, have to split them into diff graphs connected with sync
** Memops: cuStreamWaitValue(), cuStreamWriteValue()
* Stream-ordered memory alloc: async take a stream argument
** cudaMalloc/FreeAsync
** Can allocate in one graph and free in another, virtual addresses are unique across graphs, but phys addresses may be reused
** Can change topology of graph for edges connecting nodes involving memory management
** CUDA may add more edges to serialize to avoid OOM.  Create a memory pool before beginning to execute a graph instance that satisfies the worst-case paths thru the graph.  Wait to start the graph if you can't create that pool.  Add extra edges based on the conditions evaluated at the start of that instance's execution.
** CUDA implicitly inserts postdominators as necessary in order to track when a graph completes and all of its held pool of unallocated memory can be made available for other graphs
* Dynamic parallelism
** used to be able to launch kernels from kernels, now you can launch graphs from graphs
** Encapsulation boundary - GPU stream within a CPU stream
* Named streams in CUDA 12.0 later this year; CUDA names classes of patterns that you can use; you're not naming streams
** cudaStreamPerThread - tell CUDA you can optimize around a straightline pattern
** cudaStreamFireAndForget - device-side kernels issue concurrently vs. sequentially
** cudaStreamTailLaunch - sequential after parent completes
** 3x faster with fork/join disabled (22.5 -> 7.2 us), very close to CPU launch (6.5 us).  1.14x in Mandelbrot test
* Encapsulation for device-side graph launch
** Whole launching graph, so can't create new dep that induces fork/join within the parent graph
** Similar named streams for graphs
** Adding sibling launch for loops.  Sibling is launched outside of deps, a level up, creating a nested encapsulation
** Can make a decision at runtime about launching an appropriately-selected graph by relaunching scheduler after each launch in a loop
** Launch from device is 2.2x faster than from host: 4.5us vs. 9.9us.  Lower latency from a shorter control loop.
** Extra bookkeeping on CPU is avoided, so pretty flat on GPU vs. scales with concurrency on CPU


= Hedgehog Static Analysis and Multi-node Applications on Heterogenous Systems April 19, 2022 =
Abstract: In this presentation we investigate the different uses of static metaprogramming techniques in Hedgehog to secure the computations. We especially showcase how we make the API safe and how we propose an extensible compile-time library for representing and testing the Hedgehog graph structure. Then we provide a sneak peek at the newest features of Hedgehog v3. Lastly, we discuss our ongoing efforts to extend Hedgehog to support execution on multiple nodes, while still utilizing Hedgehog's excellent single-node performance. The approach taken is to draw on Uintah's runtime system for large numbers of compute nodes to produce a prototype extension to Hedgehog that allows execution on potentially large numbers of nodes. This is further demonstrated with an example of matrix-matrix multiplication using Hedgehog over multiple nodes.

Attendees: Alexandre Bardakoff, John Holmen, Martin Berzins, Tim Blattner, Piotr Luszczek, David Bernholdt, David Fontaine, Stephen Jones, Wael Elwasif, Nitish Shingte, Wilf, CJ

Presentation 1: Hedgehog Static Analysis - Alexandre Bardakoff
[[Media:Presentation HiHat Static Analysis.pdf|Presentation HiHat Static Analysis]]

Goals
*Conformity checks with Template Metaprogramming
*Static graph with constant expressions

Presentation 2: Multi-node Applications - Nitish Shingte
[[Media:Presentation HiHat Cluster.pdf|Presentation HiHat Cluster]]

Goals
*Extend Hedgehog to support multiple task graphs across multiple nodes

[[File:YouTube_icon.png|20px|link=https://youtu.be/e2VueSpXsRM]]

Latest revision as of 17:04, 16 July 2024

Description: Regular monthly meeting for the HiHAT Program
Next Meeting: 2021-10-19 at 10am Pacific, 1pm Eastern
Regularity: Monthly
Termination Date: 2022-10-13

Contact Information for HiHAT Monthly Reviews

Meeting Minutes

This wiki page is used to keep minutes from phone and face-to-face meetings on the topic of usage models, user stories, and applications for heterogeneous hierarchical asynchronous tasking. The most recent meetings are listed on top. See Presentations for the ongoing agenda for monthly meetings and the materials that were posted.

A link to get back up to the parent page is here.

Potential future topics

* BSC graphs on CUDA Graphs with OpenACC
* GROMACS
* DaCe (rhymes with "face")
* Layering, hierarchy, dynamism - especially in a multi-node or NUMA context
* C++ futures for shared memory distributed systems?




= Adding Conditionals to Kokkos Graphs Mar 15, 2022 =
Participants include
* Jonathan Lifflander, Alexandre Bardakoff, Tim Blattner, Jan Ciesko, Piotr Luszczek, David Bernholdt, Szilard Pall, David Fontaine, Wilf, CJ
Investigations
* Used warmup: create 1000 nodes in CUDA Graphs to warm up the memory pool
* Simple diamond of 2 AXPBYs
* Testing functionality
Initial results
* Deep copy is a slowdown: cudaMemcpyAsync after each call, during which compute is idle, 10 us
* Host-pinned is worse
* UVM really tanks perf; streamSynchronize took a lot longer - work took longer
Added conditionals, so the value is kept on the GPU vs. deep_copy or UVM
* Bug with (sentinel) empty nodes inside a conditional. Sentinel nodes make it easier to express control dependencies.
* Such sentinel nodes could be combined with the work that follows
* Jonathan plans to file the bug in NVBugs, has a reproducer
Analysis and Discussion
* More gain with more concurrency, especially when available parallelism exceeds what's supportable. Jonathan will try that out.
* With conditionals, getting 1.10-1.18x speedup
* Tie gains over the previous approach to causes that show up in the profiler?
* CUDA Graph instantiate is taking ~200us, which inhibits speedup. Would otherwise be 1.3x vs. 1.1x. Amortizable, but this explains why fewer iterations don't show speedup. David: not expected that this takes so long. Jonathan will open a bug report.
* Tried CUDA 11.4 for conditionals and it just crashed, needed to revert to 11.1. Seg faulted. Didn't try anything between 11.1 and 11.4.
* Useful for Sandia's use cases. Started thinking about a Kokkos Graph Conditional API.
Profiling
* Conditionals not showing up. Full profiler support may not be there for the experimental feature.
* Overheads? 1.3-1.4x slower with the profiler, on lots of very fine-grained kernels
* NVTX? TSC? Not tried yet.
* Seeing kernels that he'd expect to be concurrent run sequentially, even though they were on different streams. Jonathan will try playing with profiling options.

Kokkos Graphs

[[File:YouTube_icon.png|20px]]

= Benchmarking Kokkos Graphs Oct 19, 2021 =
Participants include
* Jonathan Lifflander, George Bosilca, Jan Ciesko, Piotr Luszczek, Szilard Pall, Wilf, CJ
Jonathan
* Leads DARMA at SNL
* Driving Kokkos Graphs after Daisy moved to Google
Investigations
* Reduce kernel launch overheads
* More concurrency, easier than streams?
* How much graph reuse is necessary to amortize overheads?
* Used iterative CG on Sierra with CUDA 10, and a Kokkos development machine with V100s and a newer CUDA driver
Results
* Per-iteration benefit increases with more iterations
* For the older driver, CUDA 10
** 70 us --> 40 us; ~1.5x for smaller sizes but not large
** Slowdown of 0.6x for few iterations; 30-40 iterations was the breakeven point
** 3ms instantiation
* For the newer driver, CUDA 11.2
** 200us instantiate
** But the cutover point didn't really change
** Very nominal benefit before removing the startup cost; could become a 30% improvement
Analysis and Discussion
* Wished for better tools than just Nsight
* 40us Kokkos overhead seemed reasonable
* Stephen/CJ: takes 7us to allocate pools the first time; <1us thereafter. Try allocating and then destroying a 1000-node graph as part of init
* Szilard: Especially the newer profiler can be really invasive, especially at small resolution.
* CJ: Can't you use NVTX?
* Szilard: Even just tracing seems to be measuring too much. Favors TSC without the profiler to be sure.
* Jonathan: Will try experimenting with that. But wall-clock-only measurements line up with profile-based projections.
* Stephen, offline: If you measure anything other than kernel-boundary timings (i.e. if you're measuring IPC or cache misses in the profiler), it'll flatten the graph to a straight line. Kernel-boundary timings still add in extra overhead as well, because the profiler has to disable some of the hardware work queue chaining in order to insert the timestamps. It's a smaller effect with a scalefactor-style offset though, so not so bad. Profiler overheads were reduced around 11.2.
* Almost no concurrency in CG. CJ: More concurrency should show greater gain. So would reducing the utilization on the CPU side.

= MPI forum WG on streams and graphs Sep 21, 2021 =
Participants include
* AlanG, Benoit Meister, CJ, DavidB, FabianM, PiotrL, Rishi Khan, StephenJ, Wilf
Context
* MPI Hybrid accelerator WG
* Stream and graph triggering, kernel triggering
Reduce overheads
* Stream triggered: transfer of control between CPU and GPU
* Kernel triggered: offloading; overlap compute and comms
Streams and graphs
* Discussions have started with streams, expanding to graphs
* Vendor agnostic
* Could wrap these with typed C++ bindings


= Graph Characterization July 20 =
Participants include
* Alan Gray, Jan Ciesko, Piotr Luszczek, Stephen Jones, Szilard Pall, Wilf Pinfold, CJ Newburn
Next mtg in Sep; skip Aug.
* Kokkos graphs by Jonathan Lifflander
* Jim Dinan's MPI proposals

Kokkos Graphs

Possibly more on GROMACS in two months; we'll see. Issue: there may be different orderings with the same topology. That may not be addressed for a while.

Piotr working with OpenMP 4+ task graphs
* Work with task depend clauses
* Launching custom kernels
* Haven't yet found a compelling use case for CUDA Graphs. Tracking and graph instantiation done on the OpenMP side.
* Trying to avoid launch overheads by increasing granularity through code restructuring and using larger block sizes
* Sending smaller tasks to the CPU is becoming less viable as more data is on the GPU. They have a data tracker to tell where data is, and try to pick whether to launch on CPU or GPU based on that.
* MESI protocol in SLATE
* Mixing of CUDA launches and MKL for CPU in SLATE (also MPI) and PLASMA
* GROMACS tracks some data but only with hand-tuning, nothing automated. Brainstorming about such schemes, but granularities are so small (10s-100s of cache lines) that no tracking and decision scheme has low enough overhead to make it worthwhile. Usually avoid running multiple independent tasks partitioned across threads within a rank. Try to optimize for reuse across consecutive tasks on the same threads.
* Stephen: Existing mechanisms to pin data in cache.

Ordering across multiple graphs
* Jan: Can have a graph on host.
* Stephen: Looking at interaction with MPI. Concerns about recursive CUDA calls that could happen from an MPI callback. Callbacks to the CPU in CUDA are pretty rudimentary, e.g. wrt thread affinity for the context of the callback.
* Jan interested in collaborating on evaluation and prototyping. Important topic.
* Stephen: Time frame is 3-6 months for relaxing restrictions in an MPI context. Jim Dinan has MPI proposals ready now.

Characterization of Sample cases

Participants include

  • Alan Gray, Hans Johansen, Jan Ciesko, Jose Diaz, Piotr Luszczek, Stephen Jones, Szilard Pall, CJ Newburn

Interested in

  • Task granularity
  • Task concurrency
  • Criticality of certain paths, how well known that is to the programmer, and how dynamic that is
  • Reuse
  • Dynamic memory allocation
  • One task generates a parameter used by another task

CUDA Graphs in GROMACS - scheduling and priorities - Alan Gray, Szilard Pall

  • Reuse: inner loop of 50-200 iterations
  • Task granularity: ?
  • Task concurrentcy: max 2 for 1 GPU; with 4 GPUs, 1xPME, 3xPP
  • CPU/GPU: focused on GPU only for now
  • Graph handling: could span multiple steps but they don't; would introduce challenges wrt irregularity
  • Priority: want to prioritize PME kernels but CUDA Graphs doesn't recognize that yet

MPI and communication

  • GROMACS' internal thread-MPI lib uses cudaMemcpy within the same process, and that enables capture
  • What if cudaIPC were used, could capture happen across processes? Theoretically, a cudaIPC event could be captured but how would 4 captures be synchronized? Perhaps, if you do a big barrier before and after you capture.
  • CJ: Does the graph need to know or care that there's an external input or output dependence?
  • Stephen/CJ: You can identify interface nodes ahead of time and hook the graphs together beforehand
  • Szilard: and then prune away unnecessary deps
  • Stephen/CJ: Multiprocess launch, but you were doing that anyway and consumers will just wait for the producers
  • Jan: Lots of research on this 10 years ago in Europe. Producer/consumer relationships. Tasks knew which data they were consuming, unblocked tasks waiting on the receipt of given data
  • CJ: Good optimization to block tasks until data received rather than spin and wait inside the task. HiHAT graphs looked at codifying that in an interface node that managed interfacing with entities outside of the graph, whose successor deps were not satisfied until completion of that interface node.
  • Jan: Cholesky is a common use case for data flow. Different from a control-flow-instantiated graph; ordering is based on data readiness
  • Stephen: CUDA Graphs design favored control vs. data flow since except for streaming, data flow is trivially mappable to control flow.

Asks

  • False dependences from the last iteration when recording multiple iterations?

Hans Johansen, LBNL - need for a dynamic "queue" API

  • Non-linear Jacobian for chemistry systems in 3D simulations
    • Sparse matrix, redo with different parameters and differences in convergence behavior quickly induce warp divergence
    • Stephen: similar to ray tracing problem, Ray Tracing guys also wanted something at warp granularity
    • CJ: related to SINGE work being considered to solve a similar problem
  • Parallel iterative methods, batched linear algebra
    • Batched solvers have different matrices each of the same size
    • Use of direct (pivoting?) or iterative (conditions?) solve also leads to different convergence behavior
    • This is why batched cuBLAS has no pivoting
  • Requested solution
    • Run until there's no more work without kernel or graph re-launch
    • Could start diverging after convergence
  • Discussion
    • Stephen: also lack of data convergence within a warp
    • Access to shared memory, willing to gather/scatter to precondition, willing to tolerate lack of warp contiguity
    • Graphs - iteration

Usage scenarios of interest to Jan Ciesko @ Sandia

  • A blocked SpTRSV
  • A Gauss-Seidel - the wavefront there is nice, comes in at 1/2 the memory compared to Jacobi
  • Cholesky factorization - not iterative but we might be able to show better hardware utilization
  • LU factorization

Next time

  • Provide more characterization data - see above
    • Alan/Szilard, Hans, Jan to all add to these notes what they can
  • Look at interests and futures in prioritization of the critical path

Heterogeneous task scheduling of molecular dynamics in GROMACS - Szilard Pall

Participants for Jan 19 include

  • Szilard Pall, Alan Gray; David Bernholdt, Fabian Cordero, Simon De Conzalo, StephenJ, Wael, Wilf, CJ

Participants for Feb 16 include

  • Szilard Pall, Alan Gray; AntonioP, David Bernholdt, HansJ, JiriD, Simon De Conzalo, StephenJ, TimB, Wael, Wilf, CJ

Participants for May 18 include

  • David, Piotr, Hans, JohnB, JoseM, Stephen, Szilard, Wael, Walid, Wilf, CJ


Szilard's presentation

  • Context
    • MD, with halo exchange
    • Domain decomp, kernels, reduction, integration
  • Characterization
    • Moderate granularity for tasks. Total time per iteration is <10ms; 2ms is common for strong scaling, and it can be 100s of us. Limited by all-to-all comms.
    • Concurrency across kernels is around 12, could be a total of a couple of dozen
    • Looking at how to improve scheduling with CUDA Graphs and how to increase throughput, especially for strong scaling
    • Inherent requirement: launch in inner loops and reuse 100s of times, also use on CPU, eventually use internode comms.
  • Caching
    • MD runs out of cache, materials are latency sensitive
  • Decomposition
    • Spatial
    • Resource re-allocation and load balancing changes made in outer dashed blue loop, not inner grey.
    • Heuristics are a mix of table lookup and measurement based
    • Published last year - Heterogeneous parallelization and acceleration of molecular dynamics simulations in GROMACS

J. Chem. Phys. 153, 134110 (2020); https://doi.org/10.1063/5.0018516

  • Parallelization
    • MPI + OpenMP for CPU, CUDA or OpenCL for GPUs
    • "A few years ago, would have said anything for perf. Now, with programming model wars, only use models where help implementing is provided."
    • Subset of ranks dedicated to orange FFT work; faster with a limited # of ranks
    • Haven't found a way to rebalance dynamically. No time to steal tasks or shift data. Quite stable across 100s of iterations.
    • For certain types of data, purple or magenta tasks may go away.
  • Perf challenges
    • Making data resident on GPU to avoid PCIe overheads. Migrated MD step to GPU, CPU is helper that enables compatibility with a wider range of features. CPU back-offload does infrequent steps.
    • Launch overhead. Exploring CUDA Graphs. Have looked at scheduler threads in the past.
    • GPU P2P for the sake of rank-rank communication
  • Forward progress not guaranteed
  • Topology and affinity challenges
  • Multi-level load balancing
    • Measuring wall time of CPU+GPU is not possible with cudaEvents, so dynamic scheduling forced off

Alan's presentation - where/how CUDA Graphs helps

  • use CUDA Graphs capture
  • link streams with events
  • Used a proxy code; Stephen had some feedback
  • Perf
    • Always positive for single-GPU cases. Benefits from cudaGraphExecUpdate.
    • Lower for multi-GPU cases, sometimes slower. Implementation limitation for cudaGraphExecUpdate in a multi-GPU context, since the graph may not be recaptured the same way. Projections show it would help, but still need a threshold to figure out when to apply CUDA Graphs.
  • Trade-offs: one graph per timestep has smaller scope and hence less benefit; host nodes could have perf limitations. Composability with libmp may help? CUDA Graphs now supports cuStreamWrite/WaitValue32 natively, but it still involves a kernel flag.
  • Reuse
    • 100s of iterations
    • Update when you can. Likely to capture differently, with different ordering. Working on minimizing that effect; expecting it not to be a major problem. Dependencies are stored separately and are determined by the app.
    • Reason for re-capture: change in parameters only, update. Now cheap enough to update the whole thing that you don't care, as long as you don't re-instantiate.
    • Recording is cheap, instantiation is expensive
    • SP: Can you not launch a task of size zero?
    • SJ: We have other asks for that and that looks feasible. Could be done by setting a flag to false. Could also use conditional execution - greater generality but higher latency. Sounds like you can know this already at launch time.
    • SP: That could be more elegant, but not necessary except for more irregular apps. Might need 2-3 graphs, because of different steps.

Discussion

  • Do you have monitoring infrastructure that captures tasking overheads and scheduling issues?
  • Does CUDA Graphs' CPU and GPU nodes help? Does the need to "pick" CPU or GPU at node creation time inhibit any meaningful flexibility?
  • Does exposing all dependencies to CUDA Graphs help the deadlock (livelock?) problem?
  • Should work through timing issues related to dynamic scheduling
  • How easily and cleanly was the CUDA Graphs-ness encapsulated?
  • HiHAT
    • Truly hierarchical and hetero
  • Affinity
    • Manual management of thread affinity with GPU nodes vs. host nodes
    • Not so sensitive to latency
    • Need reuse of thread pool with affinity
    • Can use two pools, e.g. for blue and magenta tasks, but then you don't get locality. Blue and magenta tasks have very different compute requirements.
    • Focused on sharing in L1, L2. Aggressive in managing affinities if not imposed from outside already.
    • Using TLS within OMP loop scope, so must be in the context of thread with TLS.
    • Stephen: would have to be explicit; In CUDA, register a callback and wait for it. Consider a blocking call to wait for event.
  • CPU nodes: Some code not ported to GPU
    • Can expose as a CPU node (few us), or to reverse offload from a GPU node (few 10s of us).
    • Polling off of memory locations
    • Stephen: need to find ways to burn a thread to spin or suffer interrupt latency
    • Szilard: Have plenty of CPU resources, so polling shouldn't be an issue
  • Key ask: Need to work on integrating comms. Queue this up next month.

May 18

  • Transition from queue/stream to priorities, critical path acceleration
  • Thinking of memops and streams, building task graphs vs. just recording them
  • How should we think about the critical path and expression priorities to the runtime
  • Internal discussions
  • GROMACS is going through a transition for some back ends that are graph-scheduling first (SYCL BE); it has been CUDA (in-order queues) first. SYCL doesn't have graph support now; CUDA is superior wrt features.
  • Porting is a big effort for a small team.
  • Starting with explicit dependencies
  • After looking at multiple use cases and multiple optimizations, realizing that depending on explicit deps would be a liability for correctness. So thinking more about task-based runtimes, how to build task graphs.
  • Static problem, know the deps ahead of time. Short iterations. Graph is static over a few iterations.
  • More of a SW engineering challenge. Mixed codes expressing CPU-GPU async as well as CPU-only in the same code base, with multiple code paths.
  • Don't want breadth first if a small fraction of kernels are on critical path.
  • Perhaps the runtime could enable measurement of time, so we could derive weights of nodes and/or edges. Have a pretty good guess as a programmer. Already do load balancing on CPU, based on measuring compute without communication.
  • Priorities in the presence of preemption and a greedy scheduler would not be sufficient, unless there is a mechanism for learning over time.
  • SP: Would be a step forward to express edge weights as metadata. Averages may not always be representative. Fine-grained changes may not make sense for 10s-100s of usec.
  • SP: Capturing conditionality would enable building in load-balancing capabilities
  • SJ: Could gather stats. Could steer control flow.
  • Example of streams and graphs?
  • SJ: GPU does its own dynamic load balancing, but you can set priorities. Prioritized speed of decision making over flexibility. Can put a marker at end of stream and query whether it's been reached. Breadth or depth first traversal? Now handled only by HW. Added an editable node property for priority from the beginning.
  • SP: Want B2B not just priority. Want to avoid latency of draining lower-priority work.
  • SJ: Long challenge to intentional packing. Often hear case of big rocks first.
  • SJ: Observatoire de Paris - forks left 90% of time, favor it in priority. Make a decision on CPU side which graph version will be launched. Biasing which you fetch and prebuild.
  • Hans: use case for CUDA Graphs - iterations where branch taken as a function of # iterations. Creates bottlenecks if you assume lock step, since it eats up occupancy. Common for chem kernels. Have to get something done to get back into lockstep. Can't predict what will provide full utilization. Could be priorities on mem prefetch if that's a bottleneck. Would prefer to swap work in dynamically. Work stealing and stuffing. May use ptr swaps.
  • CJ: could have side threads packing a slate of work, making a trade-off of packers and workers.
  • SJ: Can only use all threads; can't sequester a subset of resources for independent graph management on each. There are others asking for this. Can't evict work that's currently on the machine, can only stop dispatching new work.
  • SP: If initial latency can't be reduced, you at least want to avoid recurring latency from packing.
  • CJ: Specific use cases to drive an assessment of whether priorities are sufficient.
  • SP: Next step: plot concrete instances using dot interface, can try recording subgraphs into a stream.
  • SJ: Interest in control or heuristic influence over execution order.
  • Piotr: tend to have a well-defined critical path, but can get muddy with very large graphs. Graphs often get latency sensitive since they are sub-usec duration. Small tasks with high priority. No one there has ventured into CUDA Graphs. Mostly inside OpenMP tasks now. Can find out whether a transition from OMP tasks to CUDA Graphs is of interest.

Thoughts on Integrating CudaGraph Features into Performance Portability Libraries, Oct 20 and Nov 24, Daisy Hollman

Participants for Oct 20 include (19)

  • Daisy Hollman, DavidB, WaelE, StephenO, AlexB, BenoitM, DmitryL, HansJ, JanC, JohnB, JohnS, Leonel, PiotrL, SzilardP, TalB, TimB, Wilf, CJ

Participants for Nov 24 include ()

  • AlexB, AntoninoT, CJ, DaisyH, DavidB, FabianC, Jan Ciesko, JohnB, PiotrL, StephenJ, StephenO, TimB, WaelE, Wilf

Goals of KokkosGraphs

  • ...

Insights

  • Consider what you're really trying to accomplish with graphs
  • CudaGraphs does what a vendor layer should do - can write things on top of it
  • Could write a higher-level scheduler on top of this
  • Concerns about only being able to change parameters and not attributes and structure. Wants to add/rm nodes.
  • Mapping a restricted programming model onto CUDA Graphs as an execution model

Questions to ask

  • Cognitive load
  • Perf cost of excluding functionality
  • How does it look in other execution models?
  • Alternatives if functionality excluded
  • How many apps benefit from the direct expression of a particular feature instead of the alternative expression when it's excluded
  • Ex: cudaGraph[Exec]KernelNodeSetParams
  • Modify in place or delete then re-insert?
  • Parameterize submission for execution with params?

Discussion and notes

  • Concrete plans in Kokkos for SYCL Graphs, which isn't very mature yet
  • Interesting transitions: Launch-launch, destroy-build
  • SJ: Trying to expose what's efficient and not? DH: Yes. Good to make it look expensive to reconstruct a new graph.
  • SJ: By removing some arrows, all changes may end up looking equally expensive, which they are not. DH: Trying to recover that, with persistent objects which can be modified while in or out of execution phase. SJ: Makes sense.
  • DH: Really wants conditional execution. SJ: Available via a special experimental header.

C++ object lifetimes

  • Clean up automatically, avoid move semantics
  • Use one object for build and another for the whole life of the graph. Similar to nested object lifetime scopes, e.g. immediately-evaluated lambda idiom in C++ or Ruby frozen object model.
  • Don't have to figure out how to build graph while it's executing
  • Task can't create more tasks, no hierarchical structure to worry about
  • No conditionally-executed tasks
  • No need to recover from errors.
  • Still present: can't allow the user to create dependency cycles

Plan to continue next time, in November

  • Use cases that create need for generalities that are lost with the above simplification
  • Come back with examples
  • Daisy to summarize key simplifications for users and implementers



Tasking_in_Accelerators_CudaGraph+OpenACC, Aug 18, 2020, Leonel Toledo, Antonio Pena

Participants included

  • Alexandre Bardakoff, CJ Newburn, David Bernholdt, James Beyer, Jeff Larkin, Leonel Toledo, Stephen Jones, Tim Blattner, Stephen Olivier, Wilfred Pinfold, Wael Elwasif, Fabian Mora Cordero, Piotr Luszczek
  • Showed speedup from adding CUDA Graphs to OpenACC, compared with OMPSS and other implementations
  • Explored other topics for a small audience: dynamic parallelism, placing streams creation overheads in different loop levels, sensitivity of perf to exposed parallelism
  • Used DGEMM on P9+V100s for perf evaluation

Hedgehog, Alex Bardakoff and Tim Blattner, Mar 17/Apr 21/May 19/Jun 16/July 21, 2020

Participants on Mar 17 include (14)

  • TimB, Alex Bardakoff, DavidB, DmitryL, GeorgeB, JamesB, Jiri Dokulil, Jose Monsalve Diaz, MauroB, PiotrL, Szilard, WaelE, WalidK, Wilf, CJ
  • State
    • Associated with each action
    • Can have multiple state managers
  • Memory management
    • Can stall and wait for another free using a condition variable
    • Can have multiple memory managers
  • Graphs
    • Profiling and graphical representation helps visualize bottlenecks
  • Safety
    • Coherency. Want to use C++20 context object
  • Perf
    • Scales linearly with threads; usec latency is tolerable
    • Good performance on Windows
  • Applications
    • Microscopy, ...
  • Follow up items
    • Connecting multiple graphs, e.g. across nodes
    • Memory management plumbing and gating
    • Feedback on HiHAT's profiling, collaboration on a common profiling API
    • Walid is interested in revisiting the above
    • CJ: let's look at sharing APIs and notes on the above and following up next time

Participants on Apr 21 include (18)

  • SteveA, TimB, JamesB, AlexB, AntoninoT, HansJ, JingX, JiriD, JohnB, JohnS, PiotrL, Shuvra Buattacharyya, Stephens, WaelE, WalidK, Wilf, Wojtek, CJ
  • Recap
    • No global scheduler, no global state
    • Shared input queues -> mutual exclusion. Localized work stealing wrt a limited set of producer(s) and consumers. Could have multiple producers. Can have multiple input types.
  • Why is identification of input and output nodes by user necessary? Because compile time only?
    • Actions are based on input types
    • Use traits to check compatibility
  • Push (define async, reactive tasks served by worker threads) doesn’t preclude a pull (proactive) model. How do tasks create data and generate work?
    • Queue created according to addEdge and type
    • One queue per type. Multiple producers will send info to that queue, don't know where the trigger comes from unless you embed a source ID.
    • Could declare an edge more abstractly from a whole task, not all of the program points inside it, to an error handler.
    • addResult gets broadcast to all tasks
    • Could expand queue functionality to do error handling
  • Can a task have many out edges that get triggered prior to the completion of the whole task? For example, could split vector task have let a batch increment task (that no longer subdivides) start working as soon as 1/10th of its work, i.e. the first split, was done?
    • Completely async, addResult pushes work to be consumed
  • (Stephen:) Can you do throttling and how?
    • Can throttle by # items
    • addResult stalls? No, getMemory will happen before that. Based on resource starvation vs. backflow. Potential for deadlock, e.g. declaring that a resource will be used 10 times before being given back; if you don't use it the 10x it may never be given back.
  • (Stephen:) As a dev, do I need to make an effort to go depth first vs. breadth first in order to not over-consume resources?
    • Have to understand resource usage and ordering as a programmer
  • Could pushData have been simply a data movement node?
    • Can get read/write race conditions. Duplicate as needed.
    • Jing: Only shared_ptr, not a deep copy. What if you have multiple reads of the same data? Alex: need a copy per consumer.
    • A task could be attached to a file. Developed a lib called task loader to read tiles and push them to later stages. Could be attached to a network socket.
    • Could specialize tasks for data generation, but haven't. Could create a repo of reusable code.
  • I’d be interested to know where the gap comes from between “first block time” and “average block time”. Is the first exceptional, or are all blocks slightly different times (so last block is faster than average)? Is this an implementation detail or an effect of the runtime?
    • 1st: how fast next computation can begin. Avg: across others.
  • finishPushingData tells it to not expect any more data beyond what’s already been received, so that worker threads can quit. Relevant for persistent kernels.
    • Exactly. canTerminate can be customized, e.g. execute N times before termination.
  • What are the multiple conditions that could trigger dynamic execution, e.g. address (e.g. specific dependence) or generic type? How does a postdominator act on a stimulus from a then or else clause of a preceding (runtime) conditional?
    • CJ: can a consumer declare that it doesn't want more inputs? Yes, but subsequent producers will still fill the queue. You can terminate the consumer but the queue will remain until the queue goes out of scope, unless you create some way to trigger back to the producer. Could have external factors outside of the graph's dependencies. Tim's done this before. Can use an addEdge from consumer to producer for this, but there's only support for one output type. Can't distinguish between input and output with variadic on both. Sticking to this firmly so far since otherwise you can't distinguish. Fundamental limitation based on templates. Have expanded to a super class for output types with an enum inside it.
  • How is feedback “costless?”

Participants on May 19 include ()

  • Wilf, Tim, Alex, Walid, Fabian Cordero, Wael, Szilard, Wojtek, Piotr, JohnB, CJ
  • Recap
    • Data flow, pipelined tasks, async system with dependence resolution and decentralized resource mgt
    • Explicit, compile-time oriented. Type-aware, template based, header oriented.
    • Can't modify a graph once it starts execution
    • ~1usec sync codes -> tasks are somewhat coarse
  • Start on slide 18 of the posted Q & A deck
  • What are group nodes? Do they relate to hierarchy?
    • Please clarify "# clones specified by a task's constructor."
  • Slide 18: “Bind a graph to a GPU”. Does this imply graphs do not span GPUs? Do they have external incoming/outgoing edges in order to synchronize with graphs on other GPUs?
    • 1 graph per device. Then the graph is programmatically partitioned per device for the case where execution on different devices is interleaved? There's no abstraction that'd help an independent scheduler do the binding and ordering?
    • All data movement is done explicitly by the programmer; the system only manages dependences
    • Some collaboration with College Park around a data flow system. It detected where data movement was needed and automated the realization of data movement into explicit task bodies.
    • Is there any abstraction used to help set up cudaMemcpy across GPUs/processes? No.
  • Slide 16 mentions “Shutdown virtual method to break cycles.” This has implications on the underlying hardware which may introduce inefficiencies (for a GPU, this would be the case as upcoming work would not be able to be pre-fetched). A shutdown conditional node would solve this (i.e., prefetch disabled only for the successor to the shutdown node), but limits the locations where shutdown can occur. There are many other approaches: what does Hedgehog have in mind here?
    • canTerminate can be overridden. Can specify a time to live (ttl). Otherwise would loop infinitely, deadlock.
  • Slide 41: “Asynchronous pre-fetch”. My own experiments on transparent multi-GPU execution indicate that performance is very sensitive to data locality. Is there a pinning operation for the data? Do you assume the pre-fetch is persistent? Would data migration be beneficial at all? What is the granularity with which these pre-fetches can happen?
    • No pinning, no help for peer access, no help for data decomposition
    • Again, consider layering wherein a scheduler can insert prefetches/data movement as needed, based on binding decisions.
  • State management
    • Please explain more about why locks are needed. State manager owns a state object. Locks, sends, updates, adds to queue, unlocks, sends outputs. In an execution pipeline, graph is duplicated, some tasks may share metadata. What about multiple readers? Why do you need a lock if you just use a flag written after data? How do these locks relate to a simple indication that a dependence is satisfied?
    • Szilard: Is there a distinction between a [chunk of data from a task and a data collection from a set of tasks. Is there one lock for a range of chunks?] Tim: Guards only an element, not a range.
    • Szilard: maybe have a benchmark for basic latency metrics, to compare different kinds of operations, including those that may have multiple readers
    • Tim: no read/write conflicts by construction; operating on state between tasks, not the data itself.
    • Walid: considered spin locks, went for mutexes. Serial pipeline on each thread, so only local state managers.
    • This covered data; what about resources: compute, memory, network?

Ended here on 5/19

  • Perhaps for next time
    • Dynamic vs. static
    • Granularity - Szilard is operating with a time step of 1ms, tasks are 10s-100s of us
  • Maybe queue up sooner
  • MPI/OMP: interested in OMP task dependencies on buffer in/out for isend/recv?
  • Mem mgt
    • How are alloc/free "encapsulated into the memory manager?"
    • Can you limit available memory? If there's a memory manager, one may wish to limit its maximal allocation to something less.
  • Commentary
    • Focused on their needs vs. more ambitious solutions
    • Focused only on tasks, leave to other tools to instrument other items in the system

Planning to continue next month.

Participants on June 16 include

  • Adam, Alex, Fabian Cordero (UDel), James, Piotr, Stephen, Szilard, TimB, Walid, Wilf, CJ

Demo: Hedgehog with Nsight systems

Code example: matmul

  • Using unified memory from memory mgr
  • Set a time to live, dec ttl, free at zero
  • Grab from a pool, user specifies size of that pool

State management

  • Based on flow of data
  • Explicitly managed by user
  • Collaborated with U MD
  • OpenMP: overhead for spawning threads. Different paradigms of threading may conflict. Discourage OMP within tasks. Suggesting C++ mechanisms instead.
    • Szilard: mostly uses OpenMP, which works in most cases; has heard some bad things about C++ mechanisms. What about reductions, atomics, partial barriers?
    • Alex: sought to do something portable across Linux/Win/MacOS. Rely on user to implement their own. user monitor implementation, condition variables.
    • Tim: use existing libs such as linear algebra, but call them in single-threaded mode to harvest Hedgehog's thread-level parallelism. Could have a CUDA kernel invoked. Hedgehog seems to scale pretty well.
    • CJ: Some libs may do more to exploit locality, but Hedgehog can offer TLP. Consider pSTL.
    • Szilard: thinking more about finer-grained parallelism than Hedgehog is
    • Tim: Haven't upgraded to C++21 or pSTL.
    • Walid: 32 threads, latency 7--9.5 microseconds.
    • Alex: Here is the latency graph comparing Hedgehog & HTGS implementation, where latency on last graph is time between a task sending a data, and a task receiving it https://drive.google.com/file/d/1IkIgpkNCN090KXqsUxO1ogQKjbvbYrfw/view?usp=sharing

MPI

  • No way to scale beyond 1 node. Collaborating with UINTAH. 1 Hedgehog graph per node, special interaction node.
  • CJ: Could look at HiHAT interaction nodes
  • Stephen: Use Hedgehog for comms vs. MPI, hide MPI?
  • Tim: Yes. MPI tends to be block sync, unlike this. UINTAH abstracts that away, can push state into task graphs.
  • Walid: They had more expertise with MPI at large scale. They already work with task graphs. Just started, hoping for results in a year or so.
  • CJ: Two paths: incremental use of graphs in each node and making MPI integrated into graphs and more async, or rewrite without MPI and use global task graphs and perhaps MPI or UCX underneath.
  • Piotr: MPI does not require matching. MPI_ANY_SOURCE and MPI_ANY_TAG work just fine

You can also use one-sided MPI which is even more decoupled. There is not even a requirement for matching buffer sizes. Just post your receives() with 2 GiB buffers and then check how much arrived.

Memory mgt

  • See slides
  • CJ: pass reference to data produced in graph or fix up the passed value after deferred execution submitted?
  • Tim: pass data explicitly from producer to consumer, don't want to update the state of another task

Hedgehog tutorials

Participants on July 21 include

  • Alex Bardakoff, Antonio (BSC), CJ, Dmitry, Fabian Cordero (UDel), Kris Keipert (NV), Leonel (BSC, postdoc), Tal, Tim, Walid, Wilf
  • Input/output relationships, scaling to multi-node or NUMA
    • Need an abstraction for data movement
    • UINTAH examples. Deps being satisfied implicitly triggers data xfers.
    • Could be moved more proactively, but there is no scheduler - only on demand now.
    • Currently easy to reason about and design. Requires explicit design of data flow graph.
    • Execution pipeline duplicates a graph. Share state when state manager is shared.
    • Considered support for NCCL. Considering TensorRT, OpenCV support.
    • Hedgehog is a low-level API.
    • Dmitry: Have you looked at UPC++? Walid: Not yet, given focus on single nodes.
  • Tasks
    • Memory management is per task
    • State management is a task, on one thread. State may be shared via locks.
    • Compute tasks only transform data
  • C++
    • Templates everywhere, metaprogramming techniques
    • No problem to use non-C++ functions. Had Fortran wrappers, can use extern C.
    • Static analysis in C++20
    • Questioned benefits of a C interface, as benefits may be limited. Tim is open to enabling one but not doing it. Walid gave a no to Fortran. Alex: may not be possible to port to plain C. Walid: would involve ripping out some aspects and adding a runtime lib.
    • CJ: Some customers have hard requirements for C ABI.
    • Tal: Switching to generating C code for that reason. Template-heavy libs would be detrimental to what DaCe is doing. Can keep most code in a separate translation unit; separate compilation can be ok.
    • Tal: Most users use Fortran, have an interop interface.
    • Tim: Might be possible. All data is in a shared pointer that needs to be unwrapped. Can settle on well-defined types - restricted but workable.
    • Tal: Releasing and taking pointers can be challenging. Tim: Memory that exits the graph may need to get wrapped. Tal: Graphs can get called more than once, have to manage that.
    • SFINAE used only in mem mgr. Ack that it can be hard to interpret compile-time error messages.
  • Productization
    • Streaming ops for signal processing. Fairly heavy compute bound work.
    • A few folks at NIST
    • Surprised that people are finding them.
    • In discussion with other folks outside NIST.
    • Graph analytics guys haven't approached them. May be too fine grained.
  • Layering on DFGs
    • CJ: Explicit model of nodes and dependencies can be layered under a more implicit model?
    • Tim: Using the data flow model explicitly is really fundamental.
    • Walid: Can use DSL to infer DFG. No plans to create such a DSL.
    • Tim: Historical project with image stitching. Tried traditional way with sequential code but support for async copying and computing was hard. Got 1.08x speedup with lots of work on a GPU. Generalized - Tim's HGS, Alex's Hedgehog. Providing tools for compile-time analysis. UMD looking at dataflow techniques to optimize thread loading and compile-time-identifying bottlenecks.
    • Walid: Easier to compose
    • Tim: type compatibility checking.
    • Leonel: Can allow user to ID workflow, kernels from which graph can be built. BSC using pragmas and hints to create CUDA Graphs.
    • Dmitry: How much does using this approach hurt compatibility wrt other language interfaces? The DF-based algorithm has to be expressed in C++ for tasks, although executable tasks can be in another language.



CUDA Graphs characterization update, Feb 18, 2020

Participants include (9)

  • Wilf, Szilard Pall, Tim Blattner, Piotr, Stephen, Vinay Amatya, Wael Elwasif, CJ
  • Mail problems
    • Many people didn't get either an invite or an email
    • Some (Szilard) are getting MailChimp mails and calendar invites, but most others are not
    • Stephen: Outlook enables an iterative mail merge
    • CJ: Jan event looked like a one-off vs. a series
    • Wilf to pick a small set of people to come up with a workable solution
  • GROMACS - Szilard
    • Thinks that a broader review could be of interest
    • Has time-availability concerns over the next couple of months; similar for other teammates
    • Would like an update from Alan/NV DevTechs working on this
    • May be better to come back after some results
    • Transitioning and modernizing a fairly large code base with the aim of encapsulating tasks and making them amenable to efficient task scheduling. So it could be the right time to enumerate requirements for expressing tasks and dependencies.
    • Interested in what is possible and not
  • Usage model driven - CJ
    • Identify leading examples, go deep
    • First: high-level assessment of necessary features and challenges - can be done without a large investment, prerequisite that guides investment in next level of detail
    • Next: low-level details of selected specifics
    • Three levels: enabling of current features, incremental extensions to current features, big gaps that affect long-term architecture
    • Others are interested in discussions on high-level requirements
    • Interested in usage patterns so we can create worked examples for end to end solutions
  • Stephen on CUDA Graphs characterization
    • Evaluation in the context of COMB public mini-app from LLNL
    • Amdahl can kill concurrency
    • Serial stream time is directly proportional to kernel length
    • Some fixed overhead (15%) for multiple streams but mostly agnostic to kernel length
    • CUDA Graphs incurs fixed overhead to set up graph, get config into the machine, worse than multiple streams
    • CUDA Graphs reuse is a win over all of the above by at least 15%
    • Graph Update reduces cost of updating parameters without changing the topology
    • 33% execution overhead reduction vs. concurrent streams with Graph relaunch
    • use double-pointer dereference to avoid changing parameters in the graph itself
  • Discussion
    • Szilard on non-GPU tasks: haven't optimized CPU callbacks, which a future release targets
    • Tim on traditional data flow: Processors aren't well suited for data flow. Many apps aren't really pure dataflow since they run consumer after producer is completely finished. Building a DF layer on top of this would be another future step.
    • Tim has been trying to map into Hedgehog at NIH. Got 92% efficiency on GPUs. Perhaps combine CPU and GPU interaction via dataflow.


DaCe, CUDA Graphs, Jan 21, 2020

Participants include (18)

  • BillF, CJ, George, JeffL, Jesun Firoz, John Feo, Jose, Muthu, Naveen (HPE), Piotr, StephenJ, StephenO, Tal, Wael, Wilf
  • HiHAT Graphs on CUDA Graphs - slide preview: Media:HiHAT_Graphs_190916.pdf, Wojtek Wasko
    • HiHAT Graphs working on top of CUDA Graphs.
    • For now, CUDA Graphs APIs are called immediately and directly, vs. creating a graph independently and then instantiating it
    • Can be generalized to support targets that don't support CUDA Graphs
    • Can be extended to execute on remote targets
  • DaCe - Tal Ben-Nun
    • Embedded C++ DSL for stateful parallel dataflow programming on stateful DFG
    • Adds states containing tasklets, arrays, map/exit for parametric abstraction, streams, consume/exit for dynamic mapping, and conflict resolution for writes
    • Supports various accelerators
    • Covers a variety of corner cases
    • Includes a development environment that shows source, transformations, properties, generated code
    • Perf
      • Favorable wrt MKL, Halide. Didn't unroll the whole graph like Halide does. Could have used systolic arrays.
      • 90% of CUTLASS
      • Know exactly what and when to copy - key advantage for FPGAs
    • pip install dace
    • Connections to HiHAT, CUDA Graphs
      • Currently using OMP parallel sections for CPUs, GPUs
      • Predates CUDA Graphs
      • Use HiHAT as a back end
    • Questions
      • Analysis of space complexity?
  • CUDA Graphs in QMCPACK
    • Application phases - variational Monte Carlo with and without drift, diffusion Monte Carlo
    • Problem: very many small kernels, seeking to reduce launch overheads and increase concurrency
    • Solution outline: Jared's executors with CUDA Graphs graph capture
    • CUDA 10.2 enables updates after capture -> instantiation cost only paid on structural changes
    • Nsight Compute now supports graph visualization
    • Perf: 1.28x overall, 1.09x from graphs alone, with some as a wash
    • Lessons learned
      • graph update makes things significantly simpler
      • app had much more reliance on the default stream than expected; need to avoid that for concurrency
      • error-check often to understand where errors happen
      • more at GTC



Graphs: HiHAT, QMCPACK, May 21, 2019

Participants include (22)

  • CJ, DavidB, DamienG, Dmitry, George, Hans, Jared, JeffL, Jesun Firoz (PNNL), JohnS, Jose, Millad, Oscar, Piotr, Siegfried Benkner (U Vienna, OCR), Stephen, Vinay, Wael, Walid, Wilf, Ying Wai Li

CUDA Graphs / HiHAT Graphs - CJ for Wojtek

  • "Update on HiHAT Graphs on CUDA Graphs"
  • Showed basic types and APIs
  • Have a simple POC working where HiHAT Graphs calls straight through to CUDA Graphs APIs to create nodes, edges, graphs, instantiate and invoke
  • In the next stage, we'll build up a graph in HiHAT and then instantiate or invoke all of it in CUDA Graphs
  • The HiHAT design is a little more general than current CUDA Graphs
    • Supports

QMCPACK on CUDA Graphs: WIP snapshot - Jeff Larkin

  • "Use of CUDA Graphs in QMCPACK"
  • Moving a subset of kernels into CUDA Graphs to reduce launch latencies and increase concurrency
  • Used Jared Hoberock's Executors Prototype, which is on github
    • standard CUDA launch vs. executor launch
    • kernel functions vs. function objects
    • native graphs vs. executor graphs
  • Now leveraging graph capture
  • Challenges
    • Poorly optimized streams
    • Lack of parameter capture
    • Callbacks are expensive, can only pass back a void*. Have to do a deep copy.
    • Host lambdas didn't always work with executor graphs when merged with other graphs
      • Synchronous execution may make memory lifetime management less of an issue, but as things move to be async, lifetime management can become an issue. This can be an issue especially wrt stack vs. heap memory usage.
      • The executor prototype garbage collection upon graph destruction may have issues.
      • C++ lambdas may capture more temp variables than the programmer realizes.
      • Jared is working on a revision of the prototype and may look at this problem.
      • CPU callback nodes in CUDA Graphs weren't built to be very sophisticated, Stephen is now interested in having another look. May need some auto tracking of lifetimes with a system like Jared's.
      • Elimination of data movement between GPU and CPU was a nice side benefit.
  • Status
    • Significant effort went into getting rid of CPU dependencies
    • Some tools issues got reported
  • Results
    • Some gains without reuse: 9%
    • Recapture API isn't released yet; experimenting with that, expecting instantiation overheads to improve
  • Future work
    • Make graph fatter, which will make better use of GPU
    • Planning to finish backporting kernels to native graphs and executors
    • Expanding scope
  • Lessons learned
    • Avoid host nodes for now
    • Function overloading may break
    • Watch out for copies to/from unpinned buffers - undefined behavior
    • Jared's abstractions were much friendlier to write in
  • Feedback
    • Dmitry: Consider capturing by value
    • Dmitry: Consider reference counting

Graphs and interoperable building blocks, Apr 23, 2019

Participants include (24)

  • Antonino, CJ, Dmitry, Hans, Himanshu, James, Jeff, Jesun, Jiri, Jose, Marcin, MichaelG, Millad, Piotr, Siegfried, Stephen, Vinay, Walid, Wael, Wojtek, Ying-wei and others

"Toward Common and Interoperable SW Infrastructures - ECP Annual Meeting report out", George Bosilca, Mike Heroux and CJ Newburn

  • Increasing interest in sharing common and interoperable infrastructure under mashups that cross app domains
  • 17 projects from DoE shared what primitives and services they want
  • Poll indicates fairly strong (theoretical) interest in sharing

CUDA Graphs update - Stephen Jones

  • Based on CUDA 10.1
    • For three 2 us kernels, there's a 53% overhead in the kernels-only case
    • For the same sequence, reducing launch cost with reused graphs drops overhead to 46%
    • And device-side execution overhead reductions get you to 37% overhead; that's a net 26% reduction
    • With graph relaunch, CPU speedup can be 7+x and GPU-side speedup can be 1.4x for a straight-line graph
    • Overheads per kernel launched: 2.1->0.29us in launch times, 1.57->1.11 per kernel
    • Embedded mobile inference (TU104): 6-11x CPU-side launch, 0.95-3x GPU execution side
  • Can use cudaStreamBegin/EndCapture and replay. Must be replayable. Capture doesn't actually run the captured code, which makes it different from tracing. Also captures parameters to kernels, sizes of memory copies, and arguments to functions.

Graphs and tasking, Mar 12, 2019

Participants include (33)

  • Wilf, Jared, Jose, JohnB, JamesB, JohnS, DavidB, Himanshu Pillai (ORNL), JeffL, JoshS, KamilH, Roman, MikeC, MilladG, OscarH, PiotrL, StephenO, TomS, VinayA, WaelE, WalidK, YingWai Li (ORNL), Ashwin Aji, Paul Besl, SPuthoor, Jesun
  • Executors on CUDA Graphs, Jared Hoberock
    • Explained theory, approach
    • Showed sample implementation and results
    • Described opportunities and limitations
    • Open sourced the sample code here
  • "OpenACC CUDA Graphs", James Beyer
  • OpenMP tasking directions, Tom Scogland and James Beyer
    • Looking at leveraging and working with systems like this, in the context of OpenMP
    • Want OpenMP to work with C++ executors and graphs
    • Replaying graphs may involve a significant refactoring of code outside of a loop to capture and replay
    • Working at creating an interface that enables the creation of a reusable task
    • Need to work out how to pass in parameters
    • Graph construct might contain regions for OpenMP and OpenACC, they might do the update for us as long as it has the appropriate hooks.
    • James: use of side table to track important info
    • CUDA Graphs API can be mapped to by a compiler, but with limited control. The executor approach may lend itself to providing more control.
  • Report out on ECP Breakout, "Toward Common and Interoperable SW Infrastructures", George Bosilca, Mike Heroux and CJ Newburn - deferred to next time

Hierarchical decomposition, Oct 16, 2018

Participants include (21)

  • BillF, CJ, DavidB, Dmitry, HansJ, James, MarcinZ, Mehdi, MikeB, Oscar, Piotr, StephenO, Swaroop Pophale, Vinay, Wael, Wilf, Wojtek, Ying Wai Li

Graph gen

  • SLATE now
  • May be relevant for DMRG++

Data partitioning

  • ExaTensor
  • Legion implements user decisions
  • PNNL

ExaTensor - Dmitry Liakh

  • Recursive tensors
  • Depth of recursion controlled by user, uniform across targets, stops at node level now
  • Deep recursion, for hetero nodes, not currently supported; HiHAT could be relevant there
  • Data partitioning function - single, provided by app, user adjustable parameters and filters
  • Work creation function - single, provided by app, guided by existing recursive data partitioning
  • Graph generation function - no explicit graphs now

Adaptive Mesh Refinement - Mike Bauer

  • Principle: only model what matters - unstructured or structured but hierarchical meshes
  • Background on AMR
  • Hierarchy of data, computation, SW
  • Lends itself well to recursion
  • Mapping to machine resources is difficult

C++ Wrappers for HiHAT, Sep 18

Marcin Copik, ETHZ

Participants include (15)

  • David Bernholdt, Dmitry Liakh, Ferrol Aderholdt, James Beyer, Jose Monsalve, Marcin Copik, Millad Ghane, Muthu Baskaran, Wael Elwasif, Wilf Pinfold, Wojtek Wasko, Ying Wai Li, CJ

C++ wrappers for HiHAT

  • Not trying to cover all of HiHAT initially, WIP
  • With Tal Ben-Nun, who is traveling this week
  • Goals: simplify - cleaner, more robust, human readable; namespaces, RAII; analogous to SYCL vs. OpenCL
  • Namespaces
    • hh:: and hh::experimental but not hh{u,c,n,h,e}
    • maybe nesting, like hh::graph
    • distinction between user and common layer? Encapsulation, ease of use, but also bare bones for performance. Fine to start with user layer and extend to common layer.
  • Wrapper classes
    • No error handling
    • Wrap C objects, query members, get it directly; optimized for most common scenario, e.g. defaults for common parameter values
  • Graph API
    • Walked through sample codes for C API and corresponding C++ API
    • Introduced | and & operators to capture serial or parallel dependence relationships among nodes -> easy to specify and read
    • Overload function call operator, enhance simplicity for static case, to complement dynamic forms
  • Enum-based flags
    • classes offer implicit scope, static type checking
    • but could lead to less-robust interfaces if Enum compression (compiler switches) introduces incompatibilities wrt struct and parameter sizes
    • Enum flags are often OR'd together as an int, maybe provide implicit conversion operator inside Enum class to avoid lots of typing
    • Mitigated by having a header-only library, maybe make the change to use Enums in C++ only
  • Error handling
    • The C API uses error codes; use the HIHAT_CHECK macro to handle them
    • C++ options
      • return only error codes - what if ctor fails?
      • exceptions - every function has to throw for consistency
      • return both value and error codes, which requires expected-like implementation
    • Marcin and Tal: prefer exceptions
    • MichaelG: adding exceptions to libs can cause problems later, and exceptions would be highly undesirable if a device-side interface were ever to be added
    • CJ: it's conceivable that a HiHAT instance could some day run on a less-capable device like a GPU, in CUDA. I'd urge caution about precluding that.
    • Dmitry: Can we plumb a C++ user function that wants to use C++ exceptions through a C API? He'll consider this, maybe offer an example.
  • Minor suggestions
    • Sync clean-up: more-comprehensive destruction, e.g. collapse destroy, sync, free with async postComplete
  • Questions
    • Preferred way of handling errors? Proposed exceptions, but want to consider broader retargetability. Need to work through examples with C++ implementations.
    • Enums vs. ints in function parameters? Yes, in C++
    • Header-only on top of HiHAT C lib? Yes
    • How do we make programming easier? Graphs, shortcuts
    • Target only common usage scenarios? Yes
  • Intending to provide a prototype implementation
  • We'll point more folks to this for broader review

Graphs: HIHAT and DMGR++, Aug 21

Participants include (22)

  • Arghya Chatterjee and YingWai Li and Oscar Hernandez (ORNL), CJ, Dmitry, JamesB, Jeff Larkin, Jose Monsalve, Marcin, Michael Garland, Mike Bauer, Monzan, Stephen Jones, Stephen Olivier, Szilard, Tal, Walid, Wilf, Wojtek, Wonchan

Graph API overview, Presentation

  • CJ Newburn and Wojtek Wasko
  • Stephen: Does support for concurrent launch belong in this interface? CJ: Maybe, but perhaps that's just the responsibility of the HiHAT client above.
  • Stephen: How about making sure that senders and receivers are ready, in case there's a large delay? CJ: Seems like this should be something that's required of implementations rather than the dispatch architecture, and relegated to the trait system.
  • Dmitry: what about critical sections? CJ: Sounds like OpenMP, wanting to work with Tom Scogland and others on OpenMP mapping.
  • Tal: Been working with HiHAT in their context, want to propose a C++ interface. They posted an example to Google drive
  • CJ: Several apps/runtime folks meeting weekly on support for graphs; feel free to join us and get access to detailed work on that

Graph characteristics for DMRG++, Presentation

  • Arghya (Ronnie) Chatterjee, Yingwai Li, Oscar Hernandez - ORNL
  • Density Matrix Renormalization Group (DMRG++)
  • Graph characteristics
    • No dependencies among cells, but dependencies among tasks for reduction
    • Graph may change over time or may depend on dynamic data
    • Load imbalance across patches within matrices gets worse in later phases. OpenMP fork/join/barrier overhead can get horrible. Need a better way to manage dynamic scheduling; needs more investigation.


Dynamic tasking and HiHAT Graphs, July 17

Participants include (23)

  • CJ, David Hollman, Stephen, Oscar, Charles Jin, Damien Genet, David Bernholdt, Dmitry, Andrew, Antonino, Jeremy, Jesun Firoz, JohnB, Jose Monsalve (UDel), Millad, Sunita, Swaroop Pophale (ORNL), Vinay, Walid (NIST), Wilf, Muthu Baskaran

Requirements and design for HiHAT Graphs

Dynamic graphs

  • Dmitry, DavidH, Andrew, ...
  • Cases for dynamic addition
    • Prior to instantiation
    • After instantiation - continuation with same resources
    • After instantiation - follow on with new resources
  • Cases for dynamic selection
    • Superset of nodes on superset of resources
  • Discussion
    • Static - structure known before instantiation; dynamic - structure not known before instantiation; semi-static - multiple alternatives known up front
    • Andrew: May create a CSR matrix, stop and do an update, create new CSR and continue
    • Dmitry: Additional stage for partial instantiation?
    • CJ: Could/should node creation, instantiation and invocation be fully async so that a later phase can be pipelined
    • Stephen: instantiation has computational complexity of O(# nodes); invocation is O(1) per graph
    • Stephen: what if different iterations induce different subgraphs?
    • Sunita: can partition a matrix into tiles; may have special treatment for the leading diagonal - dependencies across a wavefront; well explored for FPGAs, where different modules are created. Consider Smith-Waterman, which is well implemented in CUDA but not with tasks.
    • DavidH: Still thinking about cost trade-offs, e.g. can get exponential explosion with unions. In situ graph modification vs. up front "unionizing." Like MPI+X's two-layer programming model layered across different HW where the ability to express generically and make trade-offs in compiler or runtime is needed
    • CJ: Could create a template subgraph, instantiate it, and generalize the template subgraph and its . DavidH: Jonathan Liflander had an SC18 paper submission on dynamic caching of common subgraphs.
    • Jesun: level-correcting algorithms may not wait for all predecessors before starting execution, and they may iterate on updates until signaled to terminate; would that be supported? If it's a DAG, you can't get deadlock. He'll forward some more notes on this.
    • Jesun: Termination determination is another concern. Trade-offs between managing that by runtime or app developer. Level-based algorithms always proceed forward. With general async, one can go back and forth across levels. He's willing to add some content to the Modelado site.
    • JohnB: For creating more work from GPU, you could create an additional CPU node from which the Graph APIs could be called.
    • Dmitry: Or this could be done with a CUDA callback.

Report out on C++ Layering on GPUs workshop, June 19

Participants in this June 19 phone meeting include (16):

  • Andrew, Antonino, CJ, Dmitry, Gordon, Jared, Jiri, Max, Michael, Mike Chu, Millad, Stephen, Tim Blattner, Walid, Wilf, Wojtek
  • Workshop participation (24 local, 7 remote)
    • CSCS of Switzerland, German/Switch Universities, DoE Labs, NVIDIA, Codeplay
  • Overview
    • Looked at how to layer C++ on CUDA Graphs, plumbing down through executors/futures, HiHAT
    • Reviewed in light of application requirements
    • 9 groups indicated interest in contributing code to this collaborative effort
  • Points of agreement
    • Graphs are an abstraction of interest. It looks like graphs can be built up using the (revised) proposals for executors and futures. We need to work through examples to build confidence in this. Different executors will be necessary for static building or dynamic building of graphs.
    • CUDA Graphs look interesting enough to try. Of particular interest are lowering overheads in support of fine granularity, graph reuse, graphs with control flow in them. We should collaborate on creating a set of proofs of concept implementations of executors in support of CUDA Graphs.
    • Some runtimes implement primitives for several targets and may benefit from HiHAT. They can try out HiHAT to see if it provides ease of use, simplicity, performance and robustness. Those interested can “spend a day” identifying a candidate use, studying the API doc and doing a trial run with HiHAT.
  • CUDA Graphs
    • Benefits: reducing overheads for small tasks, repeated graphs. Resource management, dynamic control flow, CPU-less interop. Improved GPU utilization.
    • Workload and framework reps rated various aspects of these benefits
  • Flow
    • Create one or more graphs consisting of actions as nodes, with dependencies
    • Partition into subgraphs by target; bind vertices and order them
    • Augment with additional vertices as required to manage memory, data movement sync
    • Augment with additional vertices for interactions among graphs
  • HiHAT
    • Can wrap CUDA Graphs, support multiple targets with potentially fewer restrictions and better affinity
    • Enable portability for handling graphs as a whole, not just a collection of vertices
    • Enable interactions among graphs, for same or different targets
    • Working on APIs to wrap graphs
  • Discussion
    • Andrew: There's some consideration of updating the interfaces for Boost Graph Library to a more modern version of C++. Could be beneficial to have some 2-way communication about this. Andrew to follow up.


C++, CUDA Graphs, May 15

Participants include (32):

  • Andrew, Andrew, BillF, Carter, DavidB, Damien, Dmitry, Walid, George, Hans, Hartmut, Jesun, Jiri, JohnB, JoseM, Louis Jenkins, Manju, MichaelG, MikeB, Millad, MarcinZ, Piotr, Siegfried, StephenJ, Szilard, Umit, Wael, Walid, Wojtek

C++ Directions - Michael Garland

  • Beyond parallelism to async, data coordination
  • Key ingredients
    • Identify things: pointers, iterators, RANGES
    • Identify place to allocate storage: allocators
    • Identify place to execute threads: executors
    • Identify dependencies: FUTURES
    • Identify affinity: conforming INDEX SPACES for threads and data
  • C++17 parallel algos
    • for_each(par - execution policy, begin, end, function)
    • NVIDIA's Thrust library, for CUDA C++
  • Executors
    • Multiplicative explosion between diverse control structures and execution resources
    • Mediate access with a uniform abstraction
  • Asynchrony
    • Async keyword
    • Chaining - maximize flexibility, composability, perf [interoperability]
    • Wish to not hide dependencies, to not inadvertently bind code execution
    • Permits more implementations, including HW mechanisms and construction of graphs
  • Developments
    • Generalized actions: invocation, mem mgt, data movement, sync
    • Standalone ops --> predecessor actions in context --> description of input data
    • Async/deferred vs. immediate

CUDA Graphs - Stephen Jones

  • Forthcoming feature in CUDA + some research ideas
  • Want more insight? Have questions? Let us know so you can join follow up sessions with more detail.
  • Graphs
    • Per node, any GPU or CPU, fan-in/out to any degree, multiple root/leaf nodes
    • Provide more context (semantic, resources) toward HW
    • Define -> instantiate -> execute
  • Concurrency
    • More concurrent than streams, which are used for ordering with other work
  • Reduce overheads
    • Invoke many actions vs. one - O(us)
    • Reduce kernel-kernel latency - O(us)
    • Building, e.g. resource binding, bookkeeping can be done offline
    • Avoid centralized bottleneck as processing is distributed to targets
    • Most relevant for many small kernels
  • Can't do with this
    • Automatic placement; best choice may depend on data locality which is not known at execution layer
    • Only execution vs. data dependences
    • No splitting or merging of graph nodes
  • Can do with this
    • Rapid re-issue
    • Hetero node types
    • Cross-device dependencies
  • Discussion
    • JohnB: CPU code on GPU? Pre-bound, but flexible.
    • Dmitry: dependence management
    • Wojtek: HiHAT is retargetable by design, wraps the pluggable implementation which is CUDA Graphs. Supports interoperable implementations of sync, data movement in ways that support Michael's signal, evaluate, schedule, launch stages

DARMA: A software stack model for supporting asynchronous, data effects programming, Apr 17

Presenter Jeremy Wilke, Sandia

Attendees (31) included

  • Ashwin Aji (AMD), Benoit Meister, Carter Edwards (NV), David Bernholdt (ORNL), DamienG, DmitryL (ORNL), Hans Johansen (LBL), James Beyer (NV), Jeremy (Sandia), John Biddiscombe (CSCS), Kamil Halbiniak, Marcin Zalewski (PNNL), MauroB (CSCS), Michael Garland (NV), Michael Wong (Codeplay), Millad Ghane (Houston), Muthu Baskaran, Oscar (ORNL), RonB, Ruyman Reyes (Codeplay), Szilard, Wael Elwasif (ORNL), Walid Keyrouz (NIST), Wilf, Wojtek Wasko (NV)

Jeremy's talk

  • Express tasks with flexible granularity
    • Elastic tasks
    • Breadth first or depth-first
  • Relevant apps
    • Dynamic load balancing: Multi-scale physics, PIC, tree search, AMR with fast shockwave
    • Semi-static load balancing: PIC load balancing, block-based sparse linear solvers with irregular sparsity, AMR with slow shockwave
    • Static, flexible granularity: tile-based linear algebra, FE matrix assembly, complex chemistry
  • Data effects
    • extract concurrency, focus on locality and granularity (size, shape, boundaries)
    • Permissions for immediate and scheduling: modifiable, read only, none
  • Metaprogramming within C++
    • C++ wrapper classes AccessHandle<T>
      • Like a future, but no blocking get method
      • Required for dependence analysis
    • Capture
    • Task creation functions
  • Debugging
    • Shared memory only vs. distributed, sequential
  • Layering/backend
    • Charm++ (tested at scale), MPI/OpenMP (POC), HPX (LSU/Hartmut POC and Thomas), Kokkos (WIP), std::threads (done)
  • C++ activities
    • Executors, futures, span/mdspan, atomics, deferred reclamation through hazard pointers and RCU
  • Induced requirements, layering
    • Lower layers handed dependencies in a DAG
    • DARMA data structures are operands of tasks
  • Discussion
    • JohnB: how do these permissions go beyond C++ const? J: We use const
    • JohnB: Observation - the richer the scheduler, the greater the complexity
    • OscarH: can tasks also be data parallel, e.g. OpenMP? J: Yes. Significant engineering problems with nested parallelism. Express a cost/perf model for elasticity of tasks, not fully defined for DARMA.
    • CJ: hierarchical approach? J: Left up to lower layers, as guided by perf models. Both DARMA and Kokkos have parallel_for. Lower layers do binding and ordering.
    • CJ: DAG must be materialized in its entirety? What about dynamic task generation? J: Yes.

Asynchronous operations support in HiHAT, Mar 20

Presenter: Wojtek Wasko, NVIDIA

Attendees (24) included

  • Andrew, Ashwin Aji, BillF, CJ, Dave Bernholdt, Dmitry, Gordon, Jesun Firoz, John Biddiscombe, Jose Monsalve (UD), Marcin Zalewski, Marcin, Mauro, Michael Wong, Mike Bauer, Piotr, Szilard, Umit, Walid Keyrouz, Wojtek Wasko, Wael, Wilf
  • HiHAT Async Operations
    • Background
      • Reviewed actions, action handles, sync objects
      • ActionHndl owned by HiHAT dispatch layer
      • SyncObject and its format, management, and interaction owned by the pluggable implementation
    • Requirements
      • Action handles
        • Used to link according to dependences
        • Can logically combine (and/or) multiple ActionHndls into a single result - optimize for the common case of 1 reaching input dependence
        • Can query state
        • Can obtain underlying (post-dominating) sync object; debate: blocking or non-blocking wrt pluggable implementations
      • Sync objects
        • Based on object's description or type
        • Must be able to completely bypass HiHAT and enable direct communication with legacy code
    • Samples
      • Provided in a CUDA context; we welcome examples and contributions for other architectures
    • Discussion
      • Support full async, e.g. in querying sync object that may not have been provided by underlying pluggable implementation?
      • Can support both blocking query for sync object and non-blocking "is it available yet" API
      • MPI erred on the side of adding more APIs, and had both blocking and non-blocking APIs
      • Can always have a blocking implementation
      • Preference was for a richer API
      • Wilf: can allocation, data movement, invocation, and code loading happen in a deferred fashion? We might wish to not over-strain memory capacity for code and data, for example. CJ: Yes, everything is async and the underlying pluggable implementation makes those things happen when ready. There can be backpressure at enqueuing time, but we don't have a sample implementation of that yet.
  • Upcoming community activities
    • Workshop at CSCS in Zurich June 10-11. Likely focus is layering for async tasking, e.g. C++/HPX/maybe HiHAT/CUDA Graphs. Let us know if you're interested. John Biddiscombe of CSCS is hosting.

Hierarchy, Part II Feb 20

Attendees included

  • CJ, John Stone, Antonino, Gordon, Kamil, Roman, Marcin, Mauro, Michael Garland, Michael Wong, Millad, Piotr, Siegfried, Szilard, Walid, JamesB, Mauro, AndrewL, David Bernholdt, Dmitry, Jose, Oscar, MarcinZ, Stephen, rfvander, manz551, Wael Elwasif, Umit Catalyurek, John Biddiscombe, Wilf, Ashwin
  • Recent community activities
    • LifeSci Tech Summit at NVIDIA - discussions about CUDA Graphs
    • OpenMP - interest in implementing HiHAT under OpenMP
    • CSCS - considering a workshop on layering infrastructure under C++, e.g. HPX / HiHAT / CUDA Graphs. John Biddiscombe extended an invitation to help prepare for that.
    • ECP Annual meeting - HiHAT was mentioned a couple of times as a possible candidate for underlying infrastructure, several side meetings, including for memory and storage interfaces and OpenMP
  • Possible approaches for hierarchical dense linear algebra kernels (Vivek)
    • Express code simply - flat or polyhedral frameworks
      • AndrewL: is template metaprogramming considered simple? Vivek: was focused on representation vs. linguistic syntax, which is orthogonal
    • Explore space of data distributions - hierarchical, block/cyclic, AoS/SoA
    • Explore hierarchical decompositions - async, hierarchical place trees
      • Can place onto various places in the hierarchical tree, thereby separating semantics from perf tuning
    • Explore affinity hints/declarations - temporal locality
  • ExaTensor: hierarchical processor of hierarchical tensor algebra - Dmitry Lyakh
    • Hierarchical decomposition
    • Replicate with prefetch
    • Currently focused on block sparse, hierarchical, with no predefined regular pattern -> heuristics and predetermined data mappings
  • "Longing for Portability of Performance" (Weather/climate) - Mauro Bianco
    • Single source code
    • Stencils with complex dependencies
      • 10s-100s per time step, fairly big tasks but don't necessarily fill a GPU, more than one thread
      • conditional execution, limited halo lines
    • Data parallel and task-oriented
      • From a dev point of view
      • User-managed granularity is not portable - auto *splitting* isn't generalizable
      • Aggregation seems more applicable - thread switching -> function calls, inlining. Still not universally optimal, e.g. parallel scan
      • Express the *finest granularity*, else function call too big, coarsening with inlining is possible. Inspector/executor model? May need new *algorithmic patterns/motifs* to make data parallel, e.g. with a keyword.
      • CJ: what are the prospects for expressing parallelism and data in a hierarchical way to a scheduler which can decompose as needed, vs. reactively? Limitations wrt particular application domains? Mauro: Sequoia tried to do some of this; limited progress. Vivek: compiler support helps with granularity, runtime helps with the mapping; expressing at fine-grained level
  • Looking forward
    • Transition from static to dynamic

Data management interfaces: What we have now, user requirements, what we need, Jan 16, 2018

Attendees included Wilf, George Bosilca, Gordon Brown, Jeff Larkin, Millad Ghane, John Stone, Jose Monsalve, Damien Genet, Michael Garland, Stephen Olivier, Bill Feiereisen, James Beyer, Antonino Tumeo, David Bernholdt, John Feo, Marcin, Szilard Pall, manz551, Siegfried Benkner

Presentation

Interest in enumeration of resources: Gordon @ Codeplay

Feedback

  • PNNL has a project for HIVE on Exascale platforms that may include GPUs. John Feo: Content here is relevant for scheduling, data marshaling.
  • Umpire/CHAI guys not here today, but evaluating that
  • Multi-dimensional support (George): may be done above the HiHAT layer, these interfaces seem adequate
  • What about thread safety? (George)
    • Often use memory contexts that are shared among threads on the same socket
    • HiHAT clients (above) can coordinate between what's been registered, e.g. a thread-safe implementation, and what that client wants
  • New implementations could be registered dynamically
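
The dynamic-registration idea above can be sketched as follows. This is purely an illustration, not an actual HiHAT API: implementations advertise traits such as thread safety when they register, and a client selects one whose traits match its requirements.

```python
class AllocatorRegistry:
    """Hypothetical registry: implementations register with traits,
    clients select by the traits they require."""

    def __init__(self):
        self._impls = []  # list of (traits, allocate_fn) pairs

    def register(self, allocate_fn, **traits):
        # New implementations can be registered dynamically, at any time.
        self._impls.append((traits, allocate_fn))

    def select(self, **required):
        # A client asks for what it needs, e.g. a thread-safe implementation.
        for traits, fn in self._impls:
            if all(traits.get(k) == v for k, v in required.items()):
                return fn
        raise LookupError("no registered implementation matches %r" % (required,))

registry = AllocatorRegistry()
registry.register(lambda n: bytearray(n), memory="host", thread_safe=True)
registry.register(lambda n: bytearray(n), memory="device", thread_safe=False)

alloc = registry.select(memory="host", thread_safe=True)
buf = alloc(64)
```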


Hierarchy, Nov 21

Attendees included: James Beyer, Jeff Larkin, Max Grossman, Vivek Sarkar, David Bernholdt, Dmitry Lyakh, Michael Garland, Mike Bauer, Piotr, Ruyman Reyes, Siegfried Benkner, Stephen Olivier, Szilard Pall, Wael Elwasif, Wilf, Kamil Halbiniak & Roman, Millad Ghane, CJ Newburn

  • Vivek Sarkar, GA Tech
    • Habanero, CnC, OCR
    • Places - can distribute data, can use type system to distinguish between local/global
    • Locality-aware scheduling using hierarchical place tree
      • Different abstractions for diff HW, e.g. diff # levels and kinds of memory
      • Affinity annotations - can express preferences vs. hard assignments
      • Can pass abstract work down the hierarchy, do work stealing at lower levels
      • Supports spatial and temporal sharing
    • Undirected graph, not just a tree; use trees where profitable
    • Multiple levels of parallelism and heterogeneity
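
The hierarchical place tree can be sketched minimally as follows. This illustrates the idea only, not Habanero's actual API: places form a tree, a task is pushed to a preferred place, and idle leaves steal from ancestors, so affinity is a preference rather than a hard assignment.

```python
class Place:
    """One node in a hierarchical place tree (illustrative names)."""

    def __init__(self, name, parent=None):
        self.name, self.parent, self.children, self.queue = name, parent, [], []
        if parent is not None:
            parent.children.append(self)

    def push(self, task):
        # Affinity annotation: express a preference by choosing a place.
        self.queue.append(task)

    def steal(self):
        # Look for local work first, then walk up the tree: abstract work
        # passed down from above gets stolen at the lower levels.
        p = self
        while p is not None:
            if p.queue:
                return p.queue.pop(0)
            p = p.parent
        return None

root = Place("node")
socket0 = Place("socket0", root)
core0 = Place("core0", socket0)
core1 = Place("core1", socket0)

socket0.push("task-A")   # preference: anywhere under socket0
core0.push("task-B")     # stronger preference: core0 specifically

first = core0.steal()    # local work wins
second = core1.steal()   # idle sibling steals from the shared ancestor
```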
  • Dmitry Lyakh, ORNL
    • ExaTensor, a distributed tensor library based on hierarchical data representation
    • Adaptive dynamic block-sparse representation of many-body tensors
      • Resolution of each block is dynamically adjustable
    • Recursive definition of storage, computational resources.
      • Data centric - induces task decomposition. But tasks can follow data and be aggregated. Data storage granularity and task granularity are decoupled.
      • Computational resources are encapsulated as virtual processors - can do linear algebra, tensor ops
    • Targeted for Summit
  • Mike Bauer, NVIDIA
    • Legion
    • Adaptive mesh refinement, algebraic multigrid
      • Very dynamic behaviors - partitioning changes, data created and destroyed at runtime, depends on domain-specific knowledge
      • Folks at LANL now working on AMR in Legion
    • Hierarchy
      • Levels that correspond to levels of details; may correspond to different data structures
      • Hierarchical decomposition of the same data structures into different levels
    • Requirements
      • Want primitives for describing partitions efficiently and effectively - can be error prone
      • Capture descriptions in DSL that apply across many different kinds of apps
  • Discussion
    • Provisional/on-demand decomposition
      • ExaTensor: User specifies work to do in high-level DSL. Automatic decomposition, e.g. based on maintaining arithmetic intensity, data transfer bandwidths. Not much to trade-off since limited by DSL. Expected a need for further generalization as the scope of apps is expanded.
      • Legion: Often 2 ways to decompose - 1) breadth, across nodes or processes within node, 2) depth, e.g. NUMA, across GPUs or SMs. Both may need to change dynamically and that needs to be efficient. No one size fits all. Apps describe the partitioning algorithms. Mappers pick the best decomposition for a given target. Tunable parameters are specified to the mapper.
      • Vivek: dynamic code generation has a part to play; specialize to runtime-determined data distribution; has a student experimenting with this on CNN inner loop bodies wrt data characteristics
    • Follow up
      • Pick some concrete examples and work them through
      • Trade-offs based on cost models, e.g. recompute vs. fetch halo data
        • MikeB: Recomputing could make code complexity high, hard to maintain. Could be relevant for Halide.
        • Dmitry: compression could be a factor
        • Jim Demmel @ Berkeley: Communication-avoiding algorithms
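
The recompute-vs-fetch trade-off above can be framed as a back-of-the-envelope cost model; all numbers below are illustrative placeholders, not measurements.

```python
# Toy cost model: recompute the halo locally when that is cheaper than
# fetching it over the interconnect. Bandwidth, latency, and flop rates
# here are round illustrative defaults, not real machine parameters.

def halo_fetch_time(halo_bytes, bandwidth_Bps, latency_s):
    return latency_s + halo_bytes / bandwidth_Bps

def recompute_time(halo_flops, flops_per_s):
    return halo_flops / flops_per_s

def should_recompute(halo_bytes, halo_flops,
                     bandwidth_Bps=1e10, latency_s=1e-6, flops_per_s=1e12):
    return recompute_time(halo_flops, flops_per_s) < \
           halo_fetch_time(halo_bytes, bandwidth_Bps, latency_s)

# A thin, cheap-to-compute halo favors recomputing:
cheap_halo = should_recompute(halo_bytes=1e4, halo_flops=1e5)
# A compute-heavy halo favors fetching the data:
costly_halo = should_recompute(halo_bytes=1e4, halo_flops=1e9)
```

As MikeB notes above, such a model ignores the code-complexity cost of recomputation, which is often the deciding factor in practice.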

Comparison of SYCL and HiHAT; Supporting Hetero and Distributed Computing Through Affinity, Michael Wong, Oct 17

Attendees, including:

  • Michael Wong, Andrew Lumsdaine, Carter Edwards, Jiri Dokulil, Kath Knobe, David Bernholdt, George Bosilca, Gordon Brown, Jeff Larkin, Marcin Zalewski, Mauro Bianco, Max Grossman, M Copik, Michael Garland, Piotr, Ruyman Reyes, Siegfried Benkner, Thomas Herault, Wael Elwasif, Wilf, rfvander, Stephen Olivier, Szilard Pall, Wojtek Wasko, James Beyer, Sean Treichler, Bill Feiereisen, Dmitry Liakh, Jesun Firoz, dgenet, Millad, OscarH, Ashwin Aji
  • SYCL
    • Retargetable from C++, entirely standard C++ (no keywords, no pragmas)
    • Single-source host and device compilation model
    • Separate storage and access of data, specify where data stored/allocated
    • Task graphs
    • C++ lacks ABI, have to provide symbol name for kernel, could be improved with static reflection
    • Several apps on top of SYCL now: vision for self-driving cars, ML with Eigen and Tensorflow, parallel STL with ranges, SYCL-BLAS, Game AI
    • Comparison with Kokkos, HPX, Raja
      • All: C++11/14, execution policies to separate concerns, shape of data, aiming to be subsumed by future C++
      • SYCL: mem storage vs. data model, dependence graph, single source/multi-compilation
      • Kokkos: mem space, layout space
      • HPX: distributed computing nodes, execution policy with executors
      • Raja: IndexSet, segments
    • Proposing C++2020 hetero interface
    • SYCL for HiHAT
      • SYCL aligning more with C++ futures/executor/coroutines
      • Exploring HiHAT vs. OpenCL as low-level interface, layered below ComputeCpp, which is target agnostic
      • Plug in binary blobs for vendor-specific components
      • Async API
      • Enumeration of device-specific capabilities
      • Time-constraint ops, for safety critical SYCL
        • In context of safe and secure C++
        • Removing ambiguities and undefined behaviors
        • Componentized, multi-layer, well tested
      • Could use for alloc, copy, invoke
    • Codeplay biz and HiHAT
      • Licensing, IP protection
      • HiHAT in their stack
      • Certification for HiHAT-compliant devices/implementations
  • SC17 BoF on Distributed/Hetero C++ in HPC
  • Workshop on Distributed/Hetero Programming in C++, IWOCL, Oxford
  • C++ P0796r0: Support Hetero and Distr Computing Thru Affinity
    • Resource querying - dynamic? hwloc, which is primarily hierarchical?
    • Binding and allocation
    • Affinity - relative? migration?
    • CJ: Please extend to support an async interface to allocation and affinitization
    • Wilf: app developer manages policies like affinity?


Recent developments, profiling, OpenMP, enumeration and other features, Sep 19

Attendees, including:

  • Jeff Larkin, Francois Tessier, Ronak Buch, Marc Snir, Piotr, Samuel Thibault, Wael Elwasif, Wilf, Carter Edwards, John Stone, Benoit Meister, UD CIS, Marcin Zalewski (PNNL), Walid Keyrouz, Firo, Sunita Chandrasekaran, James Beyer, Stephen Olivier, Galen Shipman, Hartmut Kaiser, Jesun, Jiri Dokulil, Stephen Jones, Mike Bauer, Millad Ghane, Siegfried Benkner
  • Recent developments
    • Repo with CLA being readied at Modelado
    • Good feedback and interactions at DoE Perf Portability Workshop in August, led to more of a common view, more collaboration
    • HiHAT paper accepted at (Post-Moore's Era) PMES workshop at SC17
    • Expecting a 2-hour meeting on HiHAT at SC17 to share progress, usage and plans for HiHAT
    • Affinity BoF @ SC17 (Emmanuel Jeannot) looks to be relevant; let's plan some pre-work
  • Profiling (Samuel Thibault)
    • What can be profiled, what can be done (callbacks now, counters coming)
    • Request: share input on additional states that should be profiled
    • Walked through an example with different allocations, copies, invocation on CPU, GPU
    • Showed integration with StarPU, which uses the Paje trace format and the ViTE viewer from its tool suite
  • OpenMP above and below HiHAT (James Beyer)
    • Use omp parallel for inside a task. Can warm up an OpenMP hot team with a separate invocation off of the critical path.
    • Replacement of GOMP_parallel with HiHAT trampoline - HiHAT-based scheduler for improved retargetability
    • OpenMP affinity - progressing from HiHAT ignoring it, being informed about it, and invoking from one set of threads to another set
    • Resource subsets
    • Looking at trying this out in Clang
  • Enumeration and other features in HiHAT (CJ Newburn)
    • Currently using inlined source code with static initializers, one for each kind of platform
    • Integrating with hwloc to automate that. Will support NUMA nodes, CPUs and memories that hang off of them, accelerators (FPGA, GPU, DLA, PVA, etc.) and their memories
    • Plugin of target-specific implementations
      • Each API, at user and common layer
      • Allocator, at each memory instance
      • Selected by resources (e.g. copy endpoints) and a chooser
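
A sketch of the plugin selection just described, with hypothetical names throughout: target-specific copy implementations register per endpoint pair, and a chooser falls back to a generic staged copy when no exact match is registered.

```python
# Illustrative chooser for pluggable, target-specific implementations.
# Implementation names are placeholders, not real HiHAT plugins.

copy_impls = {}  # (src_kind, dst_kind) -> implementation name

def register_copy(src_kind, dst_kind, impl):
    # Each target-specific implementation plugs in for the resources
    # (copy endpoints) it serves.
    copy_impls[(src_kind, dst_kind)] = impl

def choose_copy(src_kind, dst_kind):
    # Chooser: prefer an exact endpoint match, else a generic fallback.
    return copy_impls.get((src_kind, dst_kind), copy_impls[("any", "any")])

register_copy("any", "any", "staged-copy-through-host")
register_copy("host", "gpu", "direct-host-to-device-copy")
register_copy("gpu", "gpu", "peer-to-peer-copy")

picked = choose_copy("host", "gpu")
fallback = choose_copy("fpga", "gpu")   # no exact match registered
```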
  • Solicitations
    • Language layering, e.g. task implementations
    • Memory management, e.g. allocator tradeoffs - per-thread nurseries, same-size allocators, etc.
    • Thread management, e.g. Qthreads, Argobots
    • Client layering, e.g. SHMEM


StarPU and HiHAT; HiHAT design ideas - Aug 15, 2017

Attendees (33), including:

  • Andrew Lumsdaine, Antonino Tumeo, Ashwin, Benoit Meister, CJ Newburn, D Genet, George Bosilca, Gordon Brown, James Beyer, Jose Monsalve, Kath Knobe, Marc Snir, Max Grossman, Michael Garland, Millad Ghane, Naoya Maruyama, Oscar Hernandez, Pall Szilard, Piotr, Ronak Buch, Samuel Thibault, Siegfried Benkner, Stephen Jones, Stephen Olivier, Sunita Chandrasekaran, Thomas Herault, Wael Elwasif, Wilf Pinfold, Wojciech Wasko
  • StarPU, Samuel Thibault, INRIA
    • Supports sequential semantics, general retargetability, out of core, clusters, memory consumption control
    • Can simulate execution - "as-if" performance prediction
    • HiHAT wish list
      • Low-overhead interface to HW layers
      • Reusable components - perf models, allocators, tracing, debugging
      • Task-based interface, plus OpenMP, helpers for outlining and marshaling
      • Interoperability
      • Shared event management - user-defined events, interop with what's not covered by HiHAT
      • Memory alloc - uniform low-level API, efficient sub-allocator, same-size memory pools, hierarchical balancing
      • Disk support: store/key/value
      • Prioritized actions
    • Some responses
      • Very good alignment with HiHAT
      • Pluggable implementations can provide memory management; tuner can provide an allocator per memory device, or share them
      • Event system will provide abstractions over implementation-specific events and semaphores in memory
      • Simulation could be an interesting service above HiHAT
        • Something above HiHAT is in control
        • Emulation/simulation above HiHAT acts as if the actions really happened
    • Possible joint investigations
      • Whether anything special is needed for specialized allocators, memory load balancing
        • Interested: Samuel, Marc, Benoit, George, Ashwin, Max
        • Compare with MPI allocators that deal with fixed size and conflict avoidance, consider progress thread implications, developments in MPI 4 endpoints (Marc Snir)
      • Tracing formats and debugging
        • Interested: Samuel, Max
    • Implementation and interface exploration
      • Prioritized actions
      • Integration of MPI wait with action dependence system
        • Interested: Marc, Jesun, Samuel, Andrew, George
    • Discussion
      • How do you handle running out of memory? See paper on memory control on StarPU website. Increase the granularity of what's submitted for execution.
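
The "same-size memory pools" item on the wish list above amounts to a fixed-size free-list allocator; a minimal sketch follows (illustrative names, not StarPU or HiHAT code).

```python
class FixedSizePool:
    """Same-size memory pool: one free list per block size amortizes
    allocator overhead and makes release/reuse O(1)."""

    def __init__(self, block_size, capacity):
        self.block_size = block_size
        self.free = [bytearray(block_size) for _ in range(capacity)]

    def alloc(self):
        if not self.free:
            # A real sub-allocator would grow the pool or evict here.
            raise MemoryError("pool exhausted")
        return self.free.pop()

    def release(self, block):
        self.free.append(block)  # constant-time reuse, no general malloc

pool = FixedSizePool(block_size=4096, capacity=2)
a = pool.alloc()
b = pool.alloc()
pool.release(a)
c = pool.alloc()   # reuses a's storage rather than allocating fresh
```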
  • HiHAT design teasers, Sean Treichler and CJ Newburn
    • Going stateless
    • Resource handling
    • Memory abstraction and traits
    • Execution scopes
    • Please send mail to cnewburn@gmail.com if you're interested in more offline discussion/presentation on these topics.

More users, proof of concept plans, high-level design doc, June 20, 2017

Attendees (43): Antonino Tumeo, Ashwin Aji, BillF, David Bernholdt, DebalinaB, DGenet, Dmitry Liakh, Firo017, Gordon Brown, Hans Johansen, James Beyer, Jesun (PNNL), Jiri Dokulil, John Stone, Kamil Halbiniak and Roman, Kath Knobe, Keeran Brabazon (ARM), Mauro Bianco, Michael Garland, Mike Bauer, Mike Chu, Millad Ghane, Minu455, Naoya Maruyama, Piotr, Rob Neely, Ronak Buch, Ruyman Reyes, Samuel Thibault, SharanA (NVIDIA Tegra), Siegfried Benkner (U Vienna/StarPU), Stephen Olivier, Szilard Pall, Thomas Herault, Tim Blattner, Vincent Cave, Wael Elwasif (ORNL), CJ, ...

  • Some new participants: NIST/HTGS, UINTAH, StarPU, more from NVIDIA, e.g. automotive
  • Proof of concept
    • Review of POC plan doc (see Presentations)
    • John Stone, VMD and molecular orbitals
    • Some discussion of the benefits of dynamic scheduling
    • There's value in progressively backing off on the dynamism of scheduling, potentially based on profile-driven need - John, Szilard, Wilf
  • High-level design doc (see Presentations)

Mini-Summit Synthesis, May 16, 2017

Attendees

Carter, David Bernholdt, George, Max, Michael Garland, Michael Robson (PPL/UIUC); Millad, Patrick, Piotr, Thomas Herault, Dmitry, Szilard, Toby, Wael, Damien, Andrew, Ashwin, Jiri Dokulil, Naoya Maruyama, Oscar, Pietro Cicotti, Wilf, CJ
  • Welcome, intro
  • DHPC++ review
    • Compare/contrast with OpenCL, OpenVX, Vulkan
  • Mini-Summit review
    • Who gathered
    • Slides should be integrated, some updates
    • Overview
    • Tabulation of results
    • Review of poll/ratification
      • This broader audience also ratified what was listed
      • How do you connect different MPI worlds?
      • Clarify that HiHAT has to stage data across sub-clusters
      • Clarify granularity of work
    • Sampling of requirements
      • Active messages (PNNL, Andrew)
      • Futures with data (OCR, Vincent; HPX, Hartmut Kaiser)
      • Callbacks on completion (OCR, Vincent)
      • Dynamic compilation (R-Stream, KART, LLVM)
      • Graph reuse (SWIFT/QuickShed, Stephen) - later
      • Partial I/O (SWIFT/QuickShed)
      • Feedback for auto-tuning (TensorRT)
      • Reproducibility via control
    • Additional key issues to debate
  • Who else should be drawn in
    • OpenVX
    • Vulkan
    • StarPU
    • UINTAH
  • Topics for the future
    • Portability, content of tasks - Carter
    • Task scheduling for accelerators, SMP - Szilard
    • Interoperation, remerging with other efforts, e.g. OpenCL, OpenMP - Szilard
    • Performance analysis and monitoring APIs - Oscar
    • Defining terms, e.g. future

Partner interactions, NVIDIA usage models, AMD efforts, Apr 18, 2017

Attendees (37) included: Wilf, Kath Knobe, Millad Ghane, CJ Newburn, Bill Feiereisen, Dmitry Liakh, Gordon Brown, Jesmin Tithi, Hans Johansen, Jiri Dokulil, John Feo, Kelly Livingston, Max Grossman, Mauro Bianco, Oscar Hernandez, Patrick Atkinson, Piotr Luszczek, Ron Brightwell, Ruyman Reyes, Stephen Jones, Stephen Olivier, Sunita Chandrasekaran, Szilard Pall, Wael Elwasif, Ashwin Aji, George Bosilca, John Stone, Benoit Meister, Andrew Lumsdaine, Mike Bauer, Tim

CJ offered some recent highlights of partner interactions

  • ARM, AMD, IBM, NVIDIA engaging
  • User story, requirements and app updates
    • Jim Phillips, NAMD; John Stone, VMD; Ronak Buch on Charm++, David Richards on transport Monte Carlo
    • David Keyes, on categories of hierarchical algorithms
  • DHPC++ workshop, Toronto, May 16; will be a talk on HiHAT https://easychair.org/cfp/dhpcc17
  • Performance portability workshop, week of Aug 21, is expected to have some coverage of HiHAT

Upcoming HiHAT Mini-Summit

Teaser on NVIDIA usage models, Stephen Jones, NVIDIA

  • NVIDIA interested in HiHAT to broaden access of codes to resources in hetero platforms
  • Also for AI: deep learning and inference have available task-based parallelism
    • Offered some background on DNN, RNN
  • As the lower bound on task granularity drops, more task parallelism may be available
  • Two ways to leverage fine-grained tasks better:
    • reduce overheads for actions like invocation and moving data, instigated by CPU and performed on GPU --> lower-overhead Common Layer
    • aggregate tasks in sequences and sub-graphs, that are passed down to target for localized handling --> richer tasking abstractions
  • Common requirements induced by inference and deep learning for HiHAT
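
The benefit of the second lever above, aggregating tasks into sequences or sub-graphs, can be shown with a toy overhead model; the numbers are arbitrary units, not measured launch costs.

```python
# With a fixed per-submission overhead, aggregating a sequence of
# fine-grained tasks into one submission amortizes that overhead.
LAUNCH_OVERHEAD = 5.0   # cost charged once per submission (arbitrary units)

def run_individually(task_costs):
    # Each task pays the launch overhead separately.
    return sum(LAUNCH_OVERHEAD + c for c in task_costs)

def run_aggregated(task_costs):
    # The whole sequence is passed down as one unit for localized handling.
    return LAUNCH_OVERHEAD + sum(task_costs)

tasks = [1.0] * 10                    # ten fine-grained tasks
t_indiv = run_individually(tasks)     # 10 * (5 + 1) = 60
t_aggr = run_aggregated(tasks)        # 5 + 10 = 15
```

As the model suggests, the win grows as task granularity drops, which matches the observation above that a lower bound on granularity exposes more usable task parallelism.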

Teaser on relevant AMD efforts, Ashwin Aji, AMD

HiHAT overview, PaRSEC, Mar 21, 2017

Attendees included Wilf Pinfold, Benoit Meister, Patrick Atkinson, Schumann, George Bosilca, Piotr Luszczek, JimPhillips, Stephen Olivier, Max Grossman, Bill Feiereisen, Dmitry Liakh, Wael Elwasif, Jiri Dokulil, Gordon Brown, John Stone, Andrew Lumsdaine, Thomas Herault, Ronak Buch, Ashwin Aji, Bala Seshasayee, Michael Garland, Damien Genet, Aurelien Bouteiller, Oscar Hernandez, PSZ - Paul Szillard?, Timo (Blue Brain), Kelly Livingston, Antonio Tumeo, CJ, several more

CJ gave a HiHAT Overview

  • Progress in funding, e.g. from US government and vendors
  • Several posts to web, including from PASC, Charm++, VMD, Habanero tasking micro-benchmark suite
  • Upcoming report out on progress at GTC, morning of May 9 in San Jose
    • Usage models and requirements
    • Reveal initial progress on prioritized HiHAT interface design
  • Highlighted SW architecture of HiHAT, especially regarding pluggable modules, user layer with target-specific decision making with ease of use, and common layer that dispatches to target-specific implementations of actions
  • Call for more participation in identifying prioritized functionality of HiHAT to leverage, specific requirements and interfaces

George Bosilca of U Tennessee gave an overview of PaRSEC interaction with HiHAT

  • Data-centric programming environment based on async tasks executing on a hetero distributed environment
  • Offers a domain-specific language interface
  • Delivers good performance and scalability
  • SW architecture is based on modular component architecture of Open MPI, so it's quite amenable to plugging in HiHAT implementations for some of its functionality.
  • Prioritized wish list
    • Portable and efficient API/library for accelerator support - data movement, tasks
    • Portable, efficient and inter-operable communication library (UCX, libFabric, …)
      • Moving away from MPI will require an efficient datatype engine
      • Also supported by rest of the software stack (for interoperability)
    • Resource management/allocation system
      • PaRSEC supports dynamic resource provisioning, but we need a portable system to bridge the gap between different programming runtimes
    • Memory allocator: thread safe, runtime defined properties, arenas (with and without sbrk). (memkind?)
    • Generic profiling system, tools integration
    • Task-based debugger and performance analysis

Items for potential discussion and investigation

  • Enumeration - look at interaction with HWLOC
  • Dealing with unstructured data and data types
  • Data versioning
  • Serialized streams and subsequences of actions; may want cancellation
  • Resilience - detection, propagation
  • Interfaces for data movement, how that relates to MPI, collectives


OCR Review, Feb. 21, 2017

  • Wilf: Presentation material out on the wiki: OCR usage models is the one for today
  • Bala - OCR (Open Community Runtime), presents overview of OCR
  • Wilf: How do you decide on granularity of the task breakdown for AutoOCR? Is there some sort of input file?
  • Bala: Granularity is entirely the choice of the developer. AutoOCR is pretty straightforward - use a keyword to indicate that a task should be an EDT and annotate data blocks. Compiler will follow that and decorate with OCR API. It makes no decisions regarding granularity for itself. Compiler path is implemented in LLVM which looks at the keywords and generates OCR code.
  • Wilf: With MPI-Lite can you get some resiliency that you can't get from MPI?
  • Bala: That's interesting; we've not tried it. Resiliency & MPI-Lite have each been tried in isolation but not together.
  • Stephen Jones: How do people usually port to OCR?
  • Bala: People usually try to see if their MPI code can adapt to OCR. Will sacrifice performance while they see if they can implement in OCR. Some constructs like MPI_Wait are not aligned with OCR (which assumes an EDT can run to completion). Once people have adapted to OCR then there's no more reason to run MPI at all - they'll then restructure their program to reduce bottlenecks once they have a much better view of the dataflow graph.
  • CJ: What about continuation-style semantics.
  • Bala: A constant back-and-forth: should we stick to the "pure" model of no waits or stalls once a task has started? This would mean we need to split the task around a stall, but would also make data management complex between tasks. Some have looked at continuation semantics as a way to wait & context-switch within a task: moves the complexity into the runtime, which has to implement the continuation. Not many people have been trying this yet.
  • CJ: That's what Argobots & Qthreads are going after. HiHAT is looking to layer these on top of it to manage such continuations.
  • Bala presents on app requirements support
  • Wilf: What's performance looking like right now for e.g. MPI-Lite? How heavy is the task-based overhead at this time?
  • Bala: For MPI-Lite we've not put any effort into performance, because it's not trying to compete with MPI. OCR uses MPI for communication in this mode. Numbers look promising. At 16k cores OCR does not appear to perform any worse than MPI.
  • Wilf: How does resiliency play into this, if you've got 16k cores for example?
  • Bala: Not tried it at that scale yet. It will obviously slow things down. Has been tried out in isolation but not mixed together with performance yet.
  • Wilf: What about load-balancing? Was that 16k run fairly regular?
  • Bala: Again, have not yet tried this out in an application. In isolation, have used it at 64-node scale.
    • Have tried it out with Mini-AMR and seen some good results but still wrestling with heuristics that are needed. More heuristic intelligence does not seem to provide a lot of benefit because of the overhead of coming up with intelligent heuristics.
  • Stephen Olivier: Do you have any full-sized apps you have results for?
  • Michael Wong: Do you have a regular OCR call?
  • MW: Have you looked at any bottlenecks inside OCR?
  • Bala: One of the things we're already aware of is the GUID implementation. Making it globally unique can be expensive and in practice you don't always need it to be truly global around the cluster: you only need uniqueness spatially or temporally. Suggests two types of GUID: truly global, and then more local UID.
    • Can also probably shave off some overhead in event management (Legion has managed this, for example). You can often re-use events without the overhead of creation/destruction.
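
Bala's two-tier ID suggestion can be sketched as follows; the layout is hypothetical (a cheap node-local counter, promoted to a globally unique ID only when it must cross node boundaries).

```python
import itertools

class IdAllocator:
    """Two-tier IDs: local UIDs are cheap; a true GUID is built on demand."""

    def __init__(self, node_id):
        self.node_id = node_id
        self._next = itertools.count()

    def local_uid(self):
        # Cheap: unique only within this node, which often suffices when
        # uniqueness is only needed spatially or temporally.
        return next(self._next)

    def global_guid(self, uid):
        # Promote on demand: prefix with the node id (illustrative 32-bit
        # split) to make the ID unique across the cluster.
        return (self.node_id << 32) | uid

node3 = IdAllocator(node_id=3)
u = node3.local_uid()
g = node3.global_guid(u)
```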
  • Wilf: Here's where we are with the meetings
    • We've been using EventBrite for registration but it's getting a bit awkward. Trying to move over to MailChimp. We've got about 69 in the group (30 on the call today).
    • Everyone will receive an email in the next week for registration. Use that to register, not EventBrite, in future please.
    • Wiki will be kept with link to database of MailChimp info
  • CJ: Some higher comments & contexts
    • Upcoming talks will look at the apps/algos which will be layered on top of HiHAT.
    • Lots of good work in progress - appreciate people contributing and sharing
  • Michael Wong: One thing he's looking at is developing heterogeneous C++. If the group is interested he can send out some information about that. Also going to be running a workshop on ISO C/C++ and other high level heterogeneous C++ programming models here.
  • CJ: Want to look at these things and decide "would these be called BY HiHAT, or built on top of HiHAT?"
  • MW: Do have models which can build on top of HiHAT. Can have discussion at a later meeting.

Community meeting, Jan 17, 2017

Agenda

  • Welcome: Wilf Pinfold
  • Overview, purpose
  • Solicit apps that need hierarchical tasking
  • Solicit usage models
    • Fully dynamic to semi-static - Pall
  • Solicit user stories (requirements)
    • Map tasks to multiple GPUs - Dmitry
    • Granularity - Pall
    • Finite memory - Carter; see "Sandia" on Applications page
    • Distributed data structures in finite memory - Toby
    • For latency-sensitive apps, any overheads need to be offset by significant gains - Pall
    • Hierarchical topology - Toby
    • Building libs for finite physical memory; libs cooperating with caller, e.g. via callbacks - John Stone
    • Aggregated task groups, recursive task model that enables decomposition - Dmitry, Ashwin Aji
    • Data affinity-driven binding and scheduling and data decomposition - Pall
    • Move work to data vs. other way around - John
    • PGAS support, data affinity and decomposition - Toby
  • Housekeeping - Wilf

Participants included: Wilfred Pinfold - creator, John Stone, umit@gatech.edu, Wael Elwasif, xg@purdue.edu Xinchen Guo, belak1@llnl.gov, Ruymán Reyes, pa13269@bristol.ac.uk Patrick Atkinson, Max Grossman, gordon@codeplay.com, bala.seshasayee@intel.com, mbianco@cscs.ch, ashwin.aji@amd.com, khalbiniak@icis.pcz.pl - Kamil Halbiniak, roman@icis.pcz.pl - Roman Wyrzykowski, fabien.delalondre@epfl.ch, richards12@llnl.gov, pszilard@kth.se - Pall, Michael Wong, Shekhar Borkar, David Bernholdt, rabuch2@illinois.edu, bill@feiereisen.net, cnewburn@nvidia.com, Piotr Luszczek, liakhdi@ornl.gov, Muthu Baskaran, jesmin.jahan.tithi@intel.com, slolivi@sandia.gov, hcedwar@sandia.gov - Carter, fuchst@nm.ifi.lmu.de - Toby, rbbrigh@sandia.gov - Ron

Signed up, but seemed not to make it: timothy.g.mattson@intel.com, schulzm@llnl.gov, oscar@ornl.gov[conflict], mbauer@nvidia.com, romain.e.cledat@intel.com, aiken@cs.stanford.edu, mfarooqi14@ku.edu.tr, lopezmg@ornl.gov, Benoit Meister, vgrover@nvidia.com, kelly.a.livingston@intel.com, alexandr.nigay@inf.ethz.ch, matthieu.schaller@durham.ac.uk, manjugv@ornl.gov, esaule@uncc.edu, schandra@udel.edu, cychan@lbl.gov, gshipman@lanl.gov, mgarland@nvidia.com, vsarkar@me.com, Didem Unat, maria.garzaran@intel.com, john.feo@pnnl.gov, mike.chu@amd.com, timothee.ewart@epfl.ch, jim@ks.uiuc.edu, n-maruyama@acm.org, pcicotti@sdsc.edu, kk13@rice.edu, srajama@sandia.gov

Kickoff, Dec. 20, 2016

Agenda

  • Welcome: Wilf Pinfold
  • Overview, purpose
  • Approach
  • Wiki explanation
  • Next steps
  • Feedback, expression of interest

Participants (33) included

BillF, CarterE, DavidR, Erik, JimP, KamilH & RomanW, PatrickA, PietroC, SenT, ShekharB, CJ, WilfP, StephenJ, XinchenG, TimM, RomainC, OscarH, AlexandrN, VinodG, KathK, Ashwin Aji, JoshF, GalenS, ManjuG, PallS, MariaG, ...
See calendar entry, if you signed up

Discussion

  • Glossary suggested by Tim, try not to invent new definitions
  • Report suggested by Oscar - summary of usage cases could be useful for DoE
  • How do we keep from getting fragmented? (Tim) Try to bring the community together by focusing on common requirements (Wilf)
  • Start with usage models, requirements, provisioning constraints, rather than comparing and contrasting specific implementations
  • We have data and experience to share
  • Looking to have a phone meeting 3rd Tue each month at 9am PST; some here had standing conflicts; Wilf to try a Doodle poll
  • Time scale, involvement, outputs?
  • Are we sold on async tasking? Driven more by efficiency on HW? (Shekhar) Yes (Oscar) Who needs it for what? We need compelling examples of where mainline DoE apps need it. (Dave Richards) Clever use of MPI goes a long way (Tim)
  • MPI: resilience not well addressed (Wilf) Comparison with MPI is inappropriate, tasking can be done on top of MPI, e.g. two-hot, accelerated MD. It's about the benefit of a computational model, which helps some and not others. (Galen) Tim agrees that MPI is low-level runtime.
  • Interesting to identify a set of apps that embody tasking, and understand why they chose that model (Galen) Sounds like a potential value proposition (Shekhar).
  • Characteristics: granularity of tasks - the finer the granularity the less portable the solution, explicit vs. implicit control (DaveR) If task relationships can be described, it can become more portable (Stephen) How will decomposition happen - expert, compiler, runtime? (DaveR)
  • How do we make this applicable to large, portable code bases, enabling productivity? Where does the tasking model emerge? (DaveR)
  • What does it mean to have an async environment, what are the critical features? (Josh)
  • The way to resolving differences at various levels may lie in hierarchy (Kath) Strongly agree with hierarchy (Tim)
  • Strongly agree with a bottom up approach, with a hierarchical perspective (Tim)