This wiki page is used to keep minutes from phone and face-to-face meetings on the topic of usage models, user stories, and applications for heterogeneous hierarchical asynchronous tasking. Most recent meetings are listed on top. See [Presentations] for the ongoing agenda for monthly meetings and the materials that were posted.

Hedgehog, Alex Bardakoff and Tim Blattner, Mar 17, 2020

Participants include ()

TimB, Alex Bardakoff, DavidB, DmitryL, GeorgeB, JamesB, Jiri Dokulil, Jose Monsalve Diaz, MauroB, PiotrL, Szilard, WaelE, WalidK, Wilf, CJ

State
- Associated with each action
- Can have multiple state managers
Memory management
- Can stall and wait for another free using a condition variable
- Can have multiple memory managers
Graphs
- Profiling and graphical representation helps visualize bottlenecks
Safety
- Coherency. Want to use C++20 context object
Perf
- Scales linearly with threads; usec-level latency is tolerable
- Good performance on Windows
Applications
- Microscopy, ...
Follow up items
- Connecting multiple graphs, e.g. across nodes
- Memory management plumbing and gating
- Feedback on HiHAT's profiling, collaboration on a common profiling API
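
The memory management bullets above mention stalling until another buffer is freed, using a condition variable. A minimal sketch of that gating in standard C++ follows. `BufferPool`, `acquire`, `release` and `available` are hypothetical names chosen for illustration and are not Hedgehog's API.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool: acquire() stalls until a buffer is free.
class BufferPool {
public:
    BufferPool(std::size_t count, std::size_t bytes) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(std::vector<char>(bytes));
    }

    // Blocks on the condition variable while the pool is empty.
    std::vector<char> acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !free_.empty(); });
        std::vector<char> buf = std::move(free_.back());
        free_.pop_back();
        return buf;
    }

    // Returns a buffer to the pool and wakes one stalled acquirer.
    void release(std::vector<char> buf) {
        {
            std::lock_guard<std::mutex> lock(m_);
            free_.push_back(std::move(buf));
        }
        cv_.notify_one();
    }

    std::size_t available() {
        std::lock_guard<std::mutex> lock(m_);
        return free_.size();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<std::vector<char>> free_;
};
```

Having several such pools, one per memory kind, would correspond to the "multiple memory managers" bullet.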

CUDA Graphs characterization update, Feb 18, 2020

Participants include (9)

Wilf, Szilard Pall, Tim Blattner, Piotr, Stephen, Vinay Amatya, Wael Elwasif, CJ

Mail problems
- Many people didn't get either an invite or an email
- Some (e.g. Szilard) are getting MailChimp mails and calendar invites, but most others are not.
- Stephen: Outlook enables an iterative mail merge
- CJ: Jan event looked like a one-off vs. a series
- Wilf to pick a small set of people to come up with a workable solution

GROMACS - Szilard
- Thinks that a broader review could be of interest
- Has concerns about time availability over the next couple of months; similar for other teammates
- Would like an update from Alan/NV DevTechs working on this
- May be better to come back after some results
- Transitioning and modernizing a fairly large code base with the aim of encapsulating tasks and making them amenable to efficient task scheduling. So it could be the right time to enumerate requirements for expressing tasks and dependencies.
- Interested in what is possible and not

Usage model driven - CJ
- Identify leading examples, go deep
- First: high-level assessment of necessary features and challenges - can be done without a large investment, prerequisite that guides investment in next level of detail
- Next: low-level details of selected specifics
- Three levels: enabling of current features, incremental extensions to current features, big gaps that affect long-term architecture
- Others are interested in discussions on high-level requirements
- Interested in usage patterns so we can create worked examples for end to end solutions

Stephen on CUDA Graphs characterization
- Evaluation in the context of COMB public mini-app from LLNL
- Amdahl can kill concurrency
- Serial stream time is directly proportional to kernel length
- Some fixed overhead (15%) for multiple streams but mostly agnostic to kernel length
- CUDA Graphs incurs fixed overhead to set up graph, get config into the machine, worse than multiple streams
- CUDA Graphs reuse is a win over all of the above by at least 15%
- Graph Update reduces cost of updating parameters without changing the topology
- 33% reduction in execution overhead vs. concurrent streams with Graph relaunch
- use double-pointer dereference to avoid changing parameters in the graph itself
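
The last bullet above describes using a double-pointer dereference so that the graph itself need not change between launches. A rough host-only illustration of the idea follows. `RecordedGraph` and `record_scale` are hypothetical stand-ins rather than CUDA Graphs calls, and in a real GPU setting the pointer slot would have to live in memory the kernel can read.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for an instantiated graph: a fixed list of recorded operations.
struct RecordedGraph {
    std::vector<std::function<void()>> nodes;
    void launch() const { for (const auto& n : nodes) n(); }
};

// Records a "kernel" that scales n floats by k. The node stores the address of
// a pointer slot (float**), not the buffer address itself.
inline void record_scale(RecordedGraph& g, float** slot, std::size_t n, float k) {
    g.nodes.push_back([slot, n, k] {
        float* data = *slot;  // dereferenced at launch time, not at record time
        for (std::size_t i = 0; i < n; ++i) data[i] *= k;
    });
}
```

Retargeting the recorded work to a different buffer is then a single pointer store to the slot followed by a relaunch; the recorded node list is left untouched.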

Discussion
- Szilard on non-GPU tasks: haven't optimized CPU callbacks, which a future release targets
- Tim on traditional data flow: Processors aren't well suited for data flow. Many apps aren't really pure dataflow since they run consumer after producer is completely finished. Building a DF layer on top of this would be another future step.
- Tim has been trying to map into Hedgehog at NIH. Got 92% efficiency on GPUs. Perhaps combine CPU and GPU interaction via dataflow.

DaCe, CUDA Graphs, Jan 21, 2020

Participants include (18)

BillF, CJ, George, JeffL, Jesun Firoz, John Feo, Jose, Muthu, Naveen (HPE), Piotr, StephenJ, StephenO, Tal, Wael, Wilf

HiHAT Graphs on CUDA Graphs - slide preview: Media:HiHAT_Graphs_190916.pdf, Wojtek Wasko
- HiHAT Graphs working on top of CUDA Graphs.
- For now, CUDA Graphs APIs are called immediately and directly, vs. creating a graph independently and then instantiating it
- Can be generalized to support targets that don't support CUDA Graphs
- Can be extended to execute on remote targets
DaCe - Tal Ben-Nun
- Embedded C++ DSL for stateful parallel dataflow programming on stateful DFG
- Add state, including tasklets, arrays, map/exit for parametric abstraction, stream, consume/exit dynamic mapping, conflict resolution for writes
- Supports various accelerators
- Covers a variety of corner cases
- Includes a development environment that shows source, transformations, properties, generated code
- Perf
  - Favorable wrt MKL, Halide. Didn't unroll whole graph like Halide. Could have used systolic arrays.
  - 90% of CUTLASS
  - Know exactly what and when to copy - key advantage for FPGAs
- pip install dace
- Connections to HiHAT, CUDA Graphs
  - Currently using OMP parallel sections for CPUs, GPUs
  - Predates CUDA Graphs
  - Use HiHAT as a back end
- Questions
  - Analysis of space complexity?
CUDA Graphs in QMCPACK
- Application phases - variational Monte Carlo with and without drift, diffusion Monte Carlo
- Problem: very many small kernels, seeking to reduce launch overheads and increase concurrency
- Solution outline: Jared's executors with CUDA Graphs graph capture
- CUDA 10.2 enables updates after capture -> instantiation cost only paid on structural changes
- Nsight Compute now supports graph visualization
- Perf: 1.28x overall, 1.09x from graphs alone, with some as a wash
- Lessons learned
  - graph update makes things significantly simpler
  - app had much more reliance on default stream than expected; need to avoid that for concurrency
  - check for errors often to understand where they happen
  - more at GTC
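
The QMCPACK notes above say that, with updates after capture, the instantiation cost is only paid on structural changes. The decision logic can be sketched as below. `GraphDesc` and `ExecCache` are hypothetical types for illustration, and the counters simply stand in for the expensive and cheap paths; this is not the CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical description of a captured graph: structure plus per-node parameters.
struct GraphDesc {
    std::vector<std::pair<int, int>> edges;  // topology
    std::vector<double> params;              // one parameter per node
};

// Hypothetical executable-graph cache that counts how often each path is taken.
class ExecCache {
public:
    void submit(const GraphDesc& g) {
        if (!valid_ || g.edges != edges_ || g.params.size() != params_.size()) {
            edges_ = g.edges;   // structural change: pay for instantiation again
            ++instantiations_;
            valid_ = true;
        } else {
            ++updates_;         // same topology: cheap parameter update only
        }
        params_ = g.params;
    }
    int instantiations() const { return instantiations_; }
    int updates() const { return updates_; }

private:
    bool valid_ = false;
    std::vector<std::pair<int, int>> edges_;
    std::vector<double> params_;
    int instantiations_ = 0;
    int updates_ = 0;
};
```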

Graphs: HiHAT, QMCPACK, May 21, 2019

Participants include (22)

CJ, DavidB, DamienG, Dmitry, George, Hans, Jared, JeffL, Jesun Firoz (PNNL), JohnS, Jose, Millad, Oscar, Piotr, Siegfried Benkner (U Vienna, OCR), Stephen, Vinay, Wael, Walid, Wilf, Ying Wai Li

CUDA Graphs / HiHAT Graphs - CJ for Wojtek

"Update on HiHAT Graphs on CUDA Graphs"
Showed basic types and APIs
Have a simple POC working where HiHAT Graphs calls straight through to CUDA Graphs APIs to create nodes, edges, graphs, instantiate and invoke
In the next stage, we'll build up a graph in HiHAT and then instantiate or invoke all of it in CUDA Graphs
The HiHAT design is a little more general than current CUDA Graphs
- Supports

QMCPACK on CUDA Graphs: WIP snapshot - Jeff Larkin

"Use of CUDA Graphs in QMCPACK"
Moving a subset of kernels into CUDA Graphs to reduce launch latencies and increase concurrency
Used Jared Hoberock's Executors Prototype, which is on github
- standard CUDA launch vs. executor launch
- kernel functions vs. function objects
- native graphs vs. executor graphs
Now leveraging graph capture
Challenges
- Poorly optimized streams
- Lack of parameter capture
- Callbacks are expensive, can only pass back a void*. Have to do a deep copy.
- Host lambdas didn't always work with executor graphs when merged with other graphs
  - Synchronous execution may make memory lifetime management less of an issue, but as things move to be async, lifetime management can become an issue. This can be an issue especially wrt stack vs. heap memory usage.
  - The executor prototype garbage collection upon graph destruction may have issues.
  - C++ lambdas may capture more temp variables than the programmer realizes.
  - Jared is working on a revision of the prototype and may look at this problem.
  - CPU callback nodes in CUDA Graphs weren't built to be very sophisticated, Stephen is now interested in having another look. May need some auto tracking of lifetimes with a system like Jared's.
  - Elimination of data movement between GPU and CPU was a nice side benefit.
Status
- Significant effort went into getting rid of CPU dependencies
- Some tools issues got reported
Results
- Some gains without reuse: 9%
- Recapture API isn't released yet; experimenting with that, expecting instantiation overheads to improve
Future work
- Make graph fatter, which will make better use of GPU
- Planning to finish backporting kernels to native graphs and executors
- Expanding scope
Lessons learned
- Avoid host nodes for now
- Function overloading may break
- Watch out for copies to/from unpinned buffers - undefined behavior
- Jared's abstractions were much friendlier to write in
Feedback
- Dmitry: Consider capturing by value
- Dmitry: Consider reference counting

Graphs and interoperable building blocks, Apr 23, 2019

Participants include (24)

Antonino, CJ, Dmitry, Hans, Himanshu, James, Jeff, Jesun, Jiri, Jose, Marcin, MichaelG, Millad, Piotr, Siegfried, Stephen, Vinay, Walid, Wael, Wojtek, Ying-wei and others

"Toward Common and Interoperable SW Infrastructures - ECP Annual Meeting report out", George Bosilca, Mike Heroux and CJ Newburn

Increasing interest in sharing common and interoperable infrastructure under mashups that cross app domains
17 projects from DoE shared what primitives and services they want
Poll indicates fairly strong (theoretical) interest in sharing

CUDA Graphs update - Stephen Jones

Based on CUDA 10.1
- For three 2us kernels, there's a 53% overhead in the kernels-only case
- For the same sequence, reducing launch cost with reused graphs drops overhead to 46%
- And device-side execution overhead reductions get you to 37% overhead; that's a net 26% reduction
- With graph relaunch, CPU speedup can be 7+x and GPU-side speedup can be 1.4x for a straight-line graph
- Overheads per kernel launched: 2.1->0.29us in launch times, 1.57->1.11 per kernel
- Embedded mobile inference (Tu104): 6-11x CPU-side launch, .95-3x GPU execution side
Can use cudaStreamBegin/EndCapture and replay. Must be replayable. Capture doesn't actually run the captured code. This is different than tracing. Also captures parameters to kernels, size of memory copies, arguments to functions.
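
To make the capture-versus-tracing distinction above concrete, here is a small host-only sketch in which work enqueued during capture is recorded but not run, and can be replayed later. `Stream`, `begin_capture`, `end_capture`, `enqueue` and `replay` are hypothetical names that only mimic the shape of stream capture; they are not the CUDA runtime API.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stream: in capture mode, enqueue() records work instead of running it.
class Stream {
public:
    void begin_capture() { capturing_ = true; captured_.clear(); }

    // Ends capture and hands back the recorded operations as a replayable graph.
    std::vector<std::function<void()>> end_capture() {
        capturing_ = false;
        return std::move(captured_);
    }

    void enqueue(std::function<void()> op) {
        if (capturing_) captured_.push_back(std::move(op));  // recorded, not executed
        else op();                                           // normal path: runs now
    }

private:
    bool capturing_ = false;
    std::vector<std::function<void()>> captured_;
};

// Runs every recorded operation in the order it was captured.
inline void replay(const std::vector<std::function<void()>>& graph) {
    for (const auto& op : graph) op();
}
```

Each recorded closure carries its arguments with it, which loosely mirrors the note that capture also records kernel parameters and copy sizes.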

Graphs and tasking, Mar 12, 2019

Participants include (33)

Wilf, Jared, Jose, JohnB, JamesB, JohnS, DavidB, Himanshu Pillai (ORNL), JeffL, JoshS, KamilH, Roman, MikeC, MilladG, OscarH, PiotrL, StephenO, TomS, VinayA, WaelE, WalidK, YingWai Li (ORNL), Ashwin Aji, Paul Besl, SPuthoor, Jesun

Executors on CUDA Graphs, Jared Hoberock
- Explained theory, approach
- Showed sample implementation and results
- Described opportunities and limitations
- Open sourced the sample code here
"OpenACC CUDA Graphs", James Beyer
OpenMP tasking directions, Tom Scogland and James Beyer
- Looking at leveraging and working with systems like this, in the context of OpenMP
- Want OpenMP to work with C++ executors and graphs
- Replaying graphs may involve a significant refactoring of code outside of a loop to capture and replay
- Working at creating an interface that enables the creation of a reusable task
- Need to work out how to pass in parameters
- Graph construct might contain regions for OpenMP and OpenACC, they might do the update for us as long as it has the appropriate hooks.
- James: use of side table to track important info
- CUDA Graphs API can be mapped to by a compiler, but with limited control. The executor approach may lend itself to providing more control.
Report out on ECP Breakout, "Toward Common and Interoperable SW Infrastructures", George Bosilca, Mike Heroux and CJ Newburn - deferred to next time

Hierarchical decomposition, Oct 16, 2018

Participants include (21)

BillF, CJ, DavidB, Dmitry, HansJ, James, MarcinZ, Mehdi, MikeB, Oscar, Piotr, StephenO, Swaroop Pophale, Vinay, Wael, Wilf, Wojtek, Ying Wai Li

Graph gen

SLATE now
May be relevant for BMRG

Data partitioning

ExaTensor
Legion implements user decisions
PNNL

ExaTensor - Dmitry Liakh

Recursive tensors
Depth of recursion controlled by user, uniform across targets, stops at node level now
Deep recursion, for hetero nodes, not currently supported; HiHAT could be relevant there
Data partitioning function - single, provided by app, user adjustable parameters and filters
Work creation function - single, provided by app, guided by existing recursive data partitioning
Graph generation function - no explicit graphs now
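
The ExaTensor notes above describe a user-controlled recursion depth, a single data partitioning function, and a work creation function guided by that partitioning. A deliberately simplified 1-D sketch of that pattern follows. `Block`, `partition` and `create_work` are hypothetical names, and halving an index range is an assumption standing in for real tensor partitioning.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Block { std::size_t begin, end; };  // half-open index range [begin, end)

// Recursively halves [begin, end) until the user-chosen depth is reached or the
// range can no longer be split; the leaves are the units that work is created for.
inline void partition(std::size_t begin, std::size_t end, int depth,
                      std::vector<Block>& leaves) {
    if (depth == 0 || end - begin < 2) {
        leaves.push_back({begin, end});
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    partition(begin, mid, depth - 1, leaves);
    partition(mid, end, depth - 1, leaves);
}

// Work creation guided by the partitioning: one unit of work per leaf block.
inline std::vector<Block> create_work(std::size_t extent, int depth) {
    std::vector<Block> leaves;
    partition(0, extent, depth, leaves);
    return leaves;
}
```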

Adaptive Mesh Refinement - Mike Bauer

Principle: only model what matters - unstructured or structured but hierarchical meshes
Background on AMR
Hierarchy of data, computation, SW
Lends itself well to recursion
Mapping to machine resources is difficult

C++ Wrappers for HiHAT, Sep 18

Marcin Copik, ETHZ

Participants include (15)

David Bernholdt, Dmitry Liakh, Ferrol Aderholdt, James Beyer, Jose Monsalve, Marcin Copik, Millad Ghane, Muthu Baskaran, Wael Elwasif, Wilf Pinfold, Wojtek Wasko, Ying Wai Li, CJ

C++ wrappers for HiHAT

Not trying to cover all of HiHAT initially, WIP
With Tal Ben-Nun, who is traveling this week
Goals: simplify - cleaner, more robust, human readable; namespaces, RAII; analogous to SYCL vs. OpenCL
Namespaces
- hh:: and hh::experimental but not hh{u,c,n,h,e}
- maybe nesting, like hh::graph
- distinction between user and common layer? Encapsulation, ease of use, but also bare bones for performance. Fine to start with user layer and extend to common layer.
Wrapper classes
- No error handling
- Wrap C objects, query members, get it directly; optimized for most common scenario, e.g. defaults for common parameter values
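
As an illustration of the RAII goal and the "wrap C objects" bullets above, the sketch below wraps a C-style handle so that creation and destruction follow object lifetime. The `c_buffer_*` functions and the `hh_sketch::Buffer` class are hypothetical stand-ins written for this example; they are not HiHAT's C API or the proposed wrappers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Stand-ins for a C-style handle API (hypothetical names, not HiHAT's).
struct c_buffer { void* ptr; std::size_t bytes; };
static int g_live_buffers = 0;  // counts live C objects, for demonstration only
c_buffer* c_buffer_create(std::size_t bytes) {
    ++g_live_buffers;
    return new c_buffer{std::malloc(bytes), bytes};
}
void c_buffer_destroy(c_buffer* b) {
    --g_live_buffers;
    std::free(b->ptr);
    delete b;
}

namespace hh_sketch {
// RAII wrapper: the constructor creates the C object, the destructor releases it.
class Buffer {
public:
    explicit Buffer(std::size_t bytes = 64)  // default value for the common case
        : h_(c_buffer_create(bytes)) {}
    ~Buffer() { if (h_) c_buffer_destroy(h_); }

    Buffer(Buffer&& o) noexcept : h_(std::exchange(o.h_, nullptr)) {}
    Buffer& operator=(Buffer&& o) noexcept {
        if (this != &o) {
            if (h_) c_buffer_destroy(h_);
            h_ = std::exchange(o.h_, nullptr);
        }
        return *this;
    }
    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;

    std::size_t size() const { return h_->bytes; }  // query a member directly
    c_buffer* native() const { return h_; }         // escape hatch to the C handle

private:
    c_buffer* h_;
};
}  // namespace hh_sketch
```
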
Graph API
- Walked through sample codes for C API and corresponding C++ API
- Introduced | and & operators to capture serial or parallel dependence relationships among nodes -> easy to specify and read
- Overload function call operator, enhance simplicity for static case, to complement dynamic forms
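
The minutes do not record which operator denotes which relationship, so the sketch below assumes `|` means serial (like a shell pipe) and `&` means parallel. `Frag` and `node` are hypothetical names for illustration, not the proposed wrapper API.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// A graph fragment: its entry nodes, exit nodes, and the edges collected so far.
struct Frag {
    std::vector<int> heads, tails;
    std::vector<std::pair<int, int>> edges;
};

inline Frag node(int id) { return Frag{{id}, {id}, {}}; }

// ASSUMPTION: a | b means "b runs after a" (serial dependence).
inline Frag operator|(const Frag& a, const Frag& b) {
    Frag r{a.heads, b.tails, a.edges};
    r.edges.insert(r.edges.end(), b.edges.begin(), b.edges.end());
    for (int t : a.tails)
        for (int h : b.heads) r.edges.push_back({t, h});  // every exit of a feeds every entry of b
    return r;
}

// ASSUMPTION: a & b means "a and b may run concurrently" (no edges added).
inline Frag operator&(const Frag& a, const Frag& b) {
    Frag r = a;
    r.heads.insert(r.heads.end(), b.heads.begin(), b.heads.end());
    r.tails.insert(r.tails.end(), b.tails.begin(), b.tails.end());
    r.edges.insert(r.edges.end(), b.edges.begin(), b.edges.end());
    return r;
}
```

With these definitions, `node(0) | (node(1) & node(2)) | node(3)` describes a fork-join. In C++ `&` already binds tighter than `|`, so the parentheses are only there for readability.
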
Enum-based flags
- classes offer implicit scope, static type checking
- but could lead to less-robust interfaces if Enum compression (compiler switches) introduces incompatibilities wrt struct and parameter sizes
- Enum flags are often OR'd together as an int, maybe provide implicit conversion operator inside Enum class to avoid lots of typing
- Mitigated by having a header-only library, maybe make the change to use Enums in C++ only
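
One way to get typed flags that can still be OR'd together is a scoped enum with a fixed underlying type plus a free `operator|`. Scoped enums cannot carry member operators, so a free function stands in for the "conversion operator inside the Enum class" idea. `MemFlags`, `has` and `to_c` are hypothetical names for illustration, not HiHAT's.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag set; the fixed underlying type keeps its size independent of
// compiler enum-compression switches.
enum class MemFlags : std::uint32_t {
    None     = 0,
    ReadOnly = 1u << 0,
    Pinned   = 1u << 1,
    Shared   = 1u << 2,
};

// Lets callers write MemFlags::ReadOnly | MemFlags::Pinned without casts.
constexpr MemFlags operator|(MemFlags a, MemFlags b) {
    return static_cast<MemFlags>(static_cast<std::uint32_t>(a) |
                                 static_cast<std::uint32_t>(b));
}

// True if any bit of f is set in the flag set.
constexpr bool has(MemFlags set, MemFlags f) {
    return (static_cast<std::uint32_t>(set) & static_cast<std::uint32_t>(f)) != 0;
}

// Explicit conversion for handing the flags to a C API that takes an integer.
constexpr std::uint32_t to_c(MemFlags f) { return static_cast<std::uint32_t>(f); }
```
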
Error handling
- C API uses error codes; use the HIHAT_CHECK macro to handle errors
- C++ options
  - return only error codes - what if ctor fails?
  - exceptions - every function has to throw for consistency
  - return both value and error codes, which requires expected-like implementation
- Marcin and Tal: prefer exceptions
- MichaelG: adding exceptions to libs can cause problems later, and would be highly undesirable if a device-side interface were ever to be added
- CJ: it's conceivable that a HiHAT instance could some day run on a less-capable device like a GPU, in CUDA. I'd urge caution about precluding that.
- Dmitry: Can we plumb a C++ user function that wants to use C++ exceptions through a C API? He'll consider this, maybe offer an example.
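
The options discussed above can be compared side by side in a small sketch: a throwing wrapper, an expected-like result type, and a boundary function that keeps C++ exceptions from crossing C frames. All names here (`c_alloc`, `ApiError`, `Result`, `call_through_c_boundary`) are hypothetical and written for this illustration; none of them come from HiHAT.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Stand-in C-style call (hypothetical): returns 0 on success, nonzero on failure.
inline int c_alloc(int bytes, int* out_handle) {
    if (bytes <= 0) return 1;
    *out_handle = bytes;  // pretend the handle is just the size
    return 0;
}

// Option: exceptions. The wrapper turns a nonzero code into a throw.
struct ApiError : std::runtime_error {
    int code;
    explicit ApiError(int c)
        : std::runtime_error("C API error " + std::to_string(c)), code(c) {}
};
inline int alloc_or_throw(int bytes) {
    int h = 0;
    if (int rc = c_alloc(bytes, &h)) throw ApiError(rc);
    return h;
}

// Option: return both value and error code via a minimal expected-like type.
template <typename T>
struct Result {
    T value{};
    int error = 0;
    explicit operator bool() const { return error == 0; }
};
inline Result<int> alloc_result(int bytes) {
    Result<int> r;
    r.error = c_alloc(bytes, &r.value);
    return r;
}

// Boundary for a C++ callback invoked through a C API: catch everything and
// convert to an error code so that no exception crosses the C frames.
template <typename F>
int call_through_c_boundary(F&& user_fn) noexcept {
    try { std::forward<F>(user_fn)(); return 0; }
    catch (const ApiError& e) { return e.code; }
    catch (...) { return -1; }
}
```
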
Minor suggestions
- Sync clean-up: more-comprehensive destruction, e.g. collapse destroy, sync, free with async postComplete
Questions
- Preferred way of handling errors? Proposed exceptions, but want to consider broader retargetability. Need to work through examples with C++ implementations.
- Enums vs. ints in function parameters? Yes, in C++
- Header-only on top of HiHAT C lib? Yes
- How do we make programming easier? Graphs, shortcuts
- Target only common usage scenarios? Yes
Intending to provide a prototype implementation
We'll point more folks to this for broader review

Graphs: HiHAT and DMRG++, Aug 21

Participants include (22)

Arghya Chatterjee and YingWai Li and Oscar Hernandez (ORNL), CJ, Dmitry, JamesB, Jeff Larkin, Jose Monsalve, Marcin, Michael Garland, Mike Bauer, Monzan, Stephen Jones, Stephen Olivier, Szilard, Tal, Walid, Wilf, Wojtek, Wonchan

Graph API overview, Presentation

CJ Newburn and Wojtek Wasko
Stephen: Does support for concurrent launch belong in this interface? CJ: Maybe, but perhaps that's just the responsibility of the HiHAT client above.
Stephen: How about making sure that senders and receivers are ready, in case there's a large delay? CJ: Seems like this should be something that's required of implementations rather than the dispatch architecture, and relegated to the trait system.
Dmitry: what about critical sections? CJ: Sounds like OpenMP, wanting to work with Tom Scogland and others on OpenMP mapping.
Tal: Been working with HiHAT in their context, want to propose a C++ interface. They posted an example to Google drive
CJ: Several apps/runtime folks meeting weekly on support for graphs; feel free to join us and get access to detailed work on that

Graph characteristics for DMRG++, Presentation

Arghya (Ronnie) Chatterjee, Yingwai Li, Oscar Hernandez - ORNL
Density Matrix Renormalization Group (DMRG++)
Graph characteristics
- No dependencies among cells, but dependencies among tasks for reduction
- Graph may change over time or may depend on dynamic data
- Load imbalance across patches within matrices, gets worse in latter phases. OpenMP fork/join/barrier overhead can get horrible. Need a better way to manage dynamic scheduling; needs more investigation.

Dynamic tasking and HiHAT Graphs, July 17

Participants include (23)

CJ, David Hollman, Stephen, Oscar, Charles Jin, Damien Genet, David Bernholdt, Dmitry, Andrew, Antonino, Jeremy, Jesun Firoz, JohnB, Jose Monsalve (UDel), Millad, Sunita, Swaroop Pophale (ORNL), Vinay, Walid (NIST), Wilf, Muthu Baskaran

Requirements and design for HiHAT Graphs

Dynamic graphs

Dmitry, DavidH, Andrew, ...
Cases for dynamic addition
- Prior to instantiation
- After instantiation - continuation with same resources
- After instantiation - follow on with new resources
Cases for dynamic selection
- Superset of nodes on superset of resources
Discussion
- Static - known before instantiation vs. dynamic - structure not known before instantiation vs. semi-static which have multiple alternatives
- Andrew: May create a CSR matrix, stop and do an update, create new CSR and continue
- Dmitry: Additional stage for partial instantiation?
- CJ: Could/should node creation, instantiation and invocation be fully async so that a later phase can be pipelined
- Stephen: instantiation has computational complexity of O(# nodes); invocation is O(1) per graph
- Stephen: what if different iterations induce different subgraphs?
- Sunita: can partition a matrix into (tiles), may have special treatment for leading diagonal - dependencies across wave front; well explored for FPGAs, where different modules are created. Consider Smith Waterman, which is well implemented in CUDA but not with tasks.
- DavidH: Still thinking about cost trade-offs, e.g. can get exponential explosion with unions. In situ graph modification vs. up front "unionizing." Like MPI+X's two-layer programming model layered across different HW where the ability to express generically and make trade-offs in compiler or runtime is needed
- CJ: Could create a template subgraph, instantiate it, and generalize the template subgraph and its . DavidH: Jonathan Liflander had an SC18 paper submission on dynamic caching of common subgraphs.
- Jesun: level correcting algorithms may not wait for all predecessors before starting execution and they may iterate on updates until signaled to terminate; would that be supported? If it's a DAG, you can't get deadlock. He'll forward some more notes on this.
- Jesun: Termination determination is another concern. Trade-offs between managing that by runtime or app developer. Level-based algorithms always proceed forward. With general async, one can go back and forth across levels. He's willing to add some content to the Modelado site.
- JohnB: For creating more work from GPU, you could create an additional CPU node from which the Graph APIs could be called.
- Dmitry: Or this could be done with a CUDA callback.

Report out on C++ Layering on GPUs workshop, June 19

Participants in this June 19 phone meeting include (16):

Andrew, Antonino, CJ, Dmitry, Gordon, Jared, Jiri, Max, Michael, Mike Chu, Millad, Stephen, Tim Blattner, Walid, Wilf, Wojtek

Workshop participation (24 local, 7 remote)
- CSCS of Switzerland, German/Swiss universities, DoE Labs, NVIDIA, Codeplay
Overview
- Looked at how to layer C++ on CUDA Graphs, plumbing down through executors/futures, HiHAT
- Reviewed in light of application requirements
- 9 groups indicated interest in contributing code to this collaborative effort
Points of agreement
- Graphs are an abstraction of interest. It looks like graphs can be built up using the (revised) proposals for executors and futures. We need to work through examples to build confidence in this. Different executors will be necessary for static building or dynamic building of graphs.
- CUDA Graphs look interesting enough to try. Of particular interest are lowering overheads in support of fine granularity, graph reuse, graphs with control flow in them. We should collaborate on creating a set of proof-of-concept implementations of executors in support of CUDA Graphs.
- Some runtimes implement primitives for several targets and may benefit from HiHAT. They can try out HiHAT to see if it provides ease of use, simplicity, performance and robustness. Those interested can “spend a day” identifying a candidate use, studying the API doc and doing a trial run with HiHAT.
CUDA Graphs
- Benefits: reducing overheads for small tasks, repeated graphs. Resource management, dynamic control flow, CPU-less interop. Improved GPU utilization.
- Workload and framework reps rated various aspects of these benefits
Flow
- Create one or more graphs consisting of actions as nodes, with dependencies
- Partition into subgraphs by target; bind vertices and order them
- Augment with additional vertices as required to manage memory, data movement, sync
- Augment with additional vertices for interactions among graphs
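
Two of the flow steps above, partitioning by target and augmenting with data-movement vertices, can be sketched as follows. `TaskGraph`, `partition_by_target` and `add_copy_vertices` are hypothetical names for illustration, and binding each copy vertex to the consumer's target is an assumption made here, not something the minutes specify.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

enum class Target { CPU, GPU };

struct TaskGraph {
    std::vector<Target> target;              // target[v] = where vertex v is bound
    std::vector<std::pair<int, int>> edges;  // dependence edges (from, to)

    int add(Target t) {
        target.push_back(t);
        return static_cast<int>(target.size()) - 1;
    }
};

// Groups vertex ids by the target they are bound to.
inline std::map<Target, std::vector<int>> partition_by_target(const TaskGraph& g) {
    std::map<Target, std::vector<int>> parts;
    for (int v = 0; v < static_cast<int>(g.target.size()); ++v)
        parts[g.target[v]].push_back(v);
    return parts;
}

// Splits every edge that crosses targets with an extra data-movement vertex,
// bound (by assumption) to the consumer's target. Returns how many were added.
inline int add_copy_vertices(TaskGraph& g) {
    int added = 0;
    std::vector<std::pair<int, int>> out;
    for (const auto& e : g.edges) {
        if (g.target[e.first] == g.target[e.second]) { out.push_back(e); continue; }
        int copy = g.add(g.target[e.second]);
        out.push_back({e.first, copy});
        out.push_back({copy, e.second});
        ++added;
    }
    g.edges = std::move(out);
    return added;
}
```
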
HiHAT
- Can wrap CUDA Graphs, support multiple targets with potentially fewer restrictions and better affinity
- Enable portability for handling graphs as a whole, not just a collection of vertices
- Enable interactions among graphs, for same or different targets
- Working on APIs to wrap graphs
Discussion
- Andrew: There's some consideration of updating the interfaces for Boost Graph Library to a more modern version of C++. Could be beneficial to have some 2-way communication about this. Andrew to follow up.

C++, CUDA Graphs, May 15

Participants include (32):

Andrew, Andrew, BillF, Carter, DavidB, Damien, Dmitry, Walid, George, Hans, Hartmut, Jesun, Jiri, JohnB, JoseM, Louis Jenkins, Manju, MichaelG, MikeB, Millad, MarcinZ, Piotr, Siegfried, StephenJ, Szilard, Umit, Wael, Walid, Wojtek

C++ Directions - Michael Garland

Beyond parallelism to async, data coordination
Key ingredients
- Identify things: pointers, iterators, RANGES
- Identify place to allocate storage: allocators
- Identify place to execute threads: executors
- Identify dependencies: FUTURES
- Identify affinity: conforming INDEX SPACES for threads and data
C++17 parallel algos
- for_each(par - execution policy, begin, end, function)
- NVIDIA's Thrust library, for CUDA C++
Executors
- Multiplicative explosion between diverse control structures and execution resources
- Mediate access with a uniform abstraction
Asynchrony
- Async keyword
- Chaining - maximize flexibility, composability, perf [interoperability]
- Wish to not hide dependencies, to not inadvertently bind code execution
- Permits more implementations, including HW mechanisms and construction of graphs
Developments
- Generalized actions: invocation, mem mgt, data movement, sync
- Standalone ops --> predecessor actions in context --> description of input data
- Async/deferred vs. immediate

CUDA Graphs - Stephen Jones

Forthcoming feature in CUDA + some research ideas
Want more insight? Have questions? Let us know so you can join follow up sessions with more detail.
Graphs
- Per node, any GPU or CPU, fan-in/out to any degree, multiple root/leaf nodes
- Provide more context (semantic, resources) toward HW
- Define -> instantiate -> execute
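
The define -> instantiate -> execute split above can be mimicked on the host to show where the one-time cost goes. In this sketch instantiation is a topological sort over the definition, and execution is a plain replay of the resulting order. `GraphDef`, `ExecGraph` and `instantiate` are hypothetical names; real CUDA Graphs instantiation does considerably more than ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <stdexcept>
#include <utility>
#include <vector>

// Define: nodes are callables, edges are dependences between node indices.
struct GraphDef {
    std::vector<std::function<void()>> nodes;
    std::vector<std::pair<std::size_t, std::size_t>> edges;  // (before, after)
};

// Result of instantiation: a fixed launch order that can be replayed cheaply.
struct ExecGraph {
    std::vector<std::function<void()>> ordered;
    void launch() const { for (const auto& f : ordered) f(); }  // Execute
};

// Instantiate: one-time O(nodes + edges) step, here Kahn's topological sort.
inline ExecGraph instantiate(const GraphDef& def) {
    const std::size_t n = def.nodes.size();
    std::vector<std::vector<std::size_t>> succ(n);
    std::vector<std::size_t> indeg(n, 0);
    for (const auto& e : def.edges) {
        succ[e.first].push_back(e.second);
        ++indeg[e.second];
    }

    std::queue<std::size_t> ready;
    for (std::size_t v = 0; v < n; ++v)
        if (indeg[v] == 0) ready.push(v);

    ExecGraph exec;
    while (!ready.empty()) {
        std::size_t v = ready.front();
        ready.pop();
        exec.ordered.push_back(def.nodes[v]);
        for (std::size_t s : succ[v])
            if (--indeg[s] == 0) ready.push(s);
    }
    if (exec.ordered.size() != n) throw std::invalid_argument("graph has a cycle");
    return exec;
}
```
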
Concurrency
- More concurrent than streams, which are used for ordering with other work
Reduce overheads
- Invoke many actions vs. one - O(us)
- Reduce kernel-kernel latency - O(us)
- Building, e.g. resource binding, bookkeeping can be done offline
- Avoid centralized bottleneck as processing is distributed to targets
- Most relevant for many small kernels
Can't do with this
- Automatic placement; best choice may depend on data locality which is not known at execution layer
- Only execution vs. data dependences
- No splitting or merging of graph nodes
Can do with this
- Rapid re-issue
- Hetero node types
- Cross-device dependencies
Discussion
- JohnB: CPU code on GPU? Pre-bound, but flexible.
- Dmitry: dependence management
- Wojtek: HiHAT is retargetable by design, wraps the pluggable implementation which is CUDA Graphs. Supports interoperable implementations of sync, data movement in ways that support Michael's signal, evaluate, schedule, launch stages

DARMA: A software stack model for supporting asynchronous, data effects programming, Apr 17

Presenter Jeremy Wilke, Sandia

Attendees (31) included

Ashwin Aji (AMD), Benoit Meister, Carter Edwards (NV), David Bernholdt (ORNL), DamienG, DmitryL (ORNL), Hans Johansen (LBL), James Beyer (NV), Jeremy (Sandia), John Biddiscombe (CSCS), Kamil Halbiniak, Marcin Zalewski (PNNL), MauroB (CSCS), Michael Garland (NV), Michael Wong (Codeplay), Millad Ghane (Houston), Muthu Baskaran, Oscar (ORNL), RonB, Ruyman Reyes (Codeplay), Szilard, Wael Elwasif (ORNL), Walid Keyrouz (NIST), Wilf, Wojtek Wasko (NV)

Jeremy's talk

Express tasks with flexible granularity
- Elastic tasks
- Breadth first or depth-first
Relevant apps
- Dynamic load balancing: Multi-scale physics, PIC, tree search, AMR with fast shockwave
- Semi-static load balancing: PIC load balancing, block-based sparse linear solvers with irregular sparsity, AMR with slow shockwave
- Static, flexible granularity: tile-based linear algebra, FE matrix assembly, complex chemistry
Data effects
- extract concurrency, focus on locality and granularity (size, shape, boundaries)
- Permissions for immediate and scheduling: modifiable, read only, none
Metaprogramming within C++
- C++ wrapper classes AccessHandle<T>
  - Like a future, but no blocking get method
  - Required for dependence analysis
- Capture
- Task creation functions
Debugging
- Shared memory only vs. distributed, sequential
Layering/backend
- Charm++ (tested at scale), MPI/OpenMP (POC), HPX (LSU/Hartmut POC and Thomas), Kokkos (WIP), std::threads (done)
C++ activities
- Executors, futures, span/mdspan, atomics, deferred reclamation through hazard pointers and RCU
Induced requirements, layering
- Lower layers handed dependencies in a DAG
- DARMA data structures are operands of tasks

Discussion
- JohnB: how do these permissions go beyond C++ const? J: We use const
- JohnB: Observation - the richer the scheduler, the greater the complexity
- OscarH: can tasks also be data parallel, e.g. OpenMP? J: Yes. Significant engineering problems with nested parallelism. Express a cost/perf model for elasticity of tasks, not fully defined for DARMA.
- CJ: hierarchical approach? J: Left up to lower layers, as guided by perf models. Both DARMA and Kokkos have parallel_for. Lower layers do binding and ordering.
- CJ: DAG must be materialized in its entirety? What about dynamic task generation? J: Yes.

Asynchronous operations support in HiHAT, Mar 20

Presenter: Wojtek Wasko, NVIDIA

Attendees (24) included

Andrew, Ashwin Aji, BillF, CJ, Dave Bernholdt, Dmitry, Gordon, Jesun Firoz, John Biddiscombe, Jose Monsalve (UD), Marcin Zalewski, Marcin, Mauro, Michael Wong, Mike Bauer, Piotr, Szilard, Umit, Walid Keyrouz, Wojtek Wasko, Wael, Wilf

HiHAT Async Operations
- Background
  - Reviewed actions, action handles, sync objects
  - ActionHndl owned by HiHAT dispatch layer
  - SyncObject and its format, management, interaction owned by pluggable implementation
- Requirements
  - Action handles
    - Used to link according to dependences
    - Can logically combine (and/or) multiple ActionHndls into a single result - optimize for the common case of 1 reaching input dependence
    - Can query state
    - Can obtain underlying (post-dominating) sync object; debate: blocking or non-blocking wrt pluggable implementations
  - Sync objects
    - Based on object's description or type
    - Must be able to completely bypass HiHAT and enable direct communication with legacy code
- Samples
  - Provided in a CUDA context; we welcome examples and contributions for other architectures
- Discussion
  - Support full async, e.g. in querying sync object that may not have been provided by underlying pluggable implementation?
  - Can support both blocking query for sync object and non-blocking "is it available yet" API
  - MPI erred on the side of adding more APIs, and had both blocking and non-blocking APIs
  - Can always have a blocking implementation
  - Preference was for a richer API
  - Wilf: can alloc, movement, invocation, and code loading be done in a deferred fashion? We might wish to not over-strain memory capacity for code and data, for example. CJ: Yes, everything is async and the underlying pluggable implementation makes those things happen when ready. There can be backpressure at enqueue time, but we don't have a sample implementation of that yet.
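
The requirements above mention logically combining several action handles (and/or), querying state, and optimizing for the common case of a single input dependence. A minimal sketch of that shape follows. `Handle` and `Combined` are hypothetical stand-ins, not HiHAT's ActionHndl types, and only the non-blocking query is shown.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an action handle: copies share one completion flag.
class Handle {
public:
    Handle() : done_(std::make_shared<std::atomic<bool>>(false)) {}
    void complete() { done_->store(true, std::memory_order_release); }
    bool is_ready() const { return done_->load(std::memory_order_acquire); }  // non-blocking

private:
    std::shared_ptr<std::atomic<bool>> done_;
};

// Logical combination of several handles into one queryable result.
class Combined {
public:
    enum class Mode { All, Any };
    Combined(Mode m, std::vector<Handle> in) : mode_(m), in_(std::move(in)) {}

    bool is_ready() const {
        if (in_.size() == 1) return in_[0].is_ready();  // fast path: one input dependence
        for (const Handle& h : in_) {
            bool r = h.is_ready();
            if (mode_ == Mode::Any && r) return true;
            if (mode_ == Mode::All && !r) return false;
        }
        return mode_ == Mode::All;  // All: none failed; Any: none was ready
    }

private:
    Mode mode_;
    std::vector<Handle> in_;
};
```

A blocking query could be layered on top of `is_ready()`, for example by polling or by pairing the flag with a condition variable, which is in line with the preference noted above for offering both forms.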

Upcoming community activities
- Workshop at CSCS in Zurich June 10-11. Likely focus is layering for async tasking, e.g. C++/HPX/maybe HiHAT/CUDA Graphs. Let us know if you're interested. John Biddiscombe of CSCS is hosting.

Hierarchy, Part II Feb 20

Attendees included

CJ, John Stone, Antonino, Gordon, Kamil, Roman, Marcin, Mauro, Michael Garland, Michael Wong, Millad, Piotr, Siegfried, Szilard, Walid, JamesB, Mauro, AndrewL, David Bernholdt, Dmitry, Jose, Oscar, MarcinZ, Stephen, rfvander, manz551, Wael Elwasif, Umit Catalyurek, John Biddiscombe, Wilf, Ashwin

Recent community activities
- LifeSci Tech Summit at NVIDIA - discussions about CUDA Graphs
- OpenMP - interest in implementing HiHAT under OpenMP
- CSCS - considering a workshop on layering infrastructure under C++, e.g. HPX / HiHAT / CUDA Graphs. John Biddiscombe extended an invitation to help prepare for that.
- ECP Annual meeting - HiHAT was mentioned a couple of times as a possible candidate for underlying infrastructure, several side meetings, including for memory and storage interfaces and OpenMP

Possible approaches for hierarchical dense linear algebra kernels (Vivek)
- Express code simply - flat or polyhedral frameworks
  - AndrewL: is template metaprogramming considered simple? Vivek: was focused on representation vs. linguistic syntax, which is orthogonal
- Explore space of data distributions - hierarchical, block/cyclic, AoS/SoA
- Explore hierarchical decompositions - async, hierarchical place trees
  - Can place onto various places in the hierarchical tree, thereby separating semantics from perf tuning
- Explore affinity hints/declarations - temporal locality

ExaTensor: hierarchical processor of hierarchical tensor algebra - Dmitry Lyakh
- Hierarchical decomposition
- Replicate with prefetch
- Currently focused on block sparse, hierarchical, with no predefined regular pattern -> heuristics and predetermined data mappings

"Longing for Portability of Performance" (Weather/climate) - Mauro Bianco
- Single source code
- Stencils with complex dependencies
  - 10s-100s per time step, fairly big tasks but don't necessarily fill a GPU, more than one thread
  - conditional execution, limited halo lines
- Data parallel and task-oriented
  - From a dev point of view
  - User-managed granularity is not portable - auto *splitting* isn't generalizable
  - Aggregation seems more applicable - thread switching -> function calls, inlining. Still not universally optimal, e.g. parallel scan
  - Express the *finest granularity*, else function call too big, coarsening with inlining is possible. Inspector/executor model? May need new *algorithmic patterns/motifs* to make data parallel, e.g. with a keyword.
  - CJ: what are the prospects for expressing parallelism and data in a hierarchical way to a scheduler which can decompose as needed, vs. reactively? Limitations wrt particular application domains? Mauro: Sequoia tried to do some of this; limited progress. Vivek: compiler support helps with granularity, runtime helps with the mapping; expressing at fine-grained level
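
The "express the finest granularity, then coarsen" idea can be sketched as follows. This is only an illustration under the assumption that fine-grained tasks are independent callables; `coarsen` and `grain` are hypothetical names, not part of any system discussed here.

```python
def coarsen(tasks, grain):
    """Group consecutive fine-grained tasks into chunks of at most `grain` tasks."""
    if grain < 1:
        raise ValueError("grain must be >= 1")
    chunks = [tasks[i:i + grain] for i in range(0, len(tasks), grain)]

    def make_coarse(chunk):
        # Running a chunk is a plain loop of function calls: no per-element scheduling.
        return lambda: [t() for t in chunk]

    return [make_coarse(c) for c in chunks]


# Finest granularity: one task per element.
fine = [lambda x=x: x * x for x in range(10)]
coarse = coarsen(fine, grain=4)  # 3 coarse tasks covering 4, 4 and 2 elements
results = [r for task in coarse for r in task()]
```

The grain is chosen by whoever calls `coarsen` (a tuner, runtime or compiler), so the application code itself stays at the finest granularity.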

Looking forward
- Transition from static to dynamic

Data management interfaces: What we have now, user requirements, what we need, Jan 16, 2018

Attendees included Wilf, George Bosilca, Gordon Brown, Jeff Larkin, Millad Ghane, John Stone, Jose Monsalve, Damien Genet, Michael Garland, Stephen Olivier, Bill Feiereisen, James Beyer, Antonino Tumeo, David Bernholdt, John Feo, Marcin, Szilard Pall, manz551, Siegfried Benkner

Presentation

Interest in enumeration of resources: Gordon @ Codeplay

Feedback

PNNL has a project for HIVE on Exascale platforms that may include GPUs. John Feo: Content here is relevant for scheduling, data marshaling.
Umpire/CHAI guys not here today, but evaluating that
Multi-dimensional support (George): may be done above the HiHAT layer, these interfaces seem adequate
What about thread safety? (George)
- Often use memory contexts that are shared among threads on the same socket
- HiHAT clients (above) can coordinate between what's been registered, e.g. a thread-safe implementation, and what that client wants
New implementations could be registered dynamically

Hierarchy, Nov 21

Attendees included: James Beyer, Jeff Larkin, Max Grossman, Vivek Sarkar, David Bernholdt, Dmitry Lyakh, Michael Garland, Mike Bauer, Piotr, Ruyman Reyes, Siegfried Benkner, Stephen Olivier, Szilard Pall, Wael Elwasif, Wilf, Kamil Halbiniak & Roman, Millad Ghane, CJ Newburn

Vivek Sarkar, GA Tech
- Habanero, CnC, OCR
- Places - can distribute data, can use type system to distinguish between local/global
- Locality-aware scheduling using hierarchical place tree
  - Different abstractions for diff HW, e.g. diff # levels and kinds of memory
  - Affinity annotations - can express preferences vs. hard assignments
  - Can pass abstract work down the hierarchy, do work stealing at lower levels
  - Supports spatial and temporal sharing
- Undirected graph, not just a tree; use trees where profitable
- Multiple levels of parallelism and heterogeneity
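
A minimal sketch of locality-aware scheduling over a place tree, assuming a simple policy: a worker bound to a leaf takes work from its own place, then from ancestors (work passed down the hierarchy), then steals from siblings. The `Place` class and its methods are hypothetical and do not reflect the Habanero API.

```python
from collections import deque


class Place:
    """A node in a hypothetical hierarchical place tree (e.g. node > socket > core)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.queue = deque()
        if parent is not None:
            parent.children.append(self)

    def submit(self, task):
        # Affinity hint: the task prefers this place, but it is not a hard assignment.
        self.queue.append(task)

    def next_task(self):
        """Called by a worker bound to this place."""
        # 1. Own queue first, then walk up: work queued higher up is passed down.
        place = self
        while place is not None:
            if place.queue:
                return place.queue.popleft()
            place = place.parent
        # 2. Otherwise steal from a sibling at the same level.
        if self.parent is not None:
            for sibling in self.parent.children:
                if sibling is not self and sibling.queue:
                    return sibling.queue.pop()  # steal from the opposite end
        return None


node = Place("node")
core0 = Place("core0", node)
core1 = Place("core1", node)
node.submit("any-core task")
core1.submit("core1 task A")
core1.submit("core1 task B")
```

Here `core0` first picks up the task queued at `node`, and only then steals from `core1`, which is the preference-rather-than-assignment behavior described above.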
Dmitry Lyakh, ORNL
- ExaTensor, a distributed tensor library based on hierarchical data representation
- Adaptive dynamic block-sparse representation of many-body tensors
  - Resolution of each block is dynamically adjustable
- Recursive definition of storage, computational resources.
  - Data centric - induces task decomposition. But tasks can follow data and be aggregated. Data storage granularity and task granularity are decoupled.
  - Computational resources are encapsulated as virtual processors - can do linear algebra, tensor ops
- Targeted for Summit
Mike Bauer, NVIDIA
- Legion
- Adaptive mesh refinement, algebraic multigrid
  - Very dynamic behaviors - partitioning changes, data created and destroyed at runtime, depends on domain-specific knowledge
  - Folks at LANL now working on AMR in Legion
- Hierarchy
  - Levels that correspond to levels of details; may correspond to different data structures
  - Hierarchical decomposition of the same data structures into different levels
- Requirements
  - Want primitives for describing partitions efficiently and effectively - can be error prone
  - Capture descriptions in DSL that apply across many different kinds of apps
Discussion
- Provisional/on-demand decomposition
  - ExaTensor: User specifies work to do in high-level DSL. Automatic decomposition, e.g. based on maintaining arithmetic intensity, data transfer bandwidths. Not much to trade off since limited by DSL. Expect a need for further generalization as the scope of apps is expanded.
  - Legion: Often 2 ways to decompose - 1) breadth, across nodes or processes within node, 2) depth, e.g. NUMA, across GPUs or SMs. Both may need to change dynamically and that needs to be efficient. No one size fits all. Apps describe the partitioning algorithms. Mappers pick the best decomposition for a given target. Tunable parameters are specified to the mapper.
  - Vivek: dynamic code generation has a part to play; specialize to runtime-determined data distribution; has a student experimenting with this on CNNs for inner loop bodies wrt data characteristics
- Follow up
  - Pick some concrete examples and work them through
  - Trade-offs based on cost models, e.g. recompute vs. fetch halo data
    - MikeB: Recomputing could make code complexity high, hard to maintain. Could be relevant for Halide.
    - Dmitry: compression could be a factor
    - Jim Demmel @ Berkeley: Communication-avoiding algorithms
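
As one concrete example to work through, the recompute vs. fetch trade-off can be written as a first-order cost model. This is a sketch under simplifying assumptions (latency plus bandwidth for the transfer, peak rate for the recompute); the function and parameter names are hypothetical.

```python
def halo_strategy(halo_bytes, latency_s, bandwidth_bytes_per_s,
                  recompute_flops, flops_per_s):
    """Pick the cheaper of fetching halo data or recomputing it locally."""
    fetch_time = latency_s + halo_bytes / bandwidth_bytes_per_s
    recompute_time = recompute_flops / flops_per_s
    choice = "recompute" if recompute_time < fetch_time else "fetch"
    return choice, fetch_time, recompute_time


# 8 MB halo over a 10 GB/s link with 1 us latency, on a 1 TFLOP/s device
print(halo_strategy(8e6, 1e-6, 1e10, 1e9, 1e12)[0])  # fetch
print(halo_strategy(8e6, 1e-6, 1e10, 1e8, 1e12)[0])  # recompute
```

Compression, as Dmitry noted, would enter this model by shrinking `halo_bytes` at the price of extra compute on both ends.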

Comparison SyCL and HiHAT; Supporting Hetero and Distributed Computing Through Affinity, Michael Wong, Oct 17

Attendees, including:

Michael Wong, Andrew Lumsdaine, Carter Edwards, Jiri Dokulil, Kath Knobe, David Bernholdt, George Bosilca, Gordon Brown, Jeff Larkin, Marcin Zalewski, Mauro Bianco, Max Grossman, M Copik, Michael Garland, Piotr, Ruyman Reyes, Siegfried Benkner, Thomas Herault, Wael Elwasif, Wilf, rfvander, Stephen Olivier, Szilard Pall, Wojtek Wasko, James Beyer, Sean Treichler, Bill Feiereisen, Dmitry Liakh, Jesun Firoz, dgenet, Millad, OscarH, Ashwin Aji

SYCL
- Retargetable from C++, entirely standard C++ (no keywords, no pragmas)
- Single-source host and device compilation model
- Separate storage and access of data, specify where data stored/allocated
- Task graphs
- C++ lacks ABI, have to provide symbol name for kernel, could be improved with static reflection
- Several apps on top of SYCL now: vision for self-driving cars, ML with Eigen and Tensorflow, parallel STL with ranges, SYCL-BLAS, Game AI
- Comparison with Kokkos, HPX, Raja
  - All: C++11/14, execution policies to separate concerns, shape of data, aiming to be subsumed by future C++
  - SYCL: mem storage vs. data model, dependence graph, single source/multi-compilation
  - Kokkos: mem space, layout space
  - HPX: distributed computing nodes, execution policy with executors
  - Raja: IndexSet, segments
- Proposing C++2020 hetero interface
- SYCL for HiHAT
  - SYCL aligning more with C++ futures/executor/coroutines
  - Exploring HiHAT vs. OpenCL as low-level interface, layered below ComputeCpp, which is target agnostic
  - Plug in binary blobs for vendor-specific components
  - Async API
  - Enumeration of device-specific capabilities
  - Time-constrained ops, for safety-critical SYCL
    - In context of safe and secure C++
    - Removing ambiguities and undefined behaviors
    - Componentized, multi-layer, well tested
  - Could use for alloc, copy, invoke
- Codeplay biz and HiHAT
  - Licensing, IP protection
  - HiHAT in their stack
  - Certification for HiHAT-compliant devices/implementations

SC17 BoF on Distributed/Hetero C++ in HPC
Workshop on Distributed/Hetero Programming in C++, IWOCL, Oxford
C++ P0796R0: Support Hetero and Distr Computing Thru Affinity
- Resource querying - dynamic? hwloc, which is primarily hierarchical?
- Binding and allocation
- Affinity - relative? migration?
- CJ: Please extend to support an async interface to allocation and affinitization
- Wilf: app developer manages policies like affinity?

Recent developments, profiling, OpenMP, enumeration and other features, Sep 19

Attendees, including:

Jeff Larkin, Francois Tessier, Ronak Buch, Marc Snir, Piotr, Samuel Thibault, Wael Elwasif, Wilf, Carter Edwards, John Stone, Benoit Meister, UD CIS, Marcin Zalewski (PNNL), Walid Keyrouz, Firo, Sunita Chandrasekaran, James Beyer, Stephen Olivier, Galen Shipman, Hartmut Kaiser, Jesun, Jiri Dokulil, Stephen Jones, Mike Bauer, Millad Ghane, Siegfried Benkner

Recent developments
- Repo with CLA being readied at Modelado
- Good feedback and interactions at DoE Perf Portability Workshop in August, led to more of a common view, more collaboration
- HiHAT paper accepted at the PMES (Post-Moore's Era Supercomputing) workshop at SC17
- Expecting a 2-hour meeting on HiHAT at SC17 to share progress, usage and plans for HiHAT
- Affinity BoF @ SC17 (Emmanuel Jeannot) looks to be relevant; let's plan some pre-work
Profiling (Samuel Thibault)
- What can be profiled, what can be done (callbacks now, counters coming)
- Request: share input on additional states that should be profiled
- Walked through an example with different allocations, copies, invocation on CPU, GPU
- Showed integration with StarPU, which uses paje.trace and ViTE from its tool suite
OpenMP above and below HiHAT (James Beyer)
- Use omp parallel for inside a task. Can warm up an OpenMP hot team with a separate invocation off of the critical path.
- Replacement of GOMP_parallel with HiHAT trampoline - HiHAT-based scheduler for improved retargetability
- OpenMP affinity - progressing from HiHAT ignoring it, being informed about it, and invoking from one set of threads to another set
- Resource subsets
- Looking at trying this out in Clang
Enumeration and other features in HiHAT (CJ Newburn)
- Currently using inlined source code with static initializers, one for each kind of platform
- Integrating with hwloc to automate that. Will support NUMA nodes, CPUs and memories that hang off of them, accelerators (FPGA, GPU, DLA, PVA, etc.) and their memories
- Plugin of target-specific implementations
  - Each API, at user and common layer
  - Allocator, at each memory instance
  - Selected by resources (e.g. copy endpoints) and a chooser
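
The plugin mechanism above can be pictured as a registry keyed by the resources involved, with a chooser that picks among the registered candidates. The following is only a sketch of that idea; `CopyRegistry`, its methods and the memory-kind strings are hypothetical and are not the HiHAT API.

```python
class CopyRegistry:
    """Hypothetical registry of copy implementations keyed by (src, dst) memory kinds."""

    def __init__(self, chooser=None):
        self._impls = {}
        # The chooser picks among candidates; by default, the most recently registered.
        self._chooser = chooser or (lambda candidates: candidates[-1])

    def register(self, src_kind, dst_kind, impl):
        # Target-specific implementations can be plugged in at any time.
        self._impls.setdefault((src_kind, dst_kind), []).append(impl)

    def copy(self, src_kind, dst_kind, data):
        candidates = self._impls.get((src_kind, dst_kind))
        if not candidates:
            raise LookupError(f"no copy implementation for {src_kind} -> {dst_kind}")
        return self._chooser(candidates)(data)


registry = CopyRegistry()
registry.register("host", "host", lambda d: list(d))
registry.register("host", "gpu", lambda d: ("staged-to-gpu", list(d)))
print(registry.copy("host", "gpu", [1, 2, 3]))  # ('staged-to-gpu', [1, 2, 3])
```

The same pattern would apply per API and per allocator: the key changes (e.g. a memory instance instead of a pair of copy endpoints), while the chooser stays a separate, replaceable policy.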
Solicitations
- Language layering, e.g. task implementations
- Memory management, e.g. allocator tradeoffs - per-thread nurseries, same-size allocators, etc.
- Thread management, e.g. Qthreads, Argobots
- Client layering, e.g. SHMEM

StarPU and HiHAT; HiHAT design ideas - Aug 15, 2017

Attendees (33), including:

Andrew Lumsdaine, Antonino Tumeo, Ashwin, Benoit Meister, CJ Newburn, D Genet, George Bosilca, Gordon Brown, James Beyer, Jose Monsalve, Kath Knobe, Marc Snir, Max Grossman, Michael Garland, Millad Ghane, Naoya Maruyama, Oscar Hernandez, Pall Szilard, Piotr, Ronak Buch, Samuel Thibault, Siegfried Benkner, Stephen Jones, Stephen Olivier, Sunita Chandrasekaran, Thomas Herault, Wael Elwasif, Wilf Pinfold, Wojciech Wasko

StarPU, Samuel Thibault, INRIA
- Supports sequential semantics, general retargetability, out of core, clusters, memory consumption control
- Can simulate execution - "as if" performance
- HiHAT wish list
  - Low-overhead interface to HW layers
  - Reusable components - perf models, allocators, tracing, debugging
  - Task-based interface, plus OpenMP, helpers for outlining and marshaling
  - Interoperability
  - Shared event management - user-defined events, interop with what's not covered by HiHAT
  - Memory alloc - uniform low-level API, efficient sub-allocator, same-size memory pools, hierarchical balancing
  - Disk support: store/key/value
  - Prioritized actions
- Some responses
  - Very good alignment with HiHAT
  - Pluggable implementations can provide memory management; tuner can provide an allocator per memory device, or share them
  - Event system will provide abstractions over implementation-specific events and semaphores in memory
  - Simulation could be an interesting service above HiHAT
    - Something above HiHAT is in control
    - Emulation/simulation above HiHAT acts as if the actions really happened
- Possible joint investigations
  - Whether anything special is needed for specialized allocators, memory load balancing
    - Interested: Samuel, Marc, Benoit, George, Ashwin, Max
    - Compare with MPI allocators that deal with fixed size and conflict avoidance, consider progress thread implications, developments in MPI 4 endpoints (Marc Snir)
  - Tracing formats and debugging
    - Interested: Samuel, Max
- Implementation and interface exploration
  - Prioritized actions
  - Integration of MPI wait with action dependence system
    - Interested: Marc, Jesus, Samuel, Andrew, George
- Discussion
  - How do you handle running out of memory? See paper on memory control on StarPU website. Increase the granularity of what's submitted for execution.
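
For the "same-size memory pools" item on the wish list, a minimal sketch of such a pool is shown below. It assumes one preallocated buffer and a free list of offsets; `FixedSizePool` and its methods are hypothetical and are not taken from StarPU or HiHAT.

```python
class FixedSizePool:
    """Hypothetical same-size block pool carved out of one preallocated buffer."""

    def __init__(self, block_size, num_blocks):
        self.block_size = block_size
        self._buffer = bytearray(block_size * num_blocks)
        # Free list of block offsets: alloc and free are O(1), with no fragmentation.
        self._free = [i * block_size for i in range(num_blocks)]

    def alloc(self):
        """Return the offset of a free block, or None if the pool is exhausted."""
        return self._free.pop() if self._free else None

    def free(self, offset):
        """Return a block to the pool so it can be reused."""
        self._free.append(offset)

    def view(self, offset):
        """Writable view of the block that starts at `offset`."""
        return memoryview(self._buffer)[offset:offset + self.block_size]


pool = FixedSizePool(block_size=64, num_blocks=2)
a = pool.alloc()
b = pool.alloc()
exhausted = pool.alloc() is None  # True: the caller must wait, evict, or submit less work
pool.free(a)
```

Returning `None` on exhaustion leaves the policy to the layer above, which ties into the out-of-memory discussion: that layer can throttle submission or change task granularity.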

HiHAT design teasers, Sean Treichler and CJ Newburn
- Going stateless
- Resource handling
- Memory abstraction and traits
- Execution scopes
- Please send mail to cnewburn@gmail.com if you're interested in more offline discussion/presentation on these topics.

More users, proof of concept plans, high-level design doc, June 20, 2017

Attendees (43): Antonino Tumeo, Ashwin Aji, BillF, David Bernholdt, DebalinaB, DGenet, Dmitry Liakh, Firo017, Gordon Brown, Hans Johansen, James Beyer, Jesun (PNNL), Jiri Dokulil, John Stone, Kamil Halbiniak and Roman, Kath Knobe, Keeran Brabazon (ARM), Mauro Bianco, Michael Garland, Mike Bauer, Mike Chu, Millad Ghane, Minu455, Naoya Maruyama, Piotr, Rob Neely, Ronak Buch, Ruyman Reyes, Samuel Thibault, SharanA (NVIDIA Tegra), Siegfried Benkner (U Vienna/StarPU), Stephen Olivier, Szilard Pall, Thomas Herault, Tim Blattner, Vincent Cave, Wael Elwasif (ORNL), CJ, ...

Some new participants: NIST/HTGS, UINTAH, StarPU, more from NVIDIA, e.g. automotive
- Tim Blattner presented slides (see Presentations)
Proof of concept
- Review of POC plan doc (see Presentations)
- John Stone, VMD and molecular orbitals
- Some discussion of the benefits of dynamic scheduling
- There's a value to progressive back off on the dynamism of scheduling, potentially based on profile-driven need - John, Szilard, Wilf
High-level design doc (see Presentations)

Mini-Summit Synthesis, May 16, 2017

Attendees

Carter, David Bernholdt, George, Max, Michael Garland, Michael Robson (PPL/UIUC); Millad, Patrick, Piotr, Thomas Herault, Dmitry, Szilard, Toby, Wael, Damien, Andrew, Ashwin, Jiri Dokulil, Naoya Maruyama, Oscar, Pietro Cicotti, Wilf, CJ

Welcome, intro
DHPC++ review
- Compare/contrast with OpenCL, OpenVX, Vulkan
Mini-Summit review
- Who gathered
- Slides should be integrated, some updates
- Overview
- Tabulation of results
- Review of poll/ratification
  - This broader audience also ratified what was listed
  - How do you connect different MPI worlds?
  - Clarify that HiHAT has to stage data across sub-clusters
  - Clarify granularity of work
- Sampling of requirements
  - Active messages (PNNL, Andrew)
  - Futures with data (OCR, Vincent; HPX, Hartmut Kaiser)
  - Callbacks on completion (OCR, Vincent)
  - Dynamic compilation (R-Stream, KART, LLVM)
  - Graph reuse (SWIFT/QuickSched, Stephen) - later
  - Partial I/O (SWIFT/QuickSched)
  - Feedback for auto-tuning (TensorRT)
  - Reproducibility via control
- Additional key issues to debate
Who else should be drawn in
- OpenVX
- Vulkan
- StarPU
- UINTAH
Topics for the future
- Portability, content of tasks - Carter
- Task scheduling for accelerators, SMP - Szilard
- Interoperation, remerging with other efforts, e.g. OpenCL, OpenMP - Szilard
- Performance analysis and monitoring APIs - Oscar
- Defining terms, e.g. future

Partner interactions, NVIDIA usage models, AMD efforts, Apr 18, 2017

Attendees (37) included: Wilf, Kath Knobe, Millad Ghane, CJ Newburn, Bill Feiereisen, Dmitry Liakh, Gordon Brown, Jesmin Tithi, Hans Johansen, Jiri Dokulil, John Feo, Kelly Livingston, Max Grossman, Mauro Bianco, Oscar Hernandez, Patrick Atkinson, Piotr Luszczek, Ron Brightwell, Ruyman Reyes, Stephen Jones, Stephen Olivier, Sunita Chandrasekaran, Szilard Pall, Wael Elwasif, Ashwin Aji, George Bosilca, John Stone, Benoit Meister, Andrew Lumsdaine, Mike Bauer, Tim

CJ offered some recent highlights of partner interactions

ARM, AMD, IBM, NVIDIA engaging
User story, requirements and app updates
- Jim Phillips, NAMD; John Stone, VMD; Ronak Buch on Charm++, David Richards on transport Monte Carlo
- David Keyes, on categories of hierarchical algorithms
DHPC++ workshop, Toronto, May 16; will be a talk on HiHAT https://easychair.org/cfp/dhpcc17
Performance portability workshop, week of Aug 21, is expected to have some coverage of HiHAT

Upcoming HiHAT Mini-Summit

See info here

Teaser on NVIDIA usage models, Stephen Jones, NVIDIA

NVIDIA interested in HiHAT to broaden access of codes to resources in hetero platforms
Also for AI: deep learning and inference have available task-based parallelism
- Offered some background on DNN, RNN
As the lower bound on task granularity drops, more task parallelism may be available
Two ways to leverage fine-grained tasks better:
- reduce overheads for actions like invocation and moving data, instigated by CPU and performed on GPU --> lower-overhead Common Layer
- aggregate tasks in sequences and sub-graphs, that are passed down to target for localized handling --> richer tasking abstractions
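
The second approach, aggregating tasks into sequences, can be illustrated by finding linear chains in a task graph that could be handed to a target as a single unit. This is a sketch only, assuming the graph is a DAG given as a successor map in which every task appears as a key; `fuse_chains` is a hypothetical name.

```python
def fuse_chains(successors):
    """Group tasks into linear sequences.

    `successors` maps each task name to the list of tasks that depend on it.
    Every task, including sinks, must appear as a key.
    """
    preds = {t: [] for t in successors}
    for t, succs in successors.items():
        for s in succs:
            preds[s].append(t)

    def starts_chain(t):
        # A task continues a chain only if it is the sole successor of its sole predecessor.
        if len(preds[t]) != 1:
            return True
        return len(successors[preds[t][0]]) != 1

    chains = []
    for t in successors:
        if not starts_chain(t):
            continue
        chain = [t]
        while len(successors[chain[-1]]) == 1:
            nxt = successors[chain[-1]][0]
            if len(preds[nxt]) != 1:
                break
            chain.append(nxt)
        chains.append(chain)
    return chains


graph = {"a": ["b"], "b": ["c"], "c": ["d", "e"],
         "d": ["f"], "e": ["f"], "f": ["g"], "g": []}
print(fuse_chains(graph))  # [['a', 'b', 'c'], ['d'], ['e'], ['f', 'g']]
```

Seven fine-grained invocations become four submissions here; the per-action overhead is paid once per chain rather than once per task.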
Common requirements induced by inference and deep learning for HiHAT

Teaser on relevant AMD efforts, Ashwin Aji, AMD

ROCm = Radeon Open Compute, rebranding of HSA
- Similar to common layer, thin API that abstracts underlying compute and memory HW
- Task descriptors, lock-free data structures, door bells that trigger task execution
ATMI = Async Tasking and Memory Interface
- Kinds of tasks, low-latency signaling among tasks
Links
- ATMI info: http://gpuopen.com/compute-product/atmi/
- ATMI github: https://github.com/RadeonOpenCompute/atmi
- ROCm platform info: http://gpuopen.com/compute-product/rocm/
- ROCR (Runtime) API: https://github.com/RadeonOpenCompute/ROCR-Runtime

HiHAT overview, PaRSEC, Mar 21, 2017

Attendees included Wilf Pinfold, Benoit Meister, Patrick Atkinson, Schumann, George Bosilca, Piotr Luszczek, Jim Phillips, Stephen Olivier, Max Grossman, Bill Feiereisen, Dmitry Liakh, Wael Elwasif, Jiri Dokulil, Gordon Brown, John Stone, Andrew Lumsdaine, Thomas Herault, Ronak Buch, Ashwin Aji, Bala Seshasayee, Michael Garland, Damien Genet, Aurelien Bouteiller, Oscar Hernandez, PSZ - Paul Szillard?, Timo (Blue Brain), Kelly Livingston, Antonino Tumeo, CJ, several more

CJ gave a HiHAT Overview

Progress in funding, e.g. from US government and vendors
Several posts to web, including from PASC, Charm++, VMD, Habanero tasking micro-benchmark suite
Upcoming report out on progress at GTC, morning of May 9 in San Jose
- Usage models and requirements
- Reveal initial progress on prioritized HiHAT interface design
Highlighted SW architecture of HiHAT, especially regarding pluggable modules, user layer with target-specific decision making with ease of use, and common layer that dispatches to target-specific implementations of actions
Call for more participation in identifying prioritized functionality of HiHAT to leverage, specific requirements and interfaces

George Bosilca of U Tennessee gave an overview of PaRSEC interaction with HiHAT

Data-centric programming environment based on async tasks executing on a hetero distributed environment
Offers a domain-specific language interface
Delivers good performance and scalability
SW architecture is based on modular component architecture of Open MPI, so it's quite amenable to plugging in HiHAT implementations for some of its functionality.
Prioritized wish list
- Portable and efficient API/library for accelerator support - data movement, tasks
- Portable, efficient and inter-operable communication library (UCX, libFabric, …)
  - Moving away from MPI will require an efficient datatype engine
  - Also supported by rest of the software stack (for interoperability)
- Resource management/allocation system
  - PaRSEC supports dynamic resource provisioning, but we need a portable system to bridge the gap between different programming runtimes
- Memory allocator: thread safe, runtime defined properties, arenas (with and without sbrk). (memkind?)
- Generic profiling system, tools integration
- Task-based debugger and performance analysis

Items for potential discussion and investigation

Enumeration - look at interaction with hwloc
Dealing with unstructured data and data types
Data versioning
Serialized streams and subsequences of actions; may want cancellation
Resilience - detection, propagation
Interfaces for data movement, how that relates to MPI, collectives

OCR Review, Feb. 21, 2017

Wilf: Presentation material out on the wiki: OCR usage models is the one for today
Bala - OCR (Open Community Runtime), presents overview of OCR

Wilf: How do you decide on granularity of the task breakdown for AutoOCR? Is there some sort of input file?
Bala: Granularity is entirely the choice of the developer. AutoOCR is pretty straightforward - use a keyword to indicate that a task should be an EDT and annotate data blocks. Compiler will follow that and decorate with OCR API. It makes no decisions regarding granularity for itself. Compiler path is implemented in LLVM which looks at the keywords and generates OCR code.

Wilf: With MPI-Lite can you get some resiliency that you can't get from MPI?
Bala: That's interesting; we've not tried it. Resiliency & MPI-Lite have each been tried in isolation but not together.
Stephen Jones: How do people usually port to OCR?
Bala: People usually try to see if their MPI code can adapt to OCR. Will sacrifice performance while they see if they can implement in OCR. Some constructs like MPI_Wait are not aligned with OCR (which assumes an EDT can run to completion). Once people have adapted to OCR then there's no more reason to run MPI at all - they'll then restructure their program to reduce bottlenecks once they have a much better view of the dataflow graph.
CJ: What about continuation-style semantics?
Bala: A constant back-and-forth: should we stick to the "pure" model of no waits or stalls once a task has started? This would mean we need to split the task around a stall, but would also make data management complex between tasks. Some have looked at continuation semantics as a way to wait & context-switch within a task: moves the complexity into the runtime, which has to implement the continuation. Not many people have been trying this yet.
CJ: That's what Argobots & Qthreads are going after. HiHAT is looking to layer these on top of it to manage such continuations.
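
To illustrate the continuation idea in this exchange, here is a minimal sketch using Python generators: a task suspends at a wait point and the runtime, not the user, holds the continuation. This is an analogy only and does not reflect how OCR, Argobots or Qthreads are implemented; `run` and `worker` are hypothetical names.

```python
from collections import deque


def run(tasks):
    """Round-robin scheduler: a task that yields is suspended and resumed later."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run until the task's next wait point
            ready.append(task)  # the runtime keeps the continuation
        except StopIteration:
            pass                # the task ran to completion


def worker(name, log, steps):
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # stand-in for a wait, e.g. on a message; control returns to the runtime


log = []
run([worker("A", log, 2), worker("B", log, 1)])
print(log)  # ['A:0', 'B:0', 'A:1']
```

In the "pure" run-to-completion model, `worker` would instead have to be split by hand into one task per step, with its loop state passed between them as data.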
Bala presents on app requirements support
Wilf: What's performance looking like right now for e.g. MPI-Lite? How heavy is the task-based overhead at this time?
Bala: For MPI-Lite we've not put any effort into performance, because it's not trying to compete with MPI. OCR uses MPI for communication in this mode. Numbers look promising. At 16k cores OCR does not appear to perform any worse than MPI.
Wilf: How does resiliency play into this, if you've got 16k cores for example?
Bala: Not tried it at that scale yet. It will obviously slow things down. Has been tried out in isolation but not mixed together with performance yet.
Wilf: What about load-balancing? Was that 16k run fairly regular?
Bala: Again, have not yet tried this out in an application. In isolation, have used it at 64-node scale.
- Have tried it out with Mini-AMR and seen some good results but still wrestling with heuristics that are needed. More heuristic intelligence does not seem to provide a lot of benefit because of the overhead of coming up with intelligent heuristics.
Stephen Olivier: Do you have any full-sized apps you have results for?
Michael Wong: Do you have a regular OCR call?
MW: Have you looked at any bottlenecks inside OCR?
Bala: One of the things we're already aware of is the GUID implementation. Making it globally unique can be expensive and in practice you don't always need it to be truly global around the cluster: you only need uniqueness spatially or temporally. Suggests two types of GUID: truly global, and then more local UID.
- Can also probably shave off some overhead in event management (Legion has managed this, for example). You can often re-use events without the overhead of creation/destruction.
Wilf: Here's where we are with the meetings
- We've been using EventBrite for registration but it's getting a bit awkward. Trying to move over to MailChimp. We've got about 69 in the group (30 on the call today).
- Everyone will receive an email in the next week for registration. Use that to register, not EventBrite, in future please.
- Wiki will be kept with link to database of MailChimp info
CJ: Some higher-level comments & context
- Upcoming talks will look at the apps/algos which will be layered on top of HiHAT.
- Lots of good work in progress - appreciate people contributing and sharing
Michael Wong: One thing he's looking at is developing heterogeneous C++. If the group is interested he can send out some information about that. Also going to be running a workshop on ISO C/C++ and other high level heterogeneous C++ programming models here.
CJ: Want to look at these things and decide "would these be called BY HiHAT, or built on top of HiHAT?"
MW: Do have models which can build on top of HiHAT. Can have discussion at a later meeting.

Community meeting, Jan 17, 2017

Agenda

Welcome: Wilf Pinfold
Overview, purpose
Solicit apps that need hierarchical tasking
Solicit usage models
- Fully dynamic to semi-static - Pall
Solicit user stories (requirements)
- Map tasks to multiple GPUs - Dmitry
- Granularity - Pall
- Finite memory - Carter; see "Sandia" on Applications page
- Distributed data structures in finite memory - Toby
- For latency-sensitive apps, any overheads need to be offset by significant gains - Pall
- Hierarchical topology - Toby
- Building libs for finite physical memory; libs cooperating with caller, e.g. via callbacks - John Stone
- Aggregated task groups, recursive task model that enables decomposition - Dmitry, Ashwin Aji
- Data affinity-driven binding and scheduling and data decomposition - Pall
- Move work to data vs. other way around - John
- PGAS support, data affinity and decomposition - Toby
Housekeeping - Wilf

Participants included: Wilfred Pinfold - creator, John Stone, umit@gatech.edu, Wael Elwasif, xg@purdue.edu Xinchen Guo, belak1@llnl.gov, Ruymán Reyes, pa13269@bristol.ac.uk Patrick Atkinson, Max Grossman, gordon@codeplay.com, bala.seshasayee@intel.com, mbianco@cscs.ch, ashwin.aji@amd.com, khalbiniak@icis.pcz.pl - Kamil Halbiniak, roman@icis.pcz.pl - Roman Wyrzykowski, fabien.delalondre@epfl.ch, richards12@llnl.gov, pszilard@kth.se - Pall, Michael Wong, Shekhar Borkar, David Bernholdt, rabuch2@illinois.edu, bill@feiereisen.net, cnewburn@nvidia.com, Piotr Luszczek, liakhdi@ornl.gov, Muthu Baskaran, jesmin.jahan.tithi@intel.com, slolivi@sandia.gov, hcedwar@sandia.gov - Carter, fuchst@nm.ifi.lmu.de - Toby, rbbrigh@sandia.gov - Ron

Signed up, but seemed not to make it: timothy.g.mattson@intel.com, schulzm@llnl.gov, oscar@ornl.gov[conflict], mbauer@nvidia.com, romain.e.cledat@intel.com, aiken@cs.stanford.edu, mfarooqi14@ku.edu.tr, lopezmg@ornl.gov, Benoit Meister, vgrover@nvidia.com, kelly.a.livingston@intel.com, alexandr.nigay@inf.ethz.ch, matthieu.schaller@durham.ac.uk, manjugv@ornl.gov, esaule@uncc.edu, schandra@udel.edu, cychan@lbl.gov, gshipman@lanl.gov, mgarland@nvidia.com, vsarkar@me.com, Didem Unat, maria.garzaran@intel.com, john.feo@pnnl.gov, mike.chu@amd.com, timothee.ewart@epfl.ch, jim@ks.uiuc.edu, n-maruyama@acm.org, pcicotti@sdsc.edu, kk13@rice.edu, srajama@sandia.gov

Kickoff, Dec. 20, 2016

Agenda

Welcome: Wilf Pinfold
Overview, purpose
Approach
Wiki explanation
Next steps
Feedback, expression of interest

Participants (33) included

BillF, CarterE, DavidR, Erik, JimP, KamilH & RomanW, PatrickA, PietroC, SenT, ShekharB, CJ, WilfP, StephenJ, XinchenG, TimM, RomainC, OscarH, AlexandrN, VinodG, KathK, Ashwin Aji, JoshF, GalenS, ManjuG, PallS, MariaG, ...
See calendar entry, if you signed up

Discussion

Glossary suggested by Tim, try not to invent new definitions
Report suggested by Oscar - summary of usage cases could be useful for DoE
How do we keep from getting fragmented? (Tim) Try to bring the community together by focusing on common requirements (Wilf)
Start with usage models, requirements, provisioning constraints, rather than comparing and contrasting specific implementations
We have data and experience to share
Looking to have a phone meeting 3rd Tue each month at 9am PST; some here had standing conflicts; Wilf to try a Doodle poll
Time scale, involvement, outputs?
Are we sold on async tasking? Driven more by efficiency on HW? (Shekhar) Yes (Oscar) Who needs it for what? We need compelling examples of where mainline DoE apps need it. (Dave Richards) Clever use of MPI goes a long way (Tim)
MPI: resilience not well addressed (Wilf) Comparison with MPI is inappropriate, tasking can be done on top of MPI, e.g. two-hot, accelerated MD. It's about the benefit of a computational model, which helps some and not others. (Galen) Tim agrees that MPI is low-level runtime.
Interesting to identify a set of apps that embody tasking, and understand why they chose that model (Galen) Sounds like a potential value proposition (Shekhar).
Characteristics: granularity of tasks - the finer the granularity the less portable the solution, explicit vs. implicit control (DaveR) If task relationships can be described, it can become more portable (Stephen) How will decomposition happen - expert, compiler, runtime? (DaveR)
How do we make this applicable to large, portable code bases, enabling productivity? Where does the tasking model emerge? (DaveR)
What does it mean to have an async environment, what are the critical features? (Josh)
The way to resolving differences at various levels may lie in hierarchy (Kath) Strongly agree with hierarchy (Tim)
Strongly agree with a bottom up approach, with a hierarchical perspective (Tim)

HHAT Usage Meeting Minutes

Contents

Hedgehog, Alex Bardakoff and Tim Blattner, Mar 17, 2020

CUDA Graphs characterization update, Feb 18, 2020

DaCe, CUDA Graphs, Jan 21, 2020

Graphs: HiHAT, QMCPACK, May 21, 2019

Graphs and interoperable building blocks, Apr 23, 2019

Graphs and tasking, Mar 12, 2019

Hierarchical decomposition, Oct 16, 2018

C++ Wrappers for HiHAT, Sep 18

Graphs: HIHAT and DMGR++, Aug 21

Dynamic tasking and HiHAT Graphs, July 17

Report out on C++ Layering on GPUs workshop, June 19

C++, CUDA Graphs, May 15

DARMA: A software stack model for supporting asynchronous, data effects programming, Apr 17

Asynchronous operations support in HiHAT, Mar 20

Hierarchy, Part II Feb 20

Data management interfaces: What we have now, user requirements, what we need, Jan 16, 2018

Hierarchy, Nov 21

Comparison SyCL and HiHAT; Supporting Hetero and Distributed Computing Through Affinity, Michael Wong, Oct 17

Recent developments, profiling, OpenMP, enumeration and other features, Sep 19

StarPU and HiHAT; HiHAT design ideas - Aug 15, 2017

More users, proof of concept plans, high-level design doc, June 20, 2017

Mini-Summit Synthesis, May 16, 2017

Partner interactions, NVIDIA usage models, AMD efforts, Apr 18, 2017

HiHAT overview, PaRSEC, Mar 21, 2017

OCR Review, Feb. 21, 2017

Community meeting, Jan 17, 2017

Kickoff, Dec. 20, 2016
