HHAT Usage Meeting Minutes
This wiki page keeps minutes from phone and face-to-face meetings on usage models, user stories, and applications for heterogeneous hierarchical asynchronous tasking. The most recent meetings are listed first. See [Presentations] for the ongoing agenda for monthly meetings and the materials that were posted.
Asynchronous operations support in HiHAT, Mar 20
Presenter: Wojtek Wasko, NVIDIA
Attendees (24) included
- Andrew, Ashwin Aji, BillF, CJ, Dave Bernholdt, Dmitry, Gordon, Jesun Firoz, John Biddiscombe, Jose Monsalve (UD), Marcin Zalewski, Marcin, Mauro, Michael Wong, Mike Bauer, Piotr, Szilard, Umit, Walid Keyrouz, Wojtek Wasko, Wael, Wilf
- HiHAT Async Operations
- Background
- Reviewed actions, action handles, sync objects
- ActionHndl owned by HiHAT dispatch layer
- SyncObject and its format, management, and interaction are owned by the pluggable implementation
- Requirements
- Action handles
- Used to link according to dependences
- Can logically combine (and/or) multiple ActionHndls into a single result - optimize for the common case of 1 reaching input dependence
- Can query state
- Can obtain underlying (post-dominating) sync object; debate: blocking or non-blocking wrt pluggable implementations
- Sync objects
- Based on object's description or type
- Must be able to completely bypass HiHAT and enable direct communication with legacy code
- Samples
- Provided in a CUDA context; we welcome examples and contributions for other architectures
- Discussion
- Support full async, e.g. in querying sync object that may not have been provided by underlying pluggable implementation?
- Can support both blocking query for sync object and non-blocking "is it available yet" API
- MPI erred on the side of adding more APIs, and had both blocking and non-blocking APIs
- Can always have a blocking implementation
- Preference was for a richer API
- Wilf: Can alloc, movement, invocation, and code loading all be done in a deferred fashion? We might wish not to over-strain memory capacity for code and data, for example. CJ: Yes, everything is async and the underlying pluggable implementation makes those things happen when ready. There can be backpressure at enqueue time, but we don't have a sample implementation of that yet.
- Upcoming community activities
- Workshop at CSCS in Zurich June 10-11. Likely focus is layering for async tasking, e.g. C++/HPX/maybe HiHAT/CUDA Graphs. Let us know if you're interested. John Biddiscombe of CSCS is hosting.
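The action-handle requirements from this meeting (link by dependences, combine with and/or, query state, and both blocking and non-blocking access to the sync object) can be illustrated with a minimal Python sketch. This is not the HiHAT API: `ActionHandle`, `provide_sync_object`, `try_sync_object`, `wait_sync_object`, `all_of`, and `any_of` are hypothetical names chosen for illustration, and the sync object is an opaque value that a pluggable implementation is assumed to supply at some later time.

```python
import threading


class ActionHandle:
    """Hypothetical action handle; names are illustrative, not the HiHAT API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._complete = False
        self._callbacks = []
        self._sync_obj = None
        self._sync_ready = threading.Event()

    # Called by the (assumed) pluggable implementation.
    def provide_sync_object(self, obj):
        self._sync_obj = obj
        self._sync_ready.set()

    def complete(self):
        with self._lock:
            self._complete = True
            callbacks, self._callbacks = self._callbacks, []
        for cb in callbacks:          # run dependents outside the lock
            cb()

    # Called by clients.
    def is_complete(self):
        """Non-blocking state query."""
        with self._lock:
            return self._complete

    def on_complete(self, cb):
        """Run cb when the action completes (immediately if it already has)."""
        with self._lock:
            if not self._complete:
                self._callbacks.append(cb)
                return
        cb()

    def try_sync_object(self):
        """Non-blocking "is it available yet": the sync object or None."""
        return self._sync_obj if self._sync_ready.is_set() else None

    def wait_sync_object(self, timeout=None):
        """Blocking query: wait until the implementation provides the object."""
        if not self._sync_ready.wait(timeout):
            raise TimeoutError("sync object not provided in time")
        return self._sync_obj


def _combine(handles, needed):
    """Return a handle that completes after `needed` of the inputs complete."""
    if not handles:
        raise ValueError("at least one handle is required")
    if len(handles) == 1:             # common case: one reaching input dependence
        return handles[0]
    result = ActionHandle()
    count = [0]
    lock = threading.Lock()

    def one_done():
        with lock:
            count[0] += 1
            fire = count[0] == needed
        if fire:
            result.complete()

    for h in handles:
        h.on_complete(one_done)
    return result


def all_of(*handles):
    """Logical AND of the input handles."""
    return _combine(handles, len(handles))


def any_of(*handles):
    """Logical OR of the input handles."""
    return _combine(handles, 1)
```

Offering both `try_sync_object` and `wait_sync_object` mirrors the preference recorded above for a richer API: a blocking variant can always be provided, while the non-blocking variant lets a fully asynchronous client poll without stalling.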
Hierarchy, Part II, Feb 20
Attendees included
- CJ, John Stone, Antonino, Gordon, Kamil, Roman, Marcin, Mauro, Michael Garland, Michael Wong, Millad, Piotr, Siegfried, Szilard, Walid, JamesB, Mauro, AndrewL, David Bernholdt, Dmitry, Jose, Oscar, MarcinZ, Stephen, rfvander, manz551, Wael Elwasif, Umit Catalyurek, John Biddiscombe, Wilf, Ashwin
- Recent community activities
- LifeSci Tech Summit at NVIDIA - discussions about CUDA Graphs
- OpenMP - interest in implementing HiHAT under OpenMP
- CSCS - considering a workshop on layering infrastructure under C++, e.g. HPX / HiHAT / CUDA Graphs. John Biddiscombe extended an invitation to help prepare for that.
- ECP Annual meeting - HiHAT was mentioned a couple of times as a possible candidate for underlying infrastructure, several side meetings, including for memory and storage interfaces and OpenMP
- Possible approaches for hierarchical dense linear algebra kernels (Vivek)
- Express code simply - flat or polyhedral frameworks
- AndrewL: is template metaprogramming considered simple? Vivek: was focused on representation vs. linguistic syntax, which is orthogonal
- Explore space of data distributions - hierarchical, block/cyclic, AoS/SoA
- Explore hierarchical decompositions - async, hierarchical place trees
- Can place onto various places in the hierarchical tree, thereby separating semantics from perf tuning
- Explore affinity hints/declarations - temporal locality
- ExaTensor: hierarchical processor of hierarchical tensor algebra - Dmitry Lyakh
- Hierarchical decomposition
- Replicate with prefetch
- Currently focused on block sparse, hierarchical, with no predefined regular pattern -> heuristics and predetermined data mappings
- "Longing for Portability of Performance" (Weather/climate) - Mauro Bianco
- Single source code
- Stencils with complex dependencies
- 10s-100s per time step, fairly big tasks but don't necessarily fill a GPU, more than one thread
- conditional execution, limited halo lines
- Data parallel and task-oriented
- From a dev point of view
- User-managed granularity is not portable - auto *splitting* isn't generalizable
- Aggregation seems more applicable - thread switching -> function calls, inlining. Still not universally optimal, e.g. parallel scan
- Express the *finest granularity*, else function call too big, coarsening with inlining is possible. Inspector/executor model? May need new *algorithmic patterns/motifs* to make data parallel, e.g. with a keyword.
- CJ: what are the prospects for expressing parallelism and data in a hierarchical way to a scheduler which can decompose as needed, vs. reactively? Limitations wrt particular application domains? Mauro: Sequoia tried to do some of this; limited progress. Vivek: compiler support helps with granularity, runtime helps with the mapping; expressing at fine-grained level
- Looking forward
- Transition from static to dynamic
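The point made in this meeting about hierarchical place trees, that work can be placed at different levels of the tree so that semantics stay separate from performance tuning, can be illustrated with a small Python sketch. It is only inspired by the idea as summarized in these minutes and is not taken from any of the runtimes discussed; the `Place` class, its methods, and the place names are all hypothetical.

```python
from collections import deque


class Place:
    """One node of a hypothetical place tree (e.g. node -> socket -> core)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.tasks = deque()

    def submit(self, task):
        """Placing a task higher in the tree makes it visible to more workers."""
        self.tasks.append(task)

    def next_task(self):
        """Look for work at this place first, then at each ancestor in turn."""
        place = self
        while place is not None:
            if place.tasks:
                return place.tasks.popleft()
            place = place.parent
        return None


# Placement is a tuning decision; the task bodies themselves do not change.
node = Place("node")
socket0 = Place("socket0", parent=node)
core0 = Place("core0", parent=socket0)
core1 = Place("core1", parent=socket0)

core0.submit("tile(0,0)")      # kept close to core0's data
socket0.submit("tile(0,1)")    # any core on socket0 may take it
node.submit("reduce")          # any worker under the node may take it
```

In this sketch a worker bound to `core1` would pick up `tile(0,1)` and then `reduce`, while `tile(0,0)` remains reserved for `core0`. Moving a `submit` call to a different place changes locality and load balance without touching the task itself.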
Data management interfaces: What we have now, user requirements, what we need, Jan 16, 2018
Attendees included Wilf, George Bosilca, Gordon Brown, Jeff Larkin, Millad Ghane, John Stone, Jose Monsalve, Damien Genet, Michael Garland, Stephen Olivier, Bill Feiereisen, James Beyer, Antonino Tumeo, David Bernholdt, John Feo, Marcin, Szilard Pall, manz551, Siegfried Benkner
Interest in enumeration of resources: Gordon @ Codeplay
Feedback
- PNNL has a project for HIVE on Exascale platforms that may include GPUs. John Feo: Content here is relevant for scheduling, data marshaling.
- Umpire/CHAI guys not here today, but evaluating that
- Multi-dimensional support (George): may be done above the HiHAT layer, these interfaces seem adequate
- What about thread safety? (George)
- Often use memory contexts that are shared among threads on the same socket
- HiHAT clients (above) can coordinate between what's been registered, e.g. a thread-safe implementation, and what that client wants
- New implementations could be registered dynamically
Hierarchy, Nov 21
Attendees included: James Beyer, Jeff Larkin, Max Grossman, Vivek Sarkar, David Bernholdt, Dmitry Lyakh, Michael Garland, Mike Bauer, Piotr, Ruyman Reyes, Siegfried Benkner, Stephen Olivier, Szilard Pall, Wael Elwasif, Wilf, Kamil Halbiniak & Roman, Millad Ghane, CJ Newburn
- Vivek Sarkar, GA Tech
- Habanero, CnC, OCR
- Places - can distribute data, can use type system to distinguish between local/global
- Locality-aware scheduling using hierarchical place tree
- Different abstractions for diff HW, e.g. diff # levels and kinds of memory
- Affinity annotations - can express preferences vs. hard assignments
- Can pass abstract work down the hierarchy, do work stealing at lower levels
- Supports spatial and temporal sharing
- Undirected graph, not just a tree; use trees where profitable
- Multiple levels of parallelism and heterogeneity
- Dmitry Lyakh, ORNL
- ExaTensor, a distributed tensor library based on hierarchical data representation
- Adaptive dynamic block-sparse representation of many-body tensors
- Resolution of each block is dynamically adjustable
- Recursive definition of storage, computational resources.
- Data centric - induces task decomposition. But tasks can follow data and be aggregated. Data storage granularity and task granularity are decoupled.
- Computational resources are encapsulated as virtual processors - can do linear algebra, tensor ops
- Targeted for Summit
- Mike Bauer, NVIDIA
- Legion
- Adaptive mesh refinement, algebraic multigrid
- Very dynamic behaviors - partitioning changes, data created and destroyed at runtime, depends on domain-specific knowledge
- Folks at LANL now working on AMR in Legion
- Hierarchy
- Levels that correspond to levels of details; may correspond to different data structures
- Hierarchical decomposition of the same data structures into different levels
- Requirements
- Want primitives for describing partitions efficiently and effectively - can be error prone
- Capture descriptions in DSL that apply across many different kinds of apps
- Discussion
- Provisional/on-demand decomposition
- ExaTensor: User specifies work to do in high-level DSL. Automatic decomposition, e.g. based on maintaining arithmetic intensity, data transfer bandwidths. Not much to trade-off since limited by DSL. Expected a need for further generalization as the scope of apps is expanded.
- Legion: Often 2 ways to decompose - 1) breadth, across nodes or processes within node, 2) depth, e.g. NUMA, across GPUs or SMs. Both may need to change dynamically and that needs to be efficient. No one size fits all. Apps describe the partitioning algorithms. Mappers pick the best decomposition for a given target. Tunable parameters are specified to the mapper.
- Vivek: dynamic code generation has a part to play; specialize to runtime-determined data distribution; has a student experimenting with this on CNNs for inner loop bodies wrt data characteristics
- Follow up
- Pick some concrete examples and work them through
- Trade-offs based on cost models, e.g. recompute vs. fetch halo data
- MikeB: Recomputing could make code complexity high, hard to maintain. Could be relevant for Halide.
- Dmitry: compression could be a factor
- Jim Demmel @ Berkeley: Communication-avoiding algorithms
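The follow-up item on cost-model trade-offs (recompute vs. fetch halo data) can be made concrete with a simple latency/bandwidth model. The Python sketch below is an illustration under stated assumptions, not a model presented at the meeting: the function names, the linear cost formulas, and all of the numbers are hypothetical.

```python
def fetch_time(halo_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed transfer cost: fixed latency plus size over bandwidth."""
    return latency_s + halo_bytes / bandwidth_bytes_per_s


def recompute_time(halo_cells, flops_per_cell, flops_per_s):
    """Assumed recompute cost: total floating-point work over sustained rate."""
    return halo_cells * flops_per_cell / flops_per_s


def halo_strategy(halo_cells, bytes_per_cell, flops_per_cell,
                  latency_s, bandwidth_bytes_per_s, flops_per_s):
    """Pick whichever option the model predicts to be cheaper."""
    t_fetch = fetch_time(halo_cells * bytes_per_cell, latency_s,
                         bandwidth_bytes_per_s)
    t_recompute = recompute_time(halo_cells, flops_per_cell, flops_per_s)
    return "recompute" if t_recompute < t_fetch else "fetch"


# Illustrative numbers only: 4096 halo cells of 8 bytes each,
# 5 us link latency, 10 GB/s bandwidth, 1 TFLOP/s sustained compute.
cheap = halo_strategy(4096, 8, 50, 5e-6, 10e9, 1e12)
costly = halo_strategy(4096, 8, 50_000, 5e-6, 10e9, 1e12)
```

With these made-up parameters the model favors recomputing when each halo cell costs 50 flops and fetching when it costs 50,000. A fuller model would also have to account for the points raised in the discussion, such as code complexity and compression.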
Comparison of SYCL and HiHAT; Supporting Hetero and Distributed Computing Through Affinity, Michael Wong, Oct 17
Attendees, including:
- Michael Wong, Andrew Lumsdaine, Carter Edwards, Jiri Dokulil, Kath Knobe, David Bernholdt, George Bosilca, Gordon Brown, Jeff Larkin, Marcin Zalewski, Mauro Bianco, Max Grossman, M Copik, Michael Garland, Piotr, Ruyman Reyes, Siegfried Benkner, Thomas Herault, Wael Elwasif, Wilf, rfvander, Stephen Olivier, Szilard Pall, Wojtek Wasko, James Beyer, Sean Treichler, Bill Feiereisen, Dmitry Liakh, Jesun Firoz, dgenet, Millad, OscarH, Ashwin Aji
- SYCL
- Retargetable from C++, entirely standard C++ (no keywords, no pragmas)
- Single-source host and device compilation model
- Separate storage and access of data, specify where data stored/allocated
- Task graphs
- C++ lacks ABI, have to provide symbol name for kernel, could be improved with static reflection
- Several apps on top of SYCL now: vision for self-driving cars, ML with Eigen and Tensorflow, parallel STL with ranges, SYCL-BLAS, Game AI
- Comparison with Kokkos, HPX, Raja
- All: C++11/14, execution policies to separate concerns, shape of data, aiming to be subsumed by future C++
- SYCL: mem storage vs. data model, dependence graph, single source/multi-compilation
- Kokkos: mem space, layout space
- HPX: distributed computing nodes, execution policy with executors
- Raja: IndexSet, segments
- Proposing C++2020 hetero interface
- SYCL for HiHAT
- SYCL aligning more with C++ futures/executor/coroutines
- Exploring HiHAT vs. OpenCL as low-level interface, layered below ComputeCpp, which is target agnostic
- Plug in binary blobs for vendor-specific components
- Async API
- Enumeration of device-specific capabilities
- Time-constrained ops, for safety-critical SYCL
- In context of safe and secure C++
- Removing ambiguities and undefined behaviors
- Componentized, multi-layer, well tested
- Could use for alloc, copy, invoke
- Codeplay biz and HiHAT
- Licensing, IP protection
- HiHAT in their stack
- Certification for HiHAT-compliant devices/implementations
- SC17 BoF on Distributed/Hetero C++ in HPC
- Workshop on Distributed/Hetero Programming in C++, IWOCL, Oxford
- C++ P0796R0: Support Hetero and Distr Computing Thru Affinity
- Resource querying - dynamic? hwloc, which is primarily hierarchical?
- Binding and allocation
- Affinity - relative? migration?
- CJ: Please extend to support an async interface to allocation and affinitization
- Wilf: app developer manages policies like affinity?
Recent developments, profiling, OpenMP, enumeration and other features, Sep 19
Attendees, including:
- Jeff Larkin, Francois Tessier, Ronak Buch, Marc Snir, Piotr, Samuel Thibault, Wael Elwasif, Wilf, Carter Edwards, John Stone, Benoit Meister, UD CIS, Marcin Zalewski (PNNL), Walid Keyrouz, Firo, Sunita Chandrasekaran, James Beyer, Stephen Olivier, Galen Shipman, Hartmut Kaiser, Jesun, Jiri Dokulil, Stephen Jones, Mike Bauer, Millad Ghane, Siegfried Benkner
- Recent developments
- Repo with CLA being readied at Modelado
- Good feedback and interactions at DoE Perf Portability Workshop in August, led to more of a common view, more collaboration
- HiHAT paper accepted at the Post-Moore's Era Supercomputing (PMES) workshop at SC17
- Expecting a 2-hour meeting on HiHAT at SC17 to share progress, usage and plans for HiHAT
- Affinity BoF @ SC17 (Emmanuel Jeannot) looks to be relevant; let's plan some pre-work
- Profiling (Samuel Thibault)
- What can be profiled, what can be done (callbacks now, counters coming)
- Request: share input on additional states that should be profiled
- Walked through an example with different allocations, copies, invocation on CPU, GPU
- Showed integration with StarPU that uses Paje.trace and vite from its tool suite
- OpenMP above and below HiHAT (James Beyer)
- Use omp parallel for inside a task. Can warm up an OpenMP hot team with a separate invocation off of the critical path.
- Replacement of GOMP_parallel with HiHAT trampoline - HiHAT-based scheduler for improved retargetability
- OpenMP affinity - progressing from HiHAT ignoring it, being informed about it, and invoking from one set of threads to another set
- Resource subsets
- Looking at trying this out in Clang
- Enumeration and other features in HiHAT (CJ Newburn)
- Currently using inlined source code with static initializers, one for each kind of platform
- Integrating with hwloc to automate that. Will support NUMA nodes, CPUs and memories that hang off of them, accelerators (FPGA, GPU, DLA, PVA, etc.) and their memories
- Plugin of target-specific implementations
- Each API, at user and common layer
- Allocator, at each memory instance
- Selected by resources (e.g. copy endpoints) and a chooser
- Solicitations
- Language layering, e.g. task implementations
- Memory management, e.g. allocator tradeoffs - per-thread nurseries, same-size allocators, etc.
- Thread management, e.g. Qthreads, Argobots
- Client layering, e.g. SHMEM
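The plugin notes from this meeting (target-specific implementations registered per API and selected by resources, such as copy endpoints, plus a chooser) can be sketched as a small registry. The Python below is an assumption-laden illustration rather than HiHAT code: `CopyRegistry`, its methods, the endpoint kinds, and the two copy functions are hypothetical.

```python
class CopyRegistry:
    """Hypothetical registry of copy implementations keyed by endpoint kinds."""

    def __init__(self, chooser=None):
        self._impls = {}
        # Default chooser (an assumption): prefer the most recently registered.
        self._chooser = chooser or (lambda candidates: candidates[-1])

    def register(self, src_kind, dst_kind, impl):
        """Implementations may be added at any time, i.e. dynamically."""
        self._impls.setdefault((src_kind, dst_kind), []).append(impl)

    def copy(self, src_kind, dst_kind, data):
        """Select by resources (the endpoints), then let the chooser decide."""
        candidates = self._impls.get((src_kind, dst_kind))
        if not candidates:
            raise LookupError(f"no copy registered for {src_kind}->{dst_kind}")
        return self._chooser(candidates)(data)


def generic_copy(data):
    """Stand-in for a portable fallback implementation."""
    return ("generic", bytes(data))


def tuned_copy(data):
    """Stand-in for a target-specific implementation."""
    return ("tuned", bytes(data))


registry = CopyRegistry()
registry.register("cpu", "gpu", generic_copy)
registry.register("cpu", "gpu", tuned_copy)    # registered later, so preferred

conservative = CopyRegistry(chooser=lambda candidates: candidates[0])
conservative.register("cpu", "gpu", generic_copy)
conservative.register("cpu", "gpu", tuned_copy)
```

Keeping the chooser separate from the registry lets a tuner change the selection policy without touching the registered implementations.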
StarPU and HiHAT; HiHAT design ideas - Aug 15, 2017
Attendees (33), including:
- Andrew Lumsdaine, Antonino Tumeo, Ashwin, Benoit Meister, CJ Newburn, D Genet, George Bosilca, Gordon Brown, James Beyer, Jose Monsalve, Kath Knobe, Marc Snir, Max Grossman, Michael Garland, Millad Ghane, Naoya Maruyama, Oscar Hernandez, Pall Szilard, Piotr, Ronak Buch, Samuel Thibault, Siegfried Benkner, Stephen Jones, Stephen Olivier, Sunita Chandrasekaran, Thomas Herault, Wael Elwasif, Wilf Pinfold, Wojciech Wasko
- StarPU, Samuel Thibault, INRIA
- Supports sequential semantics, general retargetability, out of core, clusters, memory consumption control
- Can simulate - as if performance
- HiHAT wish list
- Low-overhead interface to HW layers
- Reusable components - perf models, allocators, tracing, debugging
- Task-based interface, plus OpenMP, helpers for outlining and marshaling
- Interoperability
- Shared event management - user-defined events, interop with what's not covered by HiHAT
- Memory alloc - uniform low-level API, efficient sub-allocator, same-size memory pools, hierarchical balancing
- Disk support: store/key/value
- Prioritized actions
- Some responses
- Very good alignment with HiHAT
- Pluggable implementations can provide memory management; tuner can provide an allocator per memory device, or share them
- Event system will provide abstractions over implementation-specific events and semaphores in memory
- Simulation could be an interesting service above HiHAT
- Something above HiHAT is in control
- Emulation/simulation above HiHAT acts as if the actions really happened
- Possible joint investigations
- Whether anything special is needed for specialized allocators, memory load balancing
- Interested: Samuel, Marc, Benoit, George, Ashwin, Max
- Compare with MPI allocators that deal with fixed size and conflict avoidance, consider progress thread implications, developments in MPI 4 endpoints (Marc Snir)
- Tracing formats and debugging
- Interested: Samuel, Max
- Implementation and interface exploration
- Prioritized actions
- Integration of MPI wait with action dependence system
- Interested: Marc, Jesus, Samuel, Andrew, George
- Discussion
- How do you handle running out of memory? See paper on memory control on StarPU website. Increase the granularity of what's submitted for execution.
- HiHAT design teasers, Sean Treichler and CJ Newburn
- Going stateless
- Resource handling
- Memory abstraction and traits
- Execution scopes
- Please send mail to cnewburn@gmail.com if you're interested in more offline discussion/presentation on these topics.
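The "same-size memory pools" item on the StarPU wish list, together with the question of what happens when memory runs out, can be illustrated with a minimal fixed-size block pool. This Python sketch is not StarPU or HiHAT code; `FixedSizePool` and its methods are hypothetical, and returning `None` on exhaustion is just one assumed way to let a caller apply back-pressure.

```python
class FixedSizePool:
    """Hypothetical same-size block pool carved out of one contiguous buffer."""

    def __init__(self, block_size, num_blocks):
        self.block_size = block_size
        self.buffer = bytearray(block_size * num_blocks)
        # Stack of free block indices; block 0 is handed out first.
        self._free = list(range(num_blocks - 1, -1, -1))
        self._in_use = set()

    def alloc(self):
        """Return a block index, or None when the pool is exhausted."""
        if not self._free:
            return None
        index = self._free.pop()
        self._in_use.add(index)
        return index

    def free(self, index):
        """Return a block to the pool; reject double frees."""
        if index not in self._in_use:
            raise ValueError(f"block {index} is not allocated")
        self._in_use.remove(index)
        self._free.append(index)

    def view(self, index):
        """Writable view of one allocated block's bytes."""
        if index not in self._in_use:
            raise ValueError(f"block {index} is not allocated")
        start = index * self.block_size
        return memoryview(self.buffer)[start:start + self.block_size]


pool = FixedSizePool(block_size=64, num_blocks=2)
first = pool.alloc()
second = pool.alloc()
exhausted = pool.alloc()       # None: caller could defer or throttle submission
```

Because every block has the same size, allocation and release are constant-time stack operations with no fragmentation inside the pool.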
More users, proof of concept plans, high-level design doc, June 20, 2017
Attendees (43): Antonino Tumeo, Ashwin Aji, BillF, David Bernholdt, DebalinaB, DGenet, Dmitry Liakh, Firo017, Gordon Brown, Hans Johansen, James Beyer, Jesun (PNNL), Jiri Dokulil, John Stone, Kamil Halbiniak and Roman, Kath Knobe, Keeran Brabazon (ARM), Mauro Bianco, Michael Garland, Mike Bauer, Mike Chu, Millad Ghane, Minu455, Naoya Maruyama, Piotr, Rob Neely, Ronak Buch, Ruyman Reyes, Samuel Thibault, SharanA (NVIDIA Tegra), Siegfried Benkner (U Vienna/StarPU), Stephen Olivier, Szilard Pall, Thomas Herault, Tim Blattner, Vincent Cave, Wael Elwasif (ORNL), CJ, ...
- Some new participants: NIST/HTGS, UINTAH, StarPU, more from NVIDIA, e.g. automotive
- Tim Blattner presented slides (see Presentations)
- Proof of concept
- Review of POC plan doc (see Presentations)
- John Stone, VMD and molecular orbitals
- Some discussion of the benefits of dynamic scheduling
- There's a value to progressive back off on the dynamism of scheduling, potentially based on profile-driven need - John, Szilard, Wilf
- High-level design doc (see Presentations)
Mini-Summit Synthesis, May 16, 2017
Attendees
Carter, David Bernholdt, George, Max, Michael Garland, Michael Robson (PPL/UIUC); Millad, Patrick, Piotr, Thomas Herault, Dmitry, Szilard, Toby, Wael, Damien, Andrew, Ashwin, Jiri Dokulil, Naoya Maruyama, Oscar, Pietro Cicotti, Wilf, CJ
- Welcome, intro
- DHPC++ review
- Compare/contrast with OpenCL, OpenVX, Vulkan
- Mini-Summit review
- Who gathered
- Slides should be integrated, some updates
- Overview
- Tabulation of results
- Review of poll/ratification
- This broader audience also ratified what was listed
- How do you connect different MPI worlds?
- Clarify that HiHAT has to stage data across sub-clusters
- Clarify granularity of work
- Sampling of requirements
- Active messages (PNNL, Andrew)
- Futures with data (OCR, Vincent; HPX, Hartmut Kaiser)
- Callbacks on completion (OCR, Vincent)
- Dynamic compilation (R-Stream, KART, LLVM)
- Graph reuse (SWIFT/QuickSched, Stephen) - later
- Partial I/O (SWIFT/QuickSched)
- Feedback for auto-tuning (TensorRT)
- Reproducibility via control
- Additional key issues to debate
- Who else should be drawn in
- OpenVX
- Vulkan
- StarPU
- UINTAH
- Topics for the future
- Portability, content of tasks - Carter
- Task scheduling for accelerators, SMP - Szilard
- Interoperation, remerging with other efforts, e.g. OpenCL, OpenMP - Szilard
- Performance analysis and monitoring APIs - Oscar
- Defining terms, e.g. future
Partner interactions, NVIDIA usage models, AMD efforts, Apr 18, 2017
Attendees (37) included: Wilf, Kath Knobe, Millad Ghane, CJ Newburn, Bill Feiereisen, Dmitry Liakh, Gordon Brown, Jesmin Tithi, Hans Johansen, Jiri Dokulil, John Feo, Kelly Livingston, Max Grossman, Mauro Bianco, Oscar Hernandez, Patrick Atkinson, Piotr Luszczek, Ron Brightwell, Ruyman Reyes, Stephen Jones, Stephen Olivier, Sunita Chandrasekaran, Szilard Pall, Wael Elwasif, Ashwin Aji, George Bosilca, John Stone, Benoit Meister, Andrew Lumsdaine, Mike Bauer, Tim
CJ offered some recent highlights of partner interactions
- ARM, AMD, IBM, NVIDIA engaging
- User story, requirements and app updates
- Jim Phillips, NAMD; John Stone, VMD; Ronak Buch on Charm++, David Richards on transport Monte Carlo
- David Keyes, on categories of hierarchical algorithms
- DHPC++ workshop, Toronto, May 16; will be a talk on HiHAT https://easychair.org/cfp/dhpcc17
- Performance portability workshop, week of Aug 21, is expected to have some coverage of HiHAT
Upcoming HiHAT Mini-Summit
- See info here
Teaser on NVIDIA usage models, Stephen Jones, NVIDIA
- NVIDIA interested in HiHAT to broaden access of codes to resources in hetero platforms
- Also for AI: deep learning and inference have available task-based parallelism
- Offered some background on DNN, RNN
- As the lower bound on task granularity drops, more task parallelism may be available
- Two ways to leverage fine-grained tasks better:
- reduce overheads for actions like invocation and moving data, instigated by CPU and performed on GPU --> lower-overhead Common Layer
- aggregate tasks in sequences and sub-graphs, that are passed down to target for localized handling --> richer tasking abstractions
- Common requirements induced by inference and deep learning for HiHAT
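The two ways of leveraging fine-grained tasks listed above (lower per-action overhead, and aggregating tasks into sequences or sub-graphs handed to the target) can be compared with a back-of-the-envelope model. The Python sketch below assumes a single fixed overhead is paid once per launch regardless of how many tasks the launch contains; that assumption, the function name, and the numbers are illustrative and were not presented at the meeting.

```python
def efficiency(task_time_s, overhead_s, tasks_per_launch=1):
    """Fraction of time spent on useful work under the assumed model.

    One fixed overhead (invocation, data-movement setup) is charged per
    launch, and a launch carries `tasks_per_launch` tasks of equal length.
    """
    useful = task_time_s * tasks_per_launch
    return useful / (useful + overhead_s)


# Illustrative numbers only: 5 us tasks with 10 us of launch overhead.
one_by_one = efficiency(5e-6, 10e-6)                       # about one third
lower_overhead = efficiency(5e-6, 2e-6)                    # thinner common layer
aggregated = efficiency(5e-6, 10e-6, tasks_per_launch=16)  # sub-graph launch
```

Under this model both levers help: cutting the overhead from 10 us to 2 us raises efficiency to about 71%, while aggregating 16 tasks per launch at the original overhead raises it to about 89%.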
Teaser on relevant AMD efforts, Ashwin Aji, AMD
- ROCm = Radeon Open Compute, rebranding of HSA
- Similar to common layer, thin API that abstracts underlying compute and memory HW
- Task descriptors, lock-free data structures, door bells that trigger task execution
- ATMI = Async Tasking and Memory Interface
- Kinds of tasks, low-latency signaling among tasks
- Links
- ATMI info: http://gpuopen.com/compute-product/atmi/
- ATMI github: https://github.com/RadeonOpenCompute/atmi
- ROCm platform info: http://gpuopen.com/compute-product/rocm/
- ROCR (Runtime) API: https://github.com/RadeonOpenCompute/ROCR-Runtime
HiHAT overview, PaRSEC, Mar 21, 2017
Attendees included Wilf Pinfold, Benoit Meister, Patrick Atkinson, Schumann, George Bosilca, Piotr Luszczek, JimPhillips, Stephen Olivier, Max Grossman, Bill Feiereisen, Dmitry Liakh, Wael Elwasif, Jiri Dokulil, Gordon Brown, John Stone, Andrew Lumsdaine, Thomas Herault, Ronak Buch, Ashwin Aji, Bala Seshasayee, Michael Garland, Damien Genet, Aurelien Bouteiller, Oscar Hernandez, PSZ - Paul Szillard?, Timo (Blue Brain), Kelly Livingston, Antonino Tumeo, CJ, several more
CJ gave a HiHAT Overview
- Progress in funding, e.g. from US government and vendors
- Several posts to web, including from PASC, Charm++, VMD, Habanero tasking micro-benchmark suite
- Upcoming report out on progress at GTC, morning of May 9 in San Jose
- Usage models and requirements
- Reveal initial progress on prioritized HiHAT interface design
- Highlighted SW architecture of HiHAT, especially regarding pluggable modules, user layer with target-specific decision making with ease of use, and common layer that dispatches to target-specific implementations of actions
- Call for more participation in identifying prioritized functionality of HiHAT to leverage, specific requirements and interfaces
George Bosilca of U Tennessee gave an overview of PaRSEC interaction with HiHAT
- Data-centric programming environment based on async tasks executing on a hetero distributed environment
- Offers a domain-specific language interface
- Delivers good performance and scalability
- SW architecture is based on modular component architecture of Open MPI, so it's quite amenable to plugging in HiHAT implementations for some of its functionality.
- Prioritized wish list
- Portable and efficient API/library for accelerator support - data movement, tasks
- Portable, efficient and inter-operable communication library (UCX, libFabric, …)
- Moving away from MPI will require an efficient datatype engine
- Also supported by rest of the software stack (for interoperability)
- Resource management/allocation system
- PaRSEC supports dynamic resource provisioning, but we need a portable system to bridge the gap between different programming runtimes
- Memory allocator: thread safe, runtime defined properties, arenas (with and without sbrk). (memkind?)
- Generic profiling system, tools integration
- Task-based debugger and performance analysis
Items for potential discussion and investigation
- Enumeration - look at interaction with HWLOC
- Dealing with unstructured data and data types
- Data versioning
- Serialized streams and subsequences of actions; may want cancellation
- Resilience - detection, propagation
- Interfaces for data movement, how that relates to MPI, collectives
OCR Review, Feb. 21, 2017
- Wilf: Presentation material out on the wiki: OCR usage models is the one for today
- Bala - OCR (Open Community Runtime), presents overview of OCR
- Wilf: How do you decide on granularity of the task breakdown for AutoOCR? Is there some sort of input file?
- Bala: Granularity is entirely the choice of the developer. AutoOCR is pretty straightforward - use a keyword to indicate that a task should be an EDT and annotate data blocks. Compiler will follow that and decorate with OCR API. It makes no decisions regarding granularity for itself. Compiler path is implemented in LLVM which looks at the keywords and generates OCR code.
- Wilf: With MPI-Lite can you get some resiliency that you can't get from MPI?
- Bala: That's interesting; we've not tried it. Resiliency & MPI-Lite have each been tried in isolation but not together.
- Stephen Jones: How do people usually port to OCR?
- Bala: People usually try to see if their MPI code can adapt to OCR. Will sacrifice performance while they see if they can implement in OCR. Some constructs like MPI_Wait are not aligned with OCR (which assumes an EDT can run to completion). Once people have adapted to OCR then there's no more reason to run MPI at all - they'll then restructure their program to reduce bottlenecks once they have a much better view of the dataflow graph.
- CJ: What about continuation-style semantics?
- Bala: A constant back-and-forth: should we stick to the "pure" model of no waits or stalls once a task has started? This would mean we need to split the task around a stall, but would also make data management complex between tasks. Some have looked at continuation semantics as a way to wait & context-switch within a task: moves the complexity into the runtime, which has to implement the continuation. Not many people have been trying this yet.
- CJ: That's what Argobots & Qthreads are going after. HiHAT is looking to layer these on top of it to manage such continuations.
- Bala presents on app requirements support
- Wilf: What's performance looking like right now for e.g. MPI-Lite? How heavy is the task-based overhead at this time?
- Bala: For MPI-Lite we've not put any effort into performance, because it's not trying to compete with MPI. OCR uses MPI for communication in this mode. Numbers look promising. At 16k cores OCR does not appear to perform any worse than MPI.
- Wilf: How does resiliency play into this, if you've got 16k cores for example?
- Bala: Not tried it at that scale yet. It will obviously slow things down. Has been tried out in isolation but not mixed together with performance yet.
- Wilf: What about load-balancing? Was that 16k run fairly regular?
- Bala: Again, have not yet tried this out in an application. In isolation, have used it at 64-node scale.
- Have tried it out with Mini-AMR and seen some good results but still wrestling with heuristics that are needed. More heuristic intelligence does not seem to provide a lot of benefit because of the overhead of coming up with intelligent heuristics.
- Stephen Olivier: Do you have any full-sized apps you have results for?
- Michael Wong: Do you have a regular OCR call?
- MW: Have you looked at any bottlenecks inside OCR?
- Bala: One of the things we're already aware of is the GUID implementation. Making it globally unique can be expensive and in practice you don't always need it to be truly global around the cluster: you only need uniqueness spatially or temporally. Suggests two types of GUID: truly global, and then more local UID.
- Can also probably shave off some overhead in event management (Legion has managed this, for example). You can often re-use events without the overhead of creation/destruction.
- Wilf: Here's where we are with the meetings
- We've been using EventBrite for registration but it's getting a bit awkward. Trying to move over to MailChimp. We've got about 69 in the group (30 on the call today).
- Everyone will receive an email in the next week for registration. Use that to register, not EventBrite, in future please.
- Wiki will be kept with link to database of MailChimp info
- CJ: Some higher-level comments & context
- Upcoming talks will look at the apps/algos which will be layered on top of HiHAT.
- Lots of good work in progress - appreciate people contributing and sharing
- Michael Wong: One thing he's looking at is developing heterogeneous C++. If the group is interested he can send out some information about that. Also going to be running a workshop on ISO C/C++ and other high level heterogeneous C++ programming models here.
- CJ: Want to look at these things and decide "would these be called BY HiHAT, or built on top of HiHAT?"
- MW: Do have models which can build on top of HiHAT. Can have discussion at a later meeting.
Community meeting, Jan 17, 2017
Agenda
- Welcome: Wilf Pinfold
- Overview, purpose
- Solicit apps that need hierarchical tasking
- Solicit usage models
- Fully dynamic to semi-static - Pall
- Solicit user stories (requirements)
- Map tasks to multiple GPUs - Dmitry
- Granularity - Pall
- Finite memory - Carter; see "Sandia" on Applications page
- Distributed data structures in finite memory - Toby
- For latency-sensitive apps, any overheads need to be offset by significant gains - Pall
- Hierarchical topology - Toby
- Building libs for finite physical memory; libs cooperating with caller, e.g. via callbacks - John Stone
- Aggregated task groups, recursive task model that enables decomposition - Dmitry, Ashwin Aji
- Data affinity-driven binding and scheduling and data decomposition - Pall
- Move work to data vs. other way around - John
- PGAS support, data affinity and decomposition - Toby
- Housekeeping - Wilf
Participants included: Wilfred Pinfold - creator, John Stone, umit@gatech.edu, Wael Elwasif, xg@purdue.edu Xinchen Guo, belak1@llnl.gov, Ruymán Reyes, pa13269@bristol.ac.uk Patrick Atkinson, Max Grossman, gordon@codeplay.com, bala.seshasayee@intel.com, mbianco@cscs.ch, ashwin.aji@amd.com, khalbiniak@icis.pcz.pl - Kamil Halbiniak, roman@icis.pcz.pl - Roman Wyrzykowski, fabien.delalondre@epfl.ch, richards12@llnl.gov, pszilard@kth.se - Pall, Michael Wong, Shekhar Borkar, David Bernholdt, rabuch2@illinois.edu, bill@feiereisen.net, cnewburn@nvidia.com, Piotr Luszczek, liakhdi@ornl.gov, Muthu Baskaran, jesmin.jahan.tithi@intel.com, slolivi@sandia.gov, hcedwar@sandia.gov - Carter, fuchst@nm.ifi.lmu.de - Toby, rbbrigh@sandia.gov - Ron
Signed up, but seemed not to make it: timothy.g.mattson@intel.com, schulzm@llnl.gov, oscar@ornl.gov[conflict], mbauer@nvidia.com, romain.e.cledat@intel.com, aiken@cs.stanford.edu, mfarooqi14@ku.edu.tr, lopezmg@ornl.gov, Benoit Meister, vgrover@nvidia.com, kelly.a.livingston@intel.com, alexandr.nigay@inf.ethz.ch, matthieu.schaller@durham.ac.uk, manjugv@ornl.gov, esaule@uncc.edu, schandra@udel.edu, cychan@lbl.gov, gshipman@lanl.gov, mgarland@nvidia.com, vsarkar@me.com, Didem Unat, maria.garzaran@intel.com, john.feo@pnnl.gov, mike.chu@amd.com, timothee.ewart@epfl.ch, jim@ks.uiuc.edu, n-maruyama@acm.org, pcicotti@sdsc.edu, kk13@rice.edu, srajama@sandia.gov
Kickoff, Dec. 20, 2016
Agenda
- Welcome: Wilf Pinfold
- Overview, purpose
- Approach
- Wiki explanation
- Next steps
- Feedback, expression of interest
Participants (33) included
BillF, CarterE, DavidR, Erik, JimP, KamilH & RomanW, PatrickA, PietroC, SenT, ShekharB, CJ, WilfP, StephenJ, XinchenG, TimM, RomainC, OscarH, AlexandrN, VinodG, KathK, Ashwin Aji, JoshF, GalenS, ManjuG, PallS, MariaG, ... See calendar entry, if you signed up
Discussion
- Glossary suggested by Tim, try not to invent new definitions
- Report suggested by Oscar - summary of usage cases could be useful for DoE
- How do we keep from getting fragmented? (Tim) Try to bring the community together by focusing on common requirements (Wilf)
- Start with usage models, requirements, provisioning constraints, rather than comparing and contrasting specific implementations
- We have data and experience to share
- Looking to have a phone meeting 3rd Tue each month at 9am PST; some here had standing conflicts; Wilf to try a Doodle poll
- Time scale, involvement, outputs?
- Are we sold on async tasking? Driven more by efficiency on HW? (Shekhar) Yes (Oscar) Who needs it for what? We need compelling examples of where mainline DoE apps need it. (Dave Richards) Clever use of MPI goes a long way (Tim)
- MPI: resilience not well addressed (Wilf) Comparison with MPI is inappropriate, tasking can be done on top of MPI, e.g. two-hot, accelerated MD. It's about the benefit of a computational model, which helps some and not others. (Galen) Tim agrees that MPI is low-level runtime.
- Interesting to identify a set of apps that embody tasking, and understand why they chose that model (Galen) Sounds like a potential value proposition (Shekhar).
- Characteristics: granularity of tasks - the finer the granularity the less portable the solution, explicit vs. implicit control (DaveR) If task relationships can be described, it can become more portable (Stephen) How will decomposition happen - expert, compiler, runtime? (DaveR)
- How do we make this applicable to large, portable code bases, enabling productivity? Where does the tasking model emerge? (DaveR)
- What does it mean to have an async environment, what are the critical features? (Josh)
- The way to resolving differences at various levels may lie in hierarchy (Kath) Strongly agree with hierarchy (Tim)
- Strongly agree with a bottom up approach, with a hierarchical perspective (Tim)