# Multiform Logical Time for MeMo-Codesign Robert de Simone Inria Sophia-Méditerranée & Université Code d'Azur ## INTRODUCTION ## Today 14th ACM-IEEE International Conference on Formal Methods and Models for (Embedded) System Design Until 2013 was Formal Methods and Models for (Hardware-Software) Codesign - a) Hardware Design has traditional Models for Synthesis (up to a point) - b) System Engineering has its methodologies and diagrams But Software Modeling and Formal Methods do not really match mainstream Software Engineering practices So <u>what is relevant about Software Modeling for Embedded/CPS system design?</u> Or what <u>could be</u>? ## **OUTLINE** - 1. System-level, Model-driven, Platform-Based design - 2. Around Hardware CAD and Synthesis - 3. Around Software Engineering (MoCs vs MoPs) - 4. A Clock Constraint Specification Language - 5. Conclusion and discussion MemoCode 2016 # System-level, Model-driven, Platform-Based design Multiviews, Cyber-Physical and Systems of Systems ## Y-Chart design approach Also branded as: Application / Architecture Adaptation (AAA) where adaptation means optimized mapping, and mapping covers allocation in space and scheduling in time Actually used for design space exploration in distinct engineering purposes, at various modeling levels. ## **Targeting Hardware: Platform-Based Design** Virtual Platforms and Virtual prototypes assembled by Hardware Architects. Application model is here *only* to provide typical use cases (booting Linux, one-day battery life) only to exercice the platform and accumulate non-functional information (typically by simulation in SystemC) the version of Alberto Sangiovanni-Vincentelli (UC Berkeley + MARCO Giga-Scale Silicon Research Center ## **Application / Architecture Adaptation (AAA)** Here mapping corresponds to abstract compilation onto parallel heterogeneous architectures. Typically used in **Real-Time Scheduling and Reactive programming**, to get logical timing guarantees Q: can clever mapping make a non time-predictable platform act more predictably? ## A two-storey approach First Floor Requirements **Provisions** Application **Architecture** model model Full platform Complex appli mapping Regular Library kernels Core or thread **Processing** mapped Task Element The AAA approach cannot span the complete design Predicable WCET and WCCT computed at a component level (which?) Working at Ground-level more messy (less abstract) needs to be applied on regular subcases only. Level separation may fluctuate to adjust the approach **Ground Floor** Inria\_ ## Adjust modeling to analysis (and vice-versa) What we want: Logical clock constraints + simple formulas between non-functional values Automatic solver or abstract (regular) simulation for optimization Before we can get this (in a realistic fashion?) we need more feedback How are similar things done elsewhere? Can we borrow, compare, contradict??? Let's embark on a Tour. ## research communities A hardware/software stack System Engineering Formal Methods and Concurrency Theory (MoCCs) (Domain Specific) Language design Parallel compilation Runtime execution and Optimization Simulation and Worst-Case Execution Time Hardware Abstraction Layer ## **Example model-driven platform-based AAA environments** ## Early system-level design stages (specification) - SysML/MARTE (AADL) - ARCADIA / CAPELLA - Amalthea ## Hardware Virtual platforms (MoCs and SoCs) - Synopysys MCO Platform Architect - MetroPolis - StreamIt/Raw (Tilera) - SDF3/Aelite - ForSyDE/Nostrum - SigmaC/Kalray ## **Real-Time Scheduling** - AADL (SysML/MARTE) - SynDEx Much more often, the combination remains implicit... ## System Engineering Design Flow: Arcadia / Capella example requirement elicitation functional specification refinement architectural specification refinement ## System Engineering Design Flow: Arcadia / Capella example Allocation made by user, on software and hardware architectur Quality of allocation evaluated by computing simple cost functions: **Excel-like spreadsheets mostly** formulas ←→ constraints #### Example: - (money) cost - mass - reliability (fault tolerance) - security (mixed-criticality) But what if dynamics involved? Mode&State changes ## SysML/MARTE SysML parametrics: formulas expressing physical laws (or CPS ones) ## **AADL** User provides allocations Analysis tools compute end-to-end latency and other performance measurements ## Example: Amalthea (Bosch et al) ## **Example: Synopsys Platform Architect** Many other attempts: Metropolis StreamIt/Raw/Tilera SDF3/Aelite ForsyDE/Nostrum SigmaC/Kalray . . ## Lessons Applications and Architectures should be independently described, then (only) fitted together: - an application may be mapped to multiple execution platforms - a SW/HW platform is versatile and supports many application At high-level, simple cost function formulas may be in order, at lower-level a more dynamic simulation relation needed, and if applications have static control the difference is less Full control over the (closed) system is assumed. In our case: **High-Performance or Real-Time Embedded Programming (HPeC)** 2 Around Hardware Computer-Aided Design Sous-titre facultatif ## A tradition of models Programmer View (SW bit-accurate types) ### SystemC #### **Transaction-Level** - accurate time - approximate time - loosely timed simulation modes Cycle-Accurate (SW bit-accurate types) VHDL, Verilog RTL (HW data types) Logical synthesis Logical gates (discrete Boolean) Spice Electric (continuous) New issues in modeling and simulation ## An example many-core time-deterministic processor ## MPPA®-256 Integrated Manycore Processor Core ## Cluster Processor - 5-issue VLIW architecture - Predictability & energy efficiency - FPU: 32bits / 64 bits IEEE 754 - MMU for rich OS support - 16 + 1 cores - NoC Tx and Rx interfaces - 2 MB of shared memory - 2 NoC to connect clusters and I/O - 47MB of on-chip Memory - 4x Quad core SMP on each of the four I/O subsystems #### Execution time predictability - "Timing Compositional Core", no timing anomalies (Wilhelm) - Cluster configuration where each memory bank accessed by 1 core - Network calculus applied to the Data NoC to bound transfer latency (nría ## A large audience example ## **Knights Landing Overview** Chip: 36 Tiles interconnected by 2D Mesh Tile: 2 Cores + 2 VPU/core + 1 MB L2 Memory: MCDRAM: 16 GB on-package; High BW DDR4: 6 channels @ 2400 up to 384GB IO: 36 lanes PCIe Gen3. 4 lanes of DMI for chipset Node: 1-Socket only Fabric: Omni-Path on-package (not shown) Vector Peak Perf: 3+TF DP and 6+TF SP Flops Scalar Perf: ~3x over Knights Corner Streams Triad (GB/s): MCDRAM: 400+; DDR: 90+ Source Intel: All products, computer systems, dates and figures specified are preliminary based on current are subject to change without notice. KNL data are preliminary based on current expects without notice. 1Binary Compatible with Intel Xeon processors using Haswell L numbers are based on STREAM-like memory access pattern, estimated based on internal Intel analysis ## An example Processor for car applications ## Built for real-time determinism ## big. LITTLE alternative mappings What form of info does one need to organize scheduling? (cf. upcoming talk by Emilien Kofman) Inria ## Big issues in practice are - Overlapping communications with computations - Dealing with different (physical) clock speeds Processor cores run faster than bus/NoC, than external memory Dealing with performance and low-power (and temperature) Different clock domains and power domains dark silicon, DVFS... ## Ambition is to make the architectural platform model time-predictable (as much as possible) - Either by supposing the real architecture is - Or by using it in a way that computations and communications fit inside their alloted processing and interconnect resources Avoiding cache misses Avoiding media contention Requires again static, data-independent control in applications: static planning ## Simple relations between dimensions, here again? (Hardware Architects use Excel for Timing-Closure )... MemoCode 2016 ## Sketch: just enough SoC structure (to support annotation and build constraints) # 3 ## **Around Software Engineering** **Programming Models for parallel processing** MemoCode 2016 ## **Parallel Programming Models** ## OpenMP (shared memory) - 1. Annotations to instruct the compiler on potential parallelism(s): - 2. Parallel for-loops, regions, synchro barriers (ingredients of static oontrol programs) - 3. With successive versions, always more annotation types (simd, target device) - 4. Also more and more ways to inquire the platform about its dimensions and set up affinities - → ways to instruct "some" mapping "by hand" ## OpenCL (data parallelism - 1. Programs split as sets of kernels/tasks, to be applied in a data-parallel fashion - 2. Same remark as for OpenMP (4) above ## MPI (distributed memory task parallelism and streaming) - 1. Networks of (parallel) Processes with message-passing - 2. Description of interconnect topology for platform (regular graphs) - 3. Placement done by compiler (obscure, may use affinity) In all cases ,generality (no global static control assumption) is desired. Some mapping (affinity) and scheduling always done, but without any search for optimality. No orthogonality (architecture description hinted inside application) ## **OpenMP** Multiprocessing - Originally, annotation pragmas to indicate which for-loops could be executed in parallel (if distinct iterations dataindependent) - Supposedl equivalent to sequential form, and shared memory (unlike MPI) - Issue: no real check that pragmas are correct (remember polyhedral model a few foils ago). - New in v4: Tasks (so that scheduling is <u>really</u> dynamic); goes again the idea of planning before-hand (compilation) - More and more pragmas (simd, target...) are letting the actual (supposed) architecture crawl into the program description (no orthogonality) Appli should first be archi agnostic, then only made archi aware by compilation #### Runtime Library Routines for C/C++ Execution environment routines affect and monitor threads, processors, and the parallel environment. The library routines are external functions with "C" linkage. #### **Execution Environment Routines** #### omp\_set\_num\_threads [3.2.1] [3.2.1] Affects the number of threads used for subsequent parallel regions not specifying a num\_threads clause, by setting the value of the first element of the nthreads-var ICV of the current task to num\_threads. void omp\_set\_num\_threads(int num\_threads); #### omp\_get\_num\_threads [3.2.2] [3.2.2] Returns the number of threads in the current team. The binding region for an omp\_get\_num\_threads region is the innermost enclosing parallel region. int omp\_get\_num\_threads(void); #### omp\_get\_max\_threads [3.2.3] [3.2.3] Returns an upper bound on the number of threads that could be used to form a new team if a parallel construct without a num\_threads clause were encountered after execution returns from this routine. int omp\_get\_max\_threads(void); #### omp\_get\_thread\_num [3.2.4] [3.2.4] Returns the thread number of the calling thread within the current team. int omp\_get\_thread\_num(void); #### omp\_get\_num\_procs [3.2.5] [3.2.5] Returns the number of processors that are available to the device at the time the routine is called. int omp\_get\_num\_procs(void); #### omp\_in\_parallel [3.2.6] [3.2.6] Returns true if the active-levels-var ICV is greater than zero; otherwise it returns false. int omp\_in\_parallel(void); #### omp\_set\_dynamic [3.2.7] [3.2.7] Enables or disables dynamic adjustment of the number of threads available for the execution of subsequent parallel regions by setting the value of the dyn-wor ICV. void omp\_set\_dynamic(int dynamic threads); #### omp\_get\_dynamic [3.2.8] [3.2.8] This routine returns the value of the dyn-var ICV, which is true if dynamic adjustment of the number of threads is enabled for the current task. int omp\_get\_dynamic(void); #### omp\_get\_cancellation [3.2.9] [3.2.9] Returns the value of the concel-var ICV, which is true if cancellation is activated; otherwise it returns false. int omp\_get\_cancellation(void); #### omp\_set\_nested [3.2.10] [3.2.10] Enables or disables nested parallelism, by setting the nest-wor ICV. void omp\_set\_nested(int nested); #### omp\_get\_nested [3,2,11] [3,2,10] Returns the value of the nest-var ICV, which indicates if nested parallelism is enabled or disabled. int omp\_get\_nested(void); #### omp\_set\_schedule [3.2.12] [3.2.12] Affects the schedule that is applied when runtime is used as schedule kind, by setting the value of the run-sched-vor ICV. void omp set schedule(omp sched t kind, int chunk size); kind: One of the following, or an implementation-defined schedule: omp\_sched\_static = 1 omp\_sched\_dynamic = 2 omp\_sched\_guided = 3 omp\_sched\_auto = 4 #### omp\_get\_schedule [3,2,13] [3,2,13] Returns the value of run-sched-var ICV, which is the schedule applied when runtime schedule is used. void omp\_get\_schedule{ omp\_sched\_t \*kind, int \*chunk\_size}; See kind for omp\_set\_schedule. #### omp\_get\_thread\_limit [3.2.14] [3.2.14] Returns the value of the thread-limit-var ICV, which is the maximum number of OpenMP threads available. int omp\_get\_thread\_limit(void); #### omp\_set\_max\_active\_levels [3.2.15] [3.2.15] Limits the number of nested active parallel regions, by setting max-active-levels-var ICV. void omp\_set\_max\_active\_levels(int max\_levels); #### omp\_get\_max\_active\_levels [3.2.16] [3.2.16] Returns the value of max-active-levels-var ICV, which determines the maximum number of nested active parallel regions. int omp\_get\_max\_active\_levels(void); #### omp\_get\_level [3.2.17] [3.2.17] For the enclosing device region, returns the levels-wars ICV, which is the number of nested parallel regions that enclose the task containing the call. int omp\_get\_level(void); #### omp\_get\_ancestor\_thread\_num [3.2.18] [3.2.18] Returns, for a given nested level of the current thread, the thread number of the ancestor of the current thread. int omp\_get\_ancestor\_thread\_num(int level); #### omp\_get\_team\_size [3.2.19] [3.2.19] Returns, for a given nested level of the current thread, the size of the thread team to which the ancestor or the current thread belongs. int omp\_get\_team\_size(int level); #### omp get active level (3.2.20) (3.2.20) Returns the value of the active-level-wars ICV, which determines the number of active, nested parallel regions enclosing the task that contains the call. int omp\_get\_active\_level(void); #### omp\_in\_final [3.2.21] [3.2.21] Returns true if the routine is executed in a final task region; otherwise, it returns false. int omp\_in\_final(void); #### omp\_get\_proc\_bind [3.2.22] [3.2.22] Returns the thread affinity policy to be used for the subsequent nested parallel regions that do not specify a proc\_bind clause. omp\_proc\_bind\_t omp\_get\_proc\_bind(void); Returns one of: omp\_proc\_bind\_false = 0 omp\_proc\_bind\_true = 1 omp\_proc\_bind\_master = 2 omp\_proc\_bind\_close = 3 omp\_proc\_bind\_spread = 4 #### omp\_get\_ num\_places [3.2.23] Returns the number of places available to the execution environment in the place list. int omp\_get\_num\_places(void); #### omp\_get\_place\_num\_procs [3.2.24] Returns the number of processors available to the execution environment in the specified place. int omp\_get\_place\_num\_procs(int place\_num); #### omp\_get\_place\_proc\_ids [3.2.25] Returns the numerical identifiers of the processors available to the execution environment in the specified place. void omp\_get\_place\_proc\_ids( int place\_num, int \*ids); #### omp\_get\_place\_num [3.2.26] Returns the place number of the place to which the encountering thread is bound. int omp\_get\_place\_num(void); #### omp\_get\_partition\_num\_places [3.2.27] Returns the number of places in the place partition of the innermost implicit task. int omp\_get\_partition\_num\_places(void); #### omp\_get\_partition\_place\_nums [3.2.28] Returns the list of place numbers corresponding to the places in the place-partition-var ICV of the innermost implicit task. void omp\_get\_partition\_place\_nums(int \*place\_nums); #### OpenCL API Reference #### The OpenCL Platform Layer The OpenCL platform layer implements platform-specific features that allow applications to guery OpenCL devices, device configuration information, and to create OpenCL contexts using one or more devices. Items in blue apply when the appropriate extension is supported. #### Querying Platform Info & Devices [4.1-2] [9.16.9] cl int clGetPlatformIDs (cl uint num entries, cl\_platform\_id \*platforms, cl\_uint \*num\_platforms) cl int clicdGetPlatformIDsKHR (cl uint num entries, cl\_platform\_id \* platfoms, cl\_uint \*num\_platforms) cl\_int clGetPlatformInfo (cl\_platform\_id platform, cl\_platform\_info param\_name size\_t param\_value\_size, void \*param\_value, size t\*param value size ret) param\_name: CL\_PLATFORM\_{PROFILE, VERSION}, CL PLATFORM {NAME, VENDOR, EXTENSIONS}, CL\_PLATFORM\_ICD\_SUFFIX\_KHR [Table 4.1] cl int clGetDeviceIDs (cl platform id platform, cl device type device type, cl uint num entries, cl\_device\_id \*devices, cl\_uint \*num\_devices) CL\_DEVICE\_TYPE\_{ACCELERATOR, ALL, CPU}, CL DEVICE TYPE {CUSTOM, DEFAULT, GPU} cl int clGetDeviceInfo (cl device id device, cl device info param name, size\_t param\_value\_size, void \*param\_value, size\_t \*param\_value\_size\_ret) param\_name: [Table 4.3] CL\_DEVICE\_ADDRESS\_BITS, CL\_DEVICE\_AVAILABLE, CL\_DEVICE\_BUILT\_IN\_KERNELS, CL\_DEVICE\_COMPILER\_AVAILABLE, DEVICE {DOUBLE, HALF, SINGLE} FP CONFIG, CL DEVICE ENDIAN LITTLE, CL DEVICE EXTENSIONS, CL\_DEVICE\_ERROR\_CORRECTION\_SUPPORT, CL DEVICE EXECUTION CAPABILITIES CL DEVICE GLOBAL MEM CACHE {SIZE, TYPE}. CL\_DEVICE\_GLOBAL\_MEM\_{CACHELINE\_SIZE, SIZE}, CL DEVICE GLOBAL VARIABLE PREFERRED TOTAL SIZE, CL DEVICE PREFERRED (PLATFORM, LOCAL, GLOBAL}\_ATOMIC\_ALIGNMENT, DEVICE GLOBAL VARIABLE SHARING, CL DEVICE HOST UNIFIED MEMORY, CL DEVICE IMAGE MAX {ARRAY, BUFFER} SIZE, CL\_DEVICE\_IMAGE\_SUPPORT, CL\_DEVICE\_IMAGE2D\_MAX\_{WIDTH, HEIGHT}, CL DEVICE IMAGE3D MAX {WIDTH, HEIGHT, DEPTH}, CL\_DEVICE\_IMAGE\_BASE\_ADDRESS\_ALIGNMENT, CL\_DEVICE\_IMAGE\_PITCH\_ALIGNMENT. CL\_DEVICE\_LINKER\_AVAILABLE, CL DEVICE LOCAL MEM {TYPE, SIZE} DEVICE MAX READ IMAGE ARGS CL DEVICE MAX WRITE IMAGE ARGS CL\_DEVICE\_MAX\_{CLOCK\_FREQUENCY, PIPE\_ARGS}, CL\_DEVICE\_MAX\_{COMPUTE\_UNITS, SAMPLERS}, CL DEVICE MAX CONSTANT {ARGS, BUFFER SIZE} CL DEVICE MAX {MEM ALLOC, PARAMETER} SIZE, CL DEVICE MAX GLOBAL VARIABLE SIZE, CL DEVICE MAX ON DEVICE (QUEUES, EVENTS), CL\_DEVICE\_MAX\_WORK\_GROUP\_SIZE, CL\_DEVICE\_MAX\_WORK\_ITEM\_{DIMENSIONS, SIZES}, CL\_DEVICE\_MEM\_BASE\_ADDR\_ALIGN, CL\_DEVICE\_NAME CL\_DEVICE\_NATIVE\_VECTOR\_WIDTH\_{CHAR, INT}, CL DEVICE NATIVE VECTOR WIDTH (LONG, SHORT). CL\_DEVICE\_NATIVE\_VECTOR\_WIDTH\_{DOUBLE, HALF}, CL\_DEVICE\_NATIVE\_VECTOR\_WIDTH\_FLOAT, CL\_DEVICE {OPENCL C VERSION, PARENT DEVICE}. CL\_DEVICE\_PARTITION\_AFFINITY\_DOMAIN, DEVICE PARTITION MAX SUB DEVICES, CL DEVICE PARTITION {PROPERTIES, TYPE} CL\_DEVICE\_PIPE\_MAX\_ACTIVE\_RESERVATIONS. CL\_DEVICE\_PIPE\_MAX\_PACKET\_SIZE, CL\_DEVICE\_{PLATFORM, PRINTF\_BUFFER\_SIZE}; CL\_DEVICE\_PREFERRED\_VECTOR\_WIDTH\_{CHAR, INT}, CL\_DEVICE\_PREFERRED\_VECTOR\_WIDTH\_DOUBLE, CL DEVICE PREFERRED VECTOR WIDTH HALF, CL\_DEVICE\_PREFERRED\_VECTOR\_WIDTH\_LONG, CL\_DEVICE\_PREFERRED\_VECTOR\_WIDTH\_SHORT, CL\_DEVICE\_PREFERRED\_VECTOR\_WIDTH\_FLOAT, CL\_DEVICE\_PREFERRED\_INTEROP\_USER\_SYNC, CL\_DEVICE\_PROFILE, CL\_DEVICE\_PROFILING\_TIMER\_RESOLUTION, CL\_DEVICE\_SPIR\_VERSIONS, CL\_DEVICE\_QUEUE\_ON\_DEVICE\_PROPERTIES, CL DEVICE QUEUE ON HOST PROPERTIES, CL\_DEVICE\_QUEUE\_ON\_DEVICE\_MAX\_SIZE, CL\_DEVICE\_QUEUE\_ON\_DEVICE\_PREFERRED\_SIZE, CL DEVICE {REFERENCE COUNT, VENDOR ID}, CL\_DEVICE\_SVM\_CAPABILITIES, CL\_DEVICE\_TERMINATE\_CAPABILITY\_KHR, CL DEVICE {TYPE, VENDOR}. #### CL\_{DEVICE, DRIVER}\_VERSION Partitioning a Device [4.3] cl\_int clCreateSubDevices (cl\_device\_id in\_device, const cl\_device\_partition\_property \*properties, cl\_uint num\_devices, cl\_device\_id \*out\_devices, cl uint \*num devices ret) properties: CL DEVICE PARTITION EQUALLY. CL\_DEVICE\_PARTITION\_BY\_COUNTS CL DEVICE PARTITION BY AFFINITY DOMAIN cl\_int clRetainDevice (cl\_device\_id device) cl int clReleaseDevice (cl device id device) #### Contexts [4,4] #### cl context clCreateContext ( const cl\_context\_properties \*properties, cl\_uint num\_devices, const cl\_device\_id \*devices, void (CL\_CATLBACK\*pfn\_notify) (const char \*errinfo, const void \*private\_info, size t cb, void \*user data), void \*user data, cl int \*errcode ret) #### The OpenCL Runtime API calls that manage OpenCL objects such as command-queues, memory objects, program objects, kernel objects for \_\_kernel functions in a program and calls that allow you to enqueue commands to a command-queue such as executing a kernel, reading, or writing a memory object. #### Command Queues [5.1] #### cl command queue clCreateCommandQueueWithProperties ( cl\_context context, cl\_device\_id device, const cl command queue properties \*properties, cl\_int \*errcode\_ret) properties: [Table 5.1] CL QUEUE SIZE. CL QUEUE PROPERTIES (bitfield which may be set to an OR of CL\_QUEUE\_\* where \* may be: OUT\_OF\_ORDER\_EXEC\_MODE\_ENABLE, PROFILING ENABLE, ON DEVICE DEFAULT) cl\_int clRetainCommandQueue ( cl command queue command queue) #### cl\_int clReleaseCommandQueue ( cl command queue command queue) #### cl int clGetCommandQueueInfo ( cl command queue command queue, cl\_command\_queue\_info param\_name, size\_t param\_value\_size, void \*param\_value, size\_t \*param\_value\_size\_ret) param\_name: [Table 5.2] CL\_QUEUE\_CONTEXT, CL\_QUEUE\_DEVICE, CL\_QUEUE\_SIZE, CL QUEUE REFERENCE COUNT, CL\_QUEUE\_PROPERTIES ## **OpenCL** y and explicit data transfer for GPU verfeed the processing elements, and nave arrived (finger-crossed) computation and communication ## Message-Passing Interface (MPI) - Highly used for High-Performance Computing (grids, clusters) - Mostly message-passing and synchronization primitives, to be applied with any general-purpose sequential language (in fact Fortran or C++ generally) - Point-to-point or collective (broadcast, all-to-all) communications - Means to describe virtual network topology as graph - Big issue is to assign processes to processors (statically?), and to map the virtual communication onto the real interconnect infrastructures - Both prominent frameworks (OpenMPI and MPICH) have fancy dedicated library, commercial offers are made by specific hardware vendors - Mapping done at runtime, but what about static control processes and static mapping at compile time? ## Communicators with Topology ## **MPI** quick card (excerpt) - •Create with cartesian topology. (§ 6.5.1) - •int MPI\_Cart\_create (MPI\_Comm comm\_old, - •int ndims, int \*dims, int \*periods, int - •reorder, MPI\_Comm \*comm\_cart) - •Suggest balanced dimension ranges. (§ 6.5.2) - •int MPI\_Dims\_create (int nnodes, int - •ndims, int \*dims) - •Determine rank from cartesian coordinates. (§ 6.5.4) - •int MPI\_Cart\_rank (MPI\_Comm comm, int - \*coords, int \*rank) - •Determine cartesian coordinates from rank. (§ 6.5.4) - •int MPI\_Cart\_coords (MPI\_Comm comm, int - rank, int maxdims, int \*coords) - •Determine ranks for cartesian shift. (§ 6.5.5) - •int MPI\_Cart\_shift (MPI\_Comm comm, int - direction, int disp, int \*rank\_source, - •int \*rank\_dest) - •Split into lower dimensional sub-grids. (§ 6.5.6) - •int MPI\_Cart\_sub (MPI\_Comm comm, int - \*remain\_dims, MPI\_Comm \*newcomm) - Related Functions: MPI\_Graph\_create, MPI\_Topo\_test, - •MPI\_Graphdims\_get, MPI\_Graph\_get, - •MPI\_Cartdim\_get, MPI\_Cart\_get, - •MPI\_Graph\_neighbors\_count, MPI\_Graph\_neighbors, - •MPI\_Cart\_map, MPI\_Graph\_map ## **Compilation and Runtime Execution** - HPC Runtime: StarPU, XKaapi, RMC - Parallel compilation & Polyhedral model : ClooG, Graphite, R-Stream - (often produces OpenMP-ish code) These always operate by trying to exhibit maximal paralleism on the application side (and the kind of parallelism wanted first), then try to adjust it dynamically to a "phntom" architecture only reflected by a few features showing avaibility of simple resources #### Can we do better with a real architecture model? (Of course static control restrictions for compile-time decisions) # Data/Task parallel levels: Nested For-Loop programs with affine bounds ## Both code and model: - Strict separation between indexes and other variables (only indexes impact control) - Other variables are array locations (referenced by affine expressions of indexes) May produce explicit control/data flowgraphs to extracy potential parallelism (from dependences between operations) - Extended: precise but untractable - Reduced: quality of solution depends on information preserved - Dependence levels (Allen-Kennedy) - □ Direction vectors (Lamport, Wolf & Lam) - Polyhedral models (Feautrier) In all 3 cases, solutions found are expressed by regular (affine) scheduling relations between operations: logical clocks! ## Silly simple example for $$j = 1$$ to $N$ for $i = 1$ to $N$ $a(i+1,j+1)=a(i, j+1)+a(i+1,j)$ end end $$step(a(i,j)) = i+j-1$$ #### Task streaming, software pipelining #### **Tiling** •...to adjust it to match the size of a single Processing Element •Locality and shared memory space makes performance highly predictable #### Typical (and useful) library kernels - Linear algebra: LinPack ScalaPack (Blast),... - Convolutions Fast Fourier Transform: FFTW, Spiral - Neural Networks (only execution, not training part): a mix of both - Signal Processing (filters): Halide - Stencil algorithms for numerical analysis ./ scientific computations This type of effort lead to **Domain-Specific Languages** Restricted expressivity: linear or DAG fillter streams Transformations to exhibit some form of parallelism (data or vectorize) Scheduling often not fully automatic, or obtained by progressive selection (Dynamic Programming, Auto-tuning, or "Auto-Scheduler"...) ...although algorithm stream definitions are often recursive in terms of the data (multi-array) size, so that with a simple core architecture target model it is easy to compute which size fits in (and then loop sequentially to get results). (Application ground floor resembles architecture groundfloor) Inria #### **Fast Fourier Transform** Can easily be decomposed (recursively up to intermediate permutations) along sizes 2<sup>N</sup> Given a target processing core (or thread): - its local register count 2<sup>R</sup> - its local memory 2<sup>M</sup> - its number of vectorized (simd) or parallel (GPU) ALUs 2<sup>C</sup> one can compute by proportions a pattern adjusting the available parallelism (and sequential loops around) #### **Convolutional Neural Networks** source: http://deeplearning.net/tutorial/lenet.html Task streamin for layers + potential data parallelism according to locality and tile overlap #### Dream... #### When - Application restricted to simple filter DAG/pipeline kernel - Data value organization well understoood (and statically known) - Simple processor targeted (single core, processing element, or thread) - Shared memory, low-cost communication (data transfer) - Mutual correspondance of sizing between appli and archi, and simple cost function formulas, Then the ground floor mapping of library kernels to core should deliver time-predictable results (WCET) Then one could consider optimizing full (static control) applications on full platform, designing time-predicatble communication patterns. #### **ConvNets on Kalray MPPA** source: B. Ganne (Kalray), Machine Learning on Multicores, NeuroStic Days 2015 (in French) #### FFT on Intel Haswell (Mohamed Bergach PhD thesis) #### SIMD CPU cores (4 cores) - each core enough L1/register memory to execute 4 stages at once : 2 256-bit SIMD ALUs, 16 256-bit registers, 128 floating-point numbers - FFT 1064 (radix 10, 10 stages) performed as iteration 4+4+2 - in C++ with simd intrinsics #### GPU Processing Elements (40) - each PE enough L1/register memory to execute 3 stages at one 2 128-bit ALUs, 2K bit registers, 64 floating-point numbers - FFT 1024 executed as iteration 3+3+3+2 - in OpenCL To force allocation one needs affinity (dismiss hyperthreading etc), currently not obvious In the end one gets a libray of mapped functions with reasonably accurate cost, to be used by upper floor analyses (nria\_ # **Concurrent Models of Computation and Communication** **Process Networks, Scheduling and Routing** MemoCode 2016 #### SDF (synchronous data-flow process networks) - Weighted Marked Graphs (conflict-free subset of Petri Nets) - Distinct issues for acyclic graphs and strongly connected components - First-level self-timed semantics - Second stage: optimal scheduling exist, assuming no parallelism limitations - Static schedules lead to regular form (the binary word of activations of any node is of specific form - all nodes adopt the same rate #### Regular schedules in SDF The noticeable thing here: (static) schedules can be represented as regular infinite binary words u.(v), u initial, v repeated where - 1 at location n means active at step n - 0 at location n means idle at step n While very primitive, it corresponds to the way ASAP schedule assignment goes Later it can be turned into Gantt Charts, when sequences get long No architectural resource constraints considered here: ideal parallelism #### Regular schedules in SDF Now what if we map (here superpose) to an execution platform? - Cost of communication is no longer uniform (nuMA) - Computations or communications sharing a single resource may need to be serialized (multitasking) But the principles of asap scheduling stays the same! Even more constraints may be applied and task fission/fusion also... To what extent can we represent in this abstract setting the phenomena encountered earlier? #### Adding predictable deterministic control to SDF - add a switch node), but with internal switching condition (as in Kahn Process Networks → functional determinism, latency-insensitive design - done in Cyclo-Static Data-Flow and StreamIt graphs (but without initialization patterns - Our own view: add regular **switching/routing** patterns in the <u>same</u> flavor as schedule words activating conditions - → KRG process network model - Study (equational, algebraic) graph transformations that preserve functionality (self-timed) - these transformations made to change the buffer/data dependencies to adjust to a given platform communication topology graph (eg, NoC) #### StreamIt example roundrobin branch dispatcher and duplicate parallel output confused as "splitters" (unfortunate?) drawn from William Thies PhD dissertation (MIT, 2009) #### Silly little KRG example #### **Transitivity of Selects** ### on/when operators ``` (0.u) on v = 0.(u on v) (1.u) on x.v = x.(u on v) (x.u) when (0.v) = u when v (x.u) when (1.v) = x. (u when v) ``` #### on effects $$(0.u)$$ on $v = 0.(u$ on $v)$ $(1.u)$ on $x.v = x.(u$ on $v)$ #### when effects most meaningful whenever u subclock of v (x.u) when (0.v) = u when v (x.u) when (1.v) = x. (u when v) #### **Transitivity of Merges** #### **Selects up across Merges** #### **Sharing vs disorder** #### Normal forms (point-to-point links) **ComRed meeting** 19/11/2016 60 #### Lessons Regular scheduling arises naturally as solution space for optimal results - Classical scheduling of Process Networks - Scheduling of nested For-loops programs wth affine bounds - • Regular switching patterns can match the expressive level of description - represent the transformations in data transfer/communication pattern - play with the boundaries between data- and task- parallelism #### Task streaming level: Scheduling Theory Original problem is to schedule independant tasks with hard deadline requirements on a mono- or multi-processor - static scheduling of periodic tasks: Rate-Monotonic Analysis, Deadline-Monotonic - dynamic schedling of periodic/sporadic preemptible tasks: Earliest-deadlinefirst, least-laxity first,... Simplified assmptions lead to positive methods/results Then accounting for communications, context-switches in preemption, and further issues make life harder (and literature bigger) First step is to specify in clear mathematical formulas what are the real-time constraints # 5 A Clock Constraint Specification Language Declarative specifications for the top floor #### **Multiform Logical Time** Main idea is simple: every (pure) event that occurs repeatedly can be used as a *logical clock* logical clock = sequence/flow of ticks/signals = activation condition #### examples: - (physical) clock cycles in a modern embedded processor - ignition in a car (4 times each turn of engine, no matter the speed) - sensor event detection (such as infra-red cells, gesture detection, lidars,..) - by the way, regular clockwatch physical time can also be considered logical... We claim Multiform Logical Time is natural and invaluable at design/specification time • Expanding multiform logical time to uniform physical time depends on implementation conditions #### Clock Constraint Specification Language) CCSL - Meant to express constraints and properties in (multiform) logical time - Targets the platform-based design/AAA framework - Inspired from Synchronous (actually polychronous) languages - Inspired from Classical and Real-Time Scheduling notions. - Not usually stand-alone (extracts the ordering relations between events whose meaning is part of a larger specification) - Formal syntax to reason about logical time relations (including simultaneity) - Concerns for expressivity and decidability #### Two natural partial orders #### a subclock b (a $\subseteq$ b) - inspired from synchronous (polychronous) languages - inpired from shapes of regular static scheduling / parallel allocation - inspired from hardware and system design #### a faster than b (a $\leq$ b) - inspired from real-time scheduling - inspired from Timed Automata - • used either in an imperative or declarative fashion each tractable individually, but only the combination truly expressive ... and problematic #### Brief recap on Synchronous languages 1. Signal /Lustre Logical clocks not syntactic elements in Lustre, no totally first-class in Signal, explicit activation condition in Scade #### **Brief recap on Synchronous languages 2. Esterel** pause (next instant) start P when S P;Q abort P when S P || Q stop P when S present S then P else Q suspend P when S if ?S then P else Q activate P when S emit S signal S in P end + sequential computations on data variables Signals are first-class citizen logical clocks They represent shared variables with precise constructive consensus semantics #### An example program loop await A; await immediate B emit C(A?) end C= A sampledOn B - Logical clocks <u>are</u> the event-driven control structures (with registers that are latching instants) - new clocks from old clock (only present-else case is problem, but usually triggered by some previous clock/register) - Simple Mealy machine interpretation for each construct - "Sensible" clocks should tick infinitely often #### A second example (weak)every B do (weak)abort await B; emit Err when n\*A done end B SporadicOn A - Constraints seen as Observers - Assume/Guarantee #### **CCSL** in a foil a, b, c .. clocks, w mask (infinite regular binary word) #### expressions #### relations/constraints | a union b | only way to « speed up » rate (more ticks) | a = b | |--------------|--------------------------------------------|--------------------| | a inter b | (less ticks, may become finite) | a≠b | | a minus b | (less ticks, may become finite) | a≤b | | fastest(a,b) | min of timings | a < b | | slowest(a,b | max of timings | a # b | | sync(a,b) | when timings match | a AlternatesWith b | a filteredBy w subsampling, solution-oriented (k-periodic, regular) a sampledOn b sporadic sampling #### **Results (extracts)** Each individual constructs translates into a: transition-labelled extended Büchi Mealy transition systems. Not always Finite-State machine, also integer counters (cf fastest(C,D),or $C \le D$ ) Full system is a parallel product of such t-E Büchi Mealy machines Too Expressive (can encode Petri Nets with inhibitor arcs, and similar arguments as classical 2 counter machines) → Turing-Complete But many sources of sensible restrictions ensuring decidability #### **Encoding asynchronous Petri Nets** #### Clocks - o t\_p, p\_t for each t, p connected - t for each t (firing the transition) - p\_in and p\_out for put/get token to/from that place $$\mathbf{t} = \mathbf{inter}(\mathbf{p}_{\mathbf{t}}, p \text{ in } \bullet \mathbf{t}) \cap \mathbf{inter}(\mathbf{t}_{\mathbf{p}}, p \text{ in } \mathbf{t} \bullet)$$ t1 # t2 ... # tn (transition are exclusive in time, not to mix tokens) (tokens consumed only after being produced) Note: in general Petri Nets, weights can be replaced by multiple places) #### **Encoding PN inhibitor arcs** p\_in # p\_out void\_q = fastest((q\_out shiftedBy init\_Tok\_q),q\_in) inter q\_out ticks when place becomes empty unvoid\_q = void\_q SampledBy q\_in ticks when place becomes occupied t SampledOn (void\_q union unvoid\_q) = t SampledOn unvoid q void alternates with unvoid, these main equation states that t occurs after the void Wemocode 2016 Univoid #### Schedulability issue: simplest example • $B \subseteq A$ A filtered by (01) = B filtered by (01) Here in fact B=A, since all even occurrences must coincide (while odd occurrences seem free), B can never catch up with any delay. But it may be noticed too late in plain simulation (without backtracking) #### Many tractable subsets ``` If the clock dependency is a forest (no reconvergence) If it is a DAG but without synchronizing operators (inter,..) in the middle More to be defined ``` 6 Conclusions ## Thank you www.inria.fr # 6 Platform-Based Design and Y-Chart methodology Sous-titre facultatif