Heterogeneous Chip Multiprocessor Design
(HTCMP)
People:
Asst. Prof. Ozcan Ozturk
MS. Student Dilek Demirbas
MS. Student Ismail Akturk
MS. Student Mahmut Sami Dikici
Ph.D. Student Naveed Ul Mustafa
Funding: Funded by
European Commission (EC) FP7-PEOPLE MARIE CURIE ACTIONS
Duration: 2009 – 2013
Amount: €100,000
Abstract:
Increasing
complexity of applications and their large dataset sizes make it imperative to
consider novel architectures that are efficient from both performance and power
angles. Chip Multiprocessors (CMP) are one such example where multiple
processor cores are placed into the same die. As technology scales, the
International Technology Roadmap for Semiconductors (ITRS) projects that the
number of cores in a chip multiprocessor (CMP) will drastically increase to
satisfy performance requirements of future applications. A critical question
that needs to be answered in CMPs is the size and strength of the cores.
Homogeneous chip multiprocessors provide only one type of core to match these
various application requirements, consequently not fully utilizing the
available chip area and power budget. The ability to dynamically switch between
different cores, and power down unused cores gives a
key advantage to heterogeneous chip multiprocessing. One of the challenging
problems in the context of heterogeneous chip multiprocessor systems is the
placement of processor cores and storage blocks within the available chip area.
Focusing on such a heterogeneous chip multiprocessor, we address different
design decision problems. First, decide on the memory hierarchy design and its
distribution within the available chip area. Second, distribute effectively the
available area among the processor cores and the memory blocks (cache). Third,
select the optimum number of processors and their types among the available
processor types. Fourth, perform thread and data distribution within the given
processor and memory design. Fifth, evaluate improvements brought by advanced
techniques, such as 3D designs. Our past experience and preliminary results indicate
that the proposed approach will be able to generate promising results.
Report Period II:
PUBLISHABLE
SUMMARY
HTCMP aims at
designing efficient and powerful heterogeneous chip multiprocessors (HTCMP). A
challenging problem in the context of heterogeneous chip multiprocessor systems
is the placement of processor cores and storage blocks within the available
chip area. Focusing on such a heterogeneous chip multiprocessor, we address
different design decision problems: (i) effective
distribution of the available area among the processor cores and the memory
blocks (cache), (ii) memory hierarchy design (iii) selection of number of
processors and their types from the processor pool, (iv)
thread and data distribution (v) advanced techniques such as 3D designs.
Our main
objective is to make significant contributions towards the development of
compiler-based techniques for emerging and future HTCMPs. The software support
for such systems is lagging way behind current advancements at the circuit and architecture
levels. Effective compilation support for these architectures will make
programming them much easier, thereby helping scientists to port their
applications to these architectures. Outcomes of this research will be
beneficial to the computer architecture field in that it will reveal the types
of processor cores and memory components that are needed by the compiler for
achieving the best application adaptation under dynamically changing power,
performance and thermal conditions.
During the
second report period, we built over what we have done during the first report
period. The two major contributions during the first report period were 1) Distribute
the available area among the processor cores and the memory blocks and 2) Processor
selection. Our contributions in the second period were 1) Thread and data distribution, 2) Communication Reduction, and 3)
Advanced Optimizations. In the first part, we introduce an
application-specific heterogeneous Network-On-Chip (NoC)
design algorithm that considers the given constraints and generates a floorplan for the desired many-core. On the other hand,
second part aims at minimizing the communication costs of 3D NoC architectures. The third part proposes a
reliability-aware 3D NoC design by reducing the
inter-layer communications in a 3D NoC design.
During the
second reporting period, project resulted in two direct journal publication,
one conference publication, one poster and three indirect journal publications.
Moreover, three M.S. students and one PhD student were supported.
PROJECT OBJECTIVES FOR THE PERIOD
After the
first reporting period, we have worked on the three main objectives listed
below.
Thread and data distribution
In the design
of chip multiprocessors, architects and designers must decide how thread and
data distribution should happen to realize the desired chip. However, the immense
amount of variation in processing cores makes this decision tedious and error
prone. Thus, figuring out the types of processing cores to be used is of great importance.
Our algorithm identifies the processing cores that will be used in the design. Our
algorithm places the selected cores on a given chip area in a way that the
total latency occurring on the chip is minimized, while the given area is utilized
as much as possible. It should be noted that the regular NoC
topologies are inappropriate for such cases because heterogeneous architectures
have non-uniform sets of processing cores. Thus, an effective custom NoC is the key to achieve the desired performance of a heterogeneous
multi-core.
Communication Reduction
NoC architectures have been extended to the third dimension by the help of
through silicon vias (TSVs). 3D NoCs
have the potential to achieve better performance with higher scalability and
lower power consumption. Most of the related work on 3D NoCs consider homogeneous cores. While, 3D NoCs provide the aforementioned benefits, the best
utilization cannot be extracted without including heterogeneity. This is due to
the fact that every application (and different parts of an application) has
different characteristics. Enabling heterogeneity in 3D NoC
architectures will make it possible to match all these various requirements,
while keeping energy and heat consumption as minimum as possible. Since heat is
one of the most critical issues in 3D ICs, providing heterogeneity has the
potential to meet the requirements. A well known heterogeneous (asymmetric)
Chip Multiprocessor (CMP) example is IBM’s Cell
Processor, where 1 PPU (power processing unit) and 8 SPUs (synergistic
processing unit) are combined to perform more efficiently. It was shown that a
representative heterogeneous processor using two core types achieves as much as
a 63 percent performance improvement over an equivalent-area homogeneous
processor. This is mainly due to matching execution resources to application
needs effectively. One of the challenging problems in the context of 3D NoC heterogeneous chip multiprocessor systems is the
placement of processor cores within the available chip area. Focusing on such a
heterogeneous 3D NoC, we explored how different types
of processors can be placed to minimize data access costs.
Advanced Optimizations
Ability to stack
separate chips in a single package enables three-dimensional integrated
circuits (3D ICs). Heterogeneous 3D ICs provide even better opportunities to
reduce the power and increase the performance per unit area. An important issue
in designing a heterogeneous 3D IC is reliability. To achieve this, one needs
to select the data mapping and processor layout carefully. This paper addresses
this problem using an integer linear programming (ILP) approach. Specifically,
on a heterogeneous 3D CMP, it explores how applications can be mapped onto 3D
ICs to maximize reliability. Preliminary experimental evaluation indicates that
the proposed technique generates promising results in both reliability and
performance.
WORK PROGRESS AND ACHIEVEMENTS DURING THE PERIOD
Along with
generation of custom NoC topology, there are two
other important concerns that affect the overall performance of a multi-core:
task scheduling and core mapping. Our work is complementary to task scheduling
and core mapping because their effectiveness is tightly coupled to the NoC that should be used. Our algorithm cooperates with
task-scheduling and core-mapping algorithms during the generation of the
desired NoC and the floorplan.
Although the given constraints belong to separate stages of the design process,
we take them into account as a whole. For example, maximization of area
utilization should be considered together with the latency concern. We target
application-specific heterogeneous NoC-based architecture
generation. We implemented our approach using a latency-aware
Least-Wasted-First (LWF) 2D bin-packing algorithm.
Global
interconnect problem has become more important with the increase in the number
of processor cores in chip multiprocessing. 3D designs and NoC
architectures have been unified as 3D NoCs to
overcome the interconnect scaling bottleneck. We try
to map heterogeneous processors onto the given 3D chip area with minimal data
communication costs. Our goal is to present an Integer Linear Programming (ILP)
formulation of the problem of minimizing data communication cost of a given
application. This is achieved through optimal placement of nodes in a 3D NoC. Our initial results
indicate that the proposed approach generates promising results within
tolerable solution times.
As the
technology shrinks, one of the challenging problems in the context of 3D NoC systems is reliability. Reliability of 3D ICs is
effected by both temperature and thermo-mechanical stress. This is especially
caused by the limited cooling capability between the layers. Specifically, vias become more and more sensitive and when the via fails to make proper connection, unwanted loss in
yield and decrease in reliability may occur. Reliability for 3D ICs have been explored from different angles. Through Silicon Vias (TSVs) are the most recent medium in stacking multiple
dies on a 3D IC. However, these vias become more
sensitive with higher temperatures that can be caused by more activity or
traffic. Since TSVs are bridges between layers, they are potentially more prone
to thermal stress. Therefore, reducing the TSV communication load has potential
of improving reliability. This work aims at increasing the reliability of an
application through effective mapping on 3D heterogeneous IC. Contribution of
the approach is in two folds: 1) An ILP formulation of the problem of
maximizing the reliability of a given application. This is achieved through
optimal placement of nodes in a 3D NoC.
2) Minimization of the communication cost between the nodes, thereby improving
both performance and energy consumption. ILP-based approach presented here
targets at reducing the amount of layer-to-layer communication on TSVs, while
keeping the overall communication overheads minimum.
DISSEMINATION ACTIVITIES
Directly-related
journal publications (specifically acknowledged the FP7 support):
Application-Specific Heterogeneous
Network-on-Chip Design, by Dilek Demirbas; Ismail
Akturk; Ozcan Ozturk; Ugur Gudukbay. The Computer Journal 2013; doi: 10.1093/comjnl/bxt011.
Reliability-Aware Heterogeneous 3D Chip
Multiprocessor Design, by Ismail Akturk and Ozcan
Ozturk. Accepted. Journal of Electronic
Testing: Theory and Applications.
ILP-Based Communication Reduction for
Heterogeneous 3D Network-on-Chips, Ismail Akturk
and Ozcan Ozturk. In Proc. of 21st Euromicro International Conference on
Parallel, Distributed and Networked-Based Processing (PDP'13), Belfast,
Northern Ireland, February 2013.
Reliability-Aware 3D Chip Multiprocessor
Design, Ismail Akturk and Ozcan Ozturk. In
Proc. of Manufacturable and Dependable Multicore Architectures at Nanoscale
(MEDIAN'12), Annecy, France, June 2012.
Indirect
journal publications (specifically acknowledged the FP7 support):
Compiler-Directed Energy Reduction Using Dynamic Voltage Scaling
and Voltage Islands for Embedded Systems, by O. Ozturk, M. Kandemir, and G.
Chen, IEEE Transactions on Computers (TC), Vol. 62, No. 2, pages 268-278,
February 2013.
Reducing Memory Space Consumption Through
Dataflow Analysis, by O. Ozturk, Computer Languages, Systems &
Structures, Volume 37, Issue 4, October 2011, pages 168-177.
Multicore Education Through
Simulation, by O. Ozturk, IEEE Transactions on Education (TE), Volume 54,
Issue 2, pages 203-209, May 2011.
Report Period I:
PUBLISHABLE
SUMMARY
After the
initial setup, we profiled the benchmarks and estimated their memory and
processing requirements. Based on these requirements, we have implemented two
major components of the project. In both components, after parallelization and
mapping, the input code is fed to a compiler analysis module. Purpose of this
module is to identify the set of CMP nodes that communicate with each other.
This information is subsequently passed to the solver which determines the
location of each node within the NoC based CMP and
the type of processor used for each node. Specific solvers we used in
implementing this approach are: (1) a genetic algorithm (GA) based solver
implemented using Java and (2) an integer linear programming (ILP) based solver
implemented on a commercial tool.
Project
resulted in one direct journal publication and four indirect journal
publications. During the course of the first part of the project one M.S.
student supported and is about to graduate and a Ph.D. student is still being
supported.
PROJECT OBJECTIVES FOR THE PERIOD
In the first
nine months of the project we have worked on the project setup and initial
requirements. As explained in the initial project proposal, main tasks
performed in this context are hiring research assistants, training of researchassistants, literature survey, software system
design, collecting benchmarks. Main objectives of the project for this period
apart from the initial setup are:
Distribute the available area among the processor
cores and the memory blocks
Limited die
area needs to be effectively utilized as applications differ in their
processing needs and memory requirements. While some applications can be
processor hungry and require relevantly less data, others may process bigger
data sets. Former will prefer a larger processing power, whereas the latter
will benefit from a bigger cache. For a given application, we will first
analyze the application requirements using a profiler which, in turn, will be
used in formulating the area distribution problem.
Processor selection
When Alpha
cores EV4 and EV8 is compared [2,3], one can see that EV8 consumes 80 times
larger area and 12 times more power to gain just above 2 times better
performance. Marginal benefit provided by larger cores is lower compared to
smaller cores; consequently one should try to utilize the available chip area
and power budget according to the application needs. Based on the application
requirements obtained by profiler, proposed approach will select the suitable
processing core using ILP and heuristic techniques.
WORK PROGRESS AND ACHIEVEMENTS DURING THE PERIOD
One of the
main achievements in the first period of the project is evolutionary-based
optimization. Evolutionary computational techniques start with an initial
population, where individuals of the population are generated through
variations in the input parameters. New generations are generated using new
models created through cross-over and mutation. Similar to natural selection,
some of the individuals in the population are promoted to the next generation
using a fitness function. Cross-over is performed on highest fitness
individuals, thereby forcing low-fitness individuals disappear.
We try to
minimize the communication and computation of a given parallel application by
selection and placement of nodes through a genetic algorithm (GA). We start our
GA implementation with a set of randomly generated chromosomes which form our
initial population. Size of this initial population is selected based on the NoC size and the number of different type of processors, as
it greatly affects the solution time. In our GA implementation, chromosomes are
represented as triplets, (t, x, y), where t indicates the type of processor
used for that node, whereas (x, y) indicates the coordinates of the location of
the node.
We have also
implemented an integer linear programming (ILP) based formulation of the
problem of minimizing communication and computation by determining the optimal
selection and placement of nodes in an NoC based CMP. ILP provides a set of techniques that solve
those optimization problems in which both the objective function and
constraints are linear functions and the solution variables are restricted to
be integers. The 0-1 ILP is an ILP problem in which each (solution) variable is
restricted to be either 0 or 1.
We performed
experiments with eight different array-based benchmark codes parallelized
through an optimizing compiler built upon SUIF. We performed experiments with
four different execution models for each benchmark; Homogeneous (same type of
processors), Random (randomly selected processors), GA (genetic algorithm based
processor selection), and ILP (integer linear programming based processor
selection). According to the experiments, while some of the benchmarks prefer a
homogeneous NoC topology, as in the case of ammp and vortex, most of the benchmarks take advantage of
heterogeneous NoC. The effectiveness of our GA-based
and ILP approaches increases with increasing number of processor types.
DISSEMINATION ACTIVITIES
Directly-related
journal publications (specifically acknowledged the FP7 support):
Heterogeneous NoC Design Through Evolutionary Computing, by Ozcan Ozturk and
Dilek Demirbas, International Journal of Electronics, Francis & Taylor,
Volume 97, No. 10, pages 1139-1161, 2010.
Indirect
journal publications (specifically acknowledged the FP7 support):
On-Chip Memory Space Partitioning for Chip Multiprocessors using
Polyhedral Algebra, by Ozcan Ozturk, Mahmut Kandemir, and Mary J. Irwin,
IET Computers & Digital Techniques, Volume 4, Issue 6, pages 484-498, 2010.
Data Locality and Parallelism Optimization Using A
Constraint-Based Approach, by O. Ozturk, Journal of Parallel and
Distributed Computing (JPDC), DOI: 10.1016/j.jpdc.2010.08.005.
Improving Chip Multiprocessor Reliability Through Code Replication, by Ozcan Ozturk. Computers
& Electrical Engineering, Elsevier, Issue 36, pages 480-490, 2010, ISSN
0045-7906, DOI: 10.1016/j.compeleceng.2009.11.004).
Compiler Directed Communication Reliability Enhancement for
Chip Multiprocessors, by O. Ozturk, M. Kandemir, S. Narayanan, and M. J.
Irwin. ACM SIGPLAN Notices, Vol. 45, No. 4, pp. 85-94, 2010.