Research Assistant Positions Available: Looking for bright, creative, self-motivated, and hardworking students to work in our funded projects. If you are interested in one of the following research topics, you can contact Dr. Ozcan Ozturk to discuss research opportunities.
Parallel Systems: Cloud Computing, GPU Computing, Manycore Accelerators, Parallel Programming
Processor Architecture: Multicore Processors, Computer Organization, Reliability-Aware 3D Chip Multiprocessor Design, Heterogeneous Chip Multiprocessors, Network On Chip Architectures
Compilers: Automatic Parallelization, Dataflow Analysis, Optimizing Compilers, Memory Optimization
The aim of this project is to design a processor architecture that will run large datasets on irregular graphs quickly, efficiently and in an easily programmable fashion. Most of the studies previously presented for graph applications are only at the software or accelerator level, and the processor-level designs are different from the processor to be introduced for this project in terms of hardware cost, required architectural support, and instruction set modifications.
Using Accelerator + CPU Technologies in Graph Parallel Applications implemented on Intel HARP Platform. Users implement accelerator hardware on FPGA and execute applications on both FPGA and CPU simultaneously. The proposed architecture addresses the limitations of the existing multi-core CPU and GPU architectures. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions that can execute on FPGA and CPU at the same time.
Using a compiler or a code analysis tool, a program written in a certain language can be optimized for the same language or partly/completely translated into a different target language. However, improvements can be beyond optimization and efficiency, such as enhancing the user experience. Although hardware accelerators increase the efficiency in orders of magnitude, the underlying implementation for these architectures such as GPU, FPGA, and ASIC, are much harder when compared to a high level language such as C++. Therefore, if a developer cannot use the accelerators due to cost of hardware and adversities of the implementation, he or she will have to accept the outcome of CPU execution. On the other hand, the existence of a tool that accepts a simple C++ code and can execute on an FPGA will provide benefits of both worlds and will allow executing on larger graphs.
Specialized hardware accelerators can significantly improve the performance and power efficiency of compute systems. In this research, we focus on hardware accelerators for graph analytics applications and propose a configurable architecture template that is specifically optimized for iterative vertex-centric graph applications with irregular access patterns and asymmetric convergence. The proposed architecture addresses the limitations of the existing multi-core CPU and GPU architectures for these types of applications. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions. After that, a cycle-accurate simulator and RTL can be generated to model the target hardware accelerators. In our experiments, we study several graph-parallel applications, and show that the hardware accelerators generated by our template can outperform a 24 core high end server CPU system by up to 3x in terms of performance. We also estimate the area requirement and power consumption of these hardware accelerators through physical-aware logic synthesis, and show up to 65x better power consumption with significantly smaller area.
Manycore accelerators are being deployed in the Cloud to improve the processing capabilities and to provide heterogeneity as Cloud Computing Systems become increasingly complex and process ever-increasing datasets. In such systems, application scheduling and data mapping need to be enhanced to maximize the utilization of the underlying architecture. To achieve this, Cloud management schemes require a fresh look when the underlying components are heterogeneous in many different ways. Moreover, applications differ in the way they perform on various special accelerators. Our goal is to design a runtime management system for Cloud Systems with manycore accelerators.
Manycore accelerators are being deployed in many systems to improve the processing capabilities. In such systems, application mapping need to be enhanced to maximize the utilization of the underlying architecture. Especially in GPUs, mapping becomes critical for multi-kernel applications as kernels may exhibit different characteristics. While some of the kernels run faster on GPU, others may refer to stay in CPU due to the high data transfer overhead. Thus, heterogeneous execution may yield to improved performance compared to executing the application only on CPU or only on GPU. We would like to design systems with smart kernel mapping.
The importance of parallel programming has dramatically increased with the emergence of multicore and manycore architectures. We specifically focus on tools and techniques that enables the programmer to develop parallel programs wasily. The task of identifying parallel code sections for different manycore architectures including Intel MIC can be carried out by the compiler. We aim to alleviate the compiler infrastructure to parallelize the applications automatically.
Ability to stack separate chips in a single package enables three- dimensional integrated circuits (3D ICs). Heterogeneous 3D ICs provide even better opportunities to reduce the power and increase the performance per unit area. An important issue in designing a heterogeneous 3D IC is reliability. To achieve this, one needs to select the data mapping and processor layout carefully. We try to address this problem using an ILP approach.
Increasing complexity of applications and their large dataset sizes make it imperative to consider novel architectures that are efficient from both performance and power angles. Chip Multiprocessors (CMP) are one such example where multiple processor cores are placed into the same die. As technology scales, the International Technology Roadmap for Semiconductors (ITRS) projects that the number of cores in a chip multiprocessor (CMP) will drastically increase to satisfy performance requirements of future applications. A critical question that needs to be answered in CMPs is the size and strength of the cores. Homogeneous chip multiprocessors provide only one type of core to match these various application requirements, consequently not fully utilizing the available chip area and power budget. The ability to dynamically switch between different cores, and power down unused cores gives a key advantage to heterogeneous chip multiprocessing. One of the challenging problems in the context of heterogeneous chip multiprocessor systems is the placement of processor cores and storage blocks within the available chip area. Focusing on such a heterogeneous chip multiprocessor, we address different design decision problems.
Memory is a key parameter in embedded systems since both code complexity of embedded applications and amount of data they process are increasing. While it is true that the memory capacity of embedded systems is continuously increasing, the increases in the application complexity and dataset sizes are far greater. As a consequence, the memory space demand of code and data should be kept minimum. To reduce the memory space consumption of embedded systems, this paper proposes a control flow graph (CFG) based technique. Specifically, it tracks the lifetime of instructions at the basic block level. Based on the CFG analysis, if a basic block is known to be not accessible in the rest of the program execution, the instruction memory spaceallocated to this basic block is reclaimed. On the other hand, if the memory allocated to this basic block cannot be reclaimed, we try to compress this basic block. This way, it is possible to effectively use the available on-chip memory, thereby satisfying most of instruction/ data requests from the on-chip memory.
As commonly accepted, current performance trajectory to double the chip performance every 24 to 36 months, can be achieved by the integration of multiple processors on a chip rather than through increases in the clock rate of single processors due to the power limitations present in processor design. Multicore architectures have already made their way in the industry, with more aggressive configurations being prototyped such as the Intel's 80 core TeraFlop. Since future technologies offer the promise of being able to integrate billions of transistors on a chip, the prospects of having hundreds of processors on a single chip along with an underlying memory hierarchy and an interconnection system is entirely feasible. Point-to-point buses will no longer be feasible after a certain number of nodes as the communication requirements between nodes will exponentially increase with the number of processors. A viable interconnection system shown to be promising for these future CMPs is Network-on-Chip (NoC) since it provides scalable, flexible, and programmable communication. With this NoC based chip multiprocessor (NoC based CMP) as the computing platform, a very rich set of research challenges arise. Circuit and architectural challenges such as router design, IP placement, and sensor placement are currently being studied in both industry and academia. In comparison, the work on heterogeneous alternatives for these architectures has received considerably less attention.
A critical component of a chip multiprocessor is its memory subsystem. This is because both power and performance behavior of a chip multiprocessor is largely shaped by its on-chip memory. While it is possible to employ conventional memory designs such as pure private memory or pure shared memory, such designs are very general and rigid, and may not generate the best behavior for a given embedded application. Our belief is that, for embedded systems that repeatedly execute the same application, it makes sense to design a customized, software-managed on-chip memory architecture. Such a memory architecture should be a hybrid one that contains both private and shared components. In the hybrid architecture case, while some processors have private memories, others do not have one. Similarly, the different processor groups can share memory in different fashions. For example, a memory component can be shared by two processors, whereas another component can be shared by three processors.
Designing such a customized hybrid memory architecture is not trivial because of at least three main reasons. First, since the memory architecture to be designed changes from one application to another, a hand-waived approach is not suitable, as it can be extremely time consuming and error-prone to go through the same complex process each time we are to design a memory system for a new application. Therefore, we need an automated strategy that comes up with the most suitable design for a given application. Second, design of such a memory needs to be guided by a tool that can extract the data sharing exhibited by the application at hand. After all, in order to decide how the different memory components need to be shared by parallel processors, one needs to capture the data sharing pattern across the processors. Third, data allocation in a hybrid memory system is not a trivial problem, and should be carried out along with data partitioning if we are to obtain the best results.