ABSTRACT

Out-of-order superscalar processors are currently the only architecture that speeds up irregular programs, but they suffer from poor power efficiency. To tackle this issue, we focused on how to specify register operands. Specifying operands by register names, as conventional RISC does, requires register renaming, resulting in poor power efficiency and preventing an increase in the front-end width. In contrast, a recently proposed architecture called STRAIGHT specifies operands by inter-instruction distance, thereby eliminating register renaming. However, STRAIGHT has strong constraints on instruction placement, which generally results in a large increase in the number of instructions.

We propose Clockhands, a novel instruction set architecture that has multiple register groups and specifies a value as “the value written in this register group k times before.” Like STRAIGHT, Clockhands does not require register renaming. Unlike STRAIGHT, however, Clockhands has much looser constraints on instruction placement, allowing programs to be written with almost the same number of instructions as conventional RISC. We implemented a cycle-accurate simulator, an FPGA implementation, and a first-step compiler for Clockhands and evaluated benchmarks including SPEC CPU. On a machine with a fetch width of eight, the evaluation results showed that Clockhands consumes 7.4% less energy than RISC while achieving comparable performance. This energy reduction increases significantly to 24.4% when simulating a futuristic up-scaled processor with a fetch width of 16, which shows that Clockhands enables a wider front-end.
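The operand-naming idea can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the ISA defined in the paper: the class name `RegisterGroup`, the group names `t` and `s`, the depth of 4, and the convention that k = 1 means the most recent write are all hypothetical choices made for illustration.

```python
from collections import deque


class RegisterGroup:
    """Toy model of one register group: a bounded history of written values.

    Assumption (not from the paper): k = 1 names the most recent write,
    k = 2 the write before that, and so on.
    """

    def __init__(self, depth):
        # Newest value sits at index 0; the oldest is dropped when full.
        self.history = deque(maxlen=depth)

    def write(self, value):
        self.history.appendleft(value)

    def read(self, k):
        if not 1 <= k <= len(self.history):
            raise IndexError(f"no value written {k} times before")
        return self.history[k - 1]


# Two hypothetical groups: "t" for short-lived temporaries, "s" for a
# longer-lived value.
groups = {"t": RegisterGroup(4), "s": RegisterGroup(4)}

groups["s"].write(10)  # long-lived value
groups["t"].write(1)
groups["t"].write(2)

# A toy "add" whose operands are the last two writes to group t.
total = groups["t"].read(1) + groups["t"].read(2)
groups["t"].write(total)

# Writes to group t did not change the distance of the value in group s.
assert groups["s"].read(1) == 10
assert groups["t"].read(1) == 3
```

In this model, a write to one group leaves the distances in every other group unchanged. That is one plausible reading of why per-group distances constrain instruction placement less than a single inter-instruction distance does.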

Index Terms

  1. Clockhands: Rename-free Instruction Set Architecture for Out-of-order Processors
