Similar Literature
19 similar articles found.
1.
This paper explores the design space of runahead (pre-execution) mechanisms for in-order processors and quantitatively analyzes how the benefit of runahead varies with cache capacity and memory access latency. Experimental results show that, for in-order processors, both preserving and reusing the valid results produced during runahead and forwarding data between runahead memory instructions effectively improve processor performance, and the former also effectively reduces energy overhead. Combining the two improves the performance of the baseline processor by 24.07% on average while increasing energy consumption by only 4.93%. We further find that runahead still brings substantial performance gains even when the cache is large, and that as memory access latency increases, the advantages of runahead in improving both the performance and the energy efficiency of in-order processors become even more pronounced.
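As a hedged illustration of the mechanism this abstract evaluates, the toy C sketch below shows the core idea of runahead on an in-order core: on a load miss, execution continues past the miss instead of stalling, destinations that depend on the missing value are marked poisoned, and independent loads found in the runahead window are issued as prefetches. The cache geometry, instruction format, and window length are simplifications invented for this sketch, not the paper's microarchitecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define NREGS  32
#define LINES  64
#define WINDOW 128   /* how far runahead may scan (arbitrary choice) */

/* Toy direct-mapped cache: tags only, no data. */
static uint64_t tag[LINES];
static bool     valid[LINES];

static bool cache_lookup(uint64_t addr, bool fill) {
    int i = (int)((addr >> 6) % LINES);
    uint64_t t = addr >> 6;
    if (valid[i] && tag[i] == t) return true;   /* hit              */
    if (fill) { valid[i] = true; tag[i] = t; }  /* fill on miss     */
    return false;
}

/* Simplified instruction: a load reads mem[regs[base]+off] into dst. */
typedef struct { bool is_load; int dst, base; int64_t off; } Inst;

void run(const Inst *prog, int n, uint64_t regs[NREGS]) {
    for (int pc = 0; pc < n; pc++) {
        if (!prog[pc].is_load) continue;        /* ALU ops ignored (toy) */
        uint64_t addr = regs[prog[pc].base] + (uint64_t)prog[pc].off;
        if (cache_lookup(addr, true)) continue; /* hit: just proceed     */

        /* Miss: enter runahead.  Independent loads become prefetches
         * that warm the cache for the real pass after the miss returns. */
        bool poisoned[NREGS] = { false };
        poisoned[prog[pc].dst] = true;
        int end = pc + WINDOW < n ? pc + WINDOW : n;
        for (int ra = pc + 1; ra < end; ra++) {
            const Inst *ri = &prog[ra];
            if (ri->is_load) {
                bool known = !poisoned[ri->base] &&
                    cache_lookup(regs[ri->base] + (uint64_t)ri->off, true);
                poisoned[ri->dst] = !known;     /* unknown loads poison dst */
            } else {
                poisoned[ri->dst] = false;      /* crude: ALU deps untracked */
            }
        }
        /* The paper additionally saves valid runahead results and reuses
         * them after the miss returns, which is where its energy benefit
         * comes from; that bookkeeping is omitted here. */
    }
}
```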

2.
Many high-performance embedded processors incorporate multi-level caches, hardware prefetching, and software prefetching. To make the execution time of hard real-time tasks that use software prefetching predictable, this paper proposes a cache WCET analysis method that supports software prefetching. The method extends the abstract-interpretation model of multi-level caches with software-prefetch semantics and analyzes the impact of software prefetching on worst-case performance and energy consumption. Experimental results show that the method can effectively analyze the behavior of multi-level caches under software prefetching; moreover, software-prefetch optimization reduces the WCET of certain hard real-time tasks with heavy memory-access misses by 22.9% on average and their energy consumption by 24.1% on average.

3.
To address the capacity-miss problem faced by private cache organizations in multicore processors, this paper proposes an inter-core capacity-sharing mechanism based on fine-grained pseudo-partitioning. By attaching an array of weighted saturating counters to each cache bank at fine granularity, the mechanism tracks and predicts the differing memory demands of threads and controls the ratio between the private and shared regions of each cache set for each core. These ratios then guide each core's victim-block replacement, spill, and receive decisions, and an intelligent inter-core capacity-borrowing mechanism balances the differing memory demands across cores, alleviating capacity misses in private multicore caches. Experiments on an architecture-level full-system simulator show that the mechanism effectively mitigates the capacity-miss problem of private multicore caches and reduces the average memory access latency of multithreaded applications.
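A minimal sketch of the statistics machinery described above: a per-core array of weighted saturating counters biases the private/shared way split of each set toward the cores showing higher miss pressure. The counter width, weights, decay rule, and way floor are illustrative assumptions, not values from the paper.

```c
#include <stdint.h>

#define CORES   4
#define WAYS    8
#define CTR_MAX 63   /* 6-bit saturating counter (assumed) */

/* One weighted saturating counter per core for this bank. */
static uint8_t demand[CORES];

/* Misses push a core's counter up by a weight; it saturates at CTR_MAX. */
void count_miss(int core, int weight) {
    int v = demand[core] + weight;
    demand[core] = v > CTR_MAX ? CTR_MAX : (uint8_t)v;
}

/* Periodic decay so stale demand fades. */
void decay_all(void) {
    for (int c = 0; c < CORES; c++) demand[c] >>= 1;
}

/* Map the local core's relative demand to the number of ways per set it
 * keeps private; the remaining ways form the shared region that can
 * receive spilled victim blocks from needier cores. */
int private_ways(int self) {
    int total = 0;
    for (int c = 0; c < CORES; c++) total += demand[c];
    if (total == 0) return WAYS / 2;      /* no signal: split evenly */
    int w = WAYS * demand[self] / total;
    if (w < 1) w = 1;                     /* never surrender all ways */
    if (w > WAYS - 1) w = WAYS - 1;       /* always share something   */
    return w;
}
```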

4.
To address the steadily increasing L2 access latency of chip multiprocessors and the large runtime differences in execution rate among the threads of a parallel program, this paper proposes a sharing-aware active data-push cache technique (SAAPC). SAAPC exploits the key property that the performance of a parallel program is determined by its slowest thread, together with the observations that parallel threads share read data heavily and that shared read data exhibit good access locality. It uses an instruction-based method to predict shared read-data streams and actively pushes the data into a lagging thread's L1 cache before that thread needs them, thereby reducing the data-access latency of slower threads, raising their execution rate, and narrowing the rate gap between lagging and leading threads. SAAPC also avoids the extra off-chip bandwidth that prefetching incurs. Simulations of five memory-intensive parallel programs from the SPLASH-2 benchmark suite on the SESC simulator show that, compared with a conventional shared cache, SAAPC narrows the execution-rate gap among parallel threads and improves system IPC by 7% on average and by up to 13.1%.

5.
Design of the cache module in a pipelined processor
A pipelined architecture greatly increases instruction throughput, but slow main-memory access still limits overall system performance. The cache design presented here acts as a high-speed buffer between the pipeline and main memory; it effectively relieves the memory-access bottleneck and lets the pipeline run at full efficiency. The paper first analyzes the structural characteristics of the pipeline to determine the cache's structure and function, and on that basis presents a set-associative cache design. It analyzes the control flow of cache read and write operations, gives an implementation of the LRU (least recently used) replacement policy, and finally discusses the cooperation between the cache and the pipeline through a description of burst instruction fetching.
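To make the replacement policy concrete, here is a hedged C sketch of LRU in a set-associative cache using per-way age counters, one common hardware-friendly formulation. The geometry and the counter scheme are generic textbook choices, not necessarily those of the paper's design.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4
#define SETS 128

typedef struct {
    uint32_t tag[WAYS];
    bool     valid[WAYS];
    uint8_t  age[WAYS];   /* 0 = most recently used */
} Set;

static Set cache[SETS];

/* Standard age-counter LRU update: ways younger than the touched way's
 * old age grow one step older, and the touched way becomes age 0. */
static void touch(Set *s, int w) {
    uint8_t old = s->age[w];
    for (int i = 0; i < WAYS; i++)
        if (s->valid[i] && s->age[i] < old) s->age[i]++;
    s->age[w] = 0;
}

/* Returns true on hit; on miss, fills an invalid way if one exists,
 * otherwise evicts the way with the largest age (the LRU way). */
bool cache_access(uint32_t addr) {
    Set *s = &cache[(addr >> 6) % SETS];
    uint32_t t = addr >> 6;
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == t) { touch(s, w); return true; }
        if (!s->valid[w]) victim = w;                    /* prefer empty */
        else if (s->valid[victim] && s->age[w] > s->age[victim]) victim = w;
    }
    s->valid[victim] = true;
    s->tag[victim]   = t;
    touch(s, victim);
    return false;
}
```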

6.
Heterogeneous multicore memory-access techniques for the Cell Broadband Engine Architecture
The Cell Broadband Engine Architecture (CBEA), a multicore high-performance processor architecture, requires software to explicitly manage its layered memory hierarchy, which hurts both programmability and performance. To address this, the paper proposes a heterogeneous multicore memory-access technique for the CBEA. Memory accesses are classified into bulk accesses and on-demand accesses: the on-chip overhead of bulk-access computations is reduced by judicious placement of data buffers, the off-chip overhead of on-demand accesses is reduced by a software-managed cache supporting coarse-grained accesses together with data prefetching, and programmability is improved by packaging the techniques as a memory-access interface library. Experimental results show that the interface library outperforms ALF and CellSs by 30%-50% for bulk accesses, the software-managed cache outperforms the CBE Software Development Kit by 20%-30% for on-demand accesses, and 4-way data-prefetched accesses perform about 50% better than a single-way buffer.

7.
The effectiveness of traditional data-prefetching techniques drops markedly on irregular applications with complex data structures. To address this, the paper proposes a phase-based prefetching strategy for irregular data that exploits the phase behavior of data accesses at runtime, studying applications' memory-access patterns and the prefetch scheduling mechanism. By profiling an application's memory behavior online, the strategy identifies data-access phases in which access-performance metrics are stable, together with prefetch phases exhibiting characteristic access behavior, and within each data-access phase it dynamically adjusts prefetch operations according to the patterns of the prefetch phase. Experimental results show that, compared with traditional stream-based data prefetching, the phase-based strategy reduces useless prefetches and improves the performance of irregular applications more effectively.

8.
In embedded processors, caches account for an ever larger share of power consumption. Since different kinds of applications place different runtime demands on instruction-cache and data-cache capacity, this paper proposes a new joint capacity-allocation algorithm that balances the runtime capacity demands of the instruction cache and the data cache and dynamically adjusts the capacity and configuration of the L1 cache, using cache resources more effectively. MiBench simulations show that, compared with a conventional split cache, a split cache using the joint allocation algorithm reduces average energy consumption by 29.10% and the average energy-delay product by 33.38%.
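One way to realize joint allocation of the kind described above is interval-based way stealing between the instruction and data halves of a unified L1 way budget: at each interval, the side under higher miss pressure gains a way. This sketch, with invented interval length, way budget, and floor, shows the control loop only, not the paper's actual algorithm.

```c
#include <stdint.h>

#define TOTAL_WAYS 16      /* combined L1 way budget (assumed)  */
#define MIN_WAYS    2      /* floor per side (assumed)          */
#define INTERVAL   100000  /* accesses per adjustment (assumed) */

static int icache_ways = TOTAL_WAYS / 2;
static int dcache_ways = TOTAL_WAYS / 2;
static uint32_t imiss, dmiss, accesses;

/* Called on every L1 access with its outcome. */
void on_access(int is_instr, int was_miss) {
    if (was_miss) { if (is_instr) imiss++; else dmiss++; }
    if (++accesses < INTERVAL) return;

    /* Steal one way from the side with lighter miss pressure.  Ways the
     * loser gives up could also be power-gated, which is where the
     * reported energy savings would come from; omitted here. */
    if (imiss > dmiss && dcache_ways > MIN_WAYS) {
        icache_ways++; dcache_ways--;
    } else if (dmiss > imiss && icache_ways > MIN_WAYS) {
        dcache_ways++; icache_ways--;
    }
    imiss = dmiss = accesses = 0;
}
```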

9.
This paper studies and proposes a data-cache organization with a two-dimensional access mechanism (2D Cache) together with its update and management policy. The organization improves the memory-access efficiency of reconfigurable systems while keeping the hardware storage overhead under control. Experimental results show that with only 4 KB of data-cache overhead, the memory-access performance of the reconfigurable system improves by 29.16%-35.65%, and the design adapts well, delivering good optimization for media-processing algorithms of different standards. Silicon measurements show that a reconfigurable system using this data-cache design meets real-time 1080p@30fps decoding requirements at 200 MHz, achieving more than 1.8 times the performance of comparable international architectures.

10.
Workload analysis is an essential precursor to last-level cache design for chip multiprocessors. This paper analyzes the memory behavior of a set of memory-intensive multithreaded RMS (recognition-mining-synthesis) workloads, including working-set size, data-sharing behavior, and spatial locality; explores the last-level cache design space; and discusses cache architecture design for future chip multiprocessors. Experimental results show that a large DRAM cache helps satisfy the capacity demands of these workloads' large working sets: a 128 MB DRAM cache reduces average L1 miss latency by 18% compared with no DRAM cache. A shared cache design outperforms a private design: an 8 MB shared cache delivers 25% better cache performance than private caches of the same total capacity. Stride-based hardware data prefetching improves performance by a further 25%. For memory-intensive RMS workloads, the cache subsystem should therefore combine a 128 MB DRAM cache, an 8 MB on-chip SRAM cache, and an 8-entry stream prefetcher.
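The stride-based hardware prefetcher evaluated above is a well-known design; as a hedged illustration, here is a minimal reference-prediction-table sketch in C. The table size and the two-hit confidence rule are common textbook defaults, not the paper's parameters, and issue_prefetch is a stub standing in for the memory system.

```c
#include <stdint.h>

#define ENTRIES 64   /* PC-indexed reference prediction table (assumed) */

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    uint8_t  conf;       /* saturating confidence, 0..3 */
} RptEntry;

static RptEntry rpt[ENTRIES];

/* Stub hook into the memory system (assumed interface). */
static void issue_prefetch(uint64_t addr) { (void)addr; }

/* Called on every load: track the stride per load PC and, once the same
 * stride has been confirmed twice, prefetch one stride ahead. */
void on_load(uint64_t pc, uint64_t addr) {
    RptEntry *e = &rpt[(pc >> 2) % ENTRIES];
    int64_t stride = (int64_t)(addr - e->last_addr);

    if (stride == e->stride && stride != 0) {
        if (e->conf < 3) e->conf++;
    } else {
        e->stride = stride;   /* retrain on a new pattern */
        e->conf   = 0;
    }
    e->last_addr = addr;

    if (e->conf >= 2)
        issue_prefetch(addr + (uint64_t)e->stride);
}
```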

11.
This paper analyzes a percolation cache hierarchy model that lets data migrate on-chip in a pipelined, percolating fashion, together with the basic algorithms for on-chip data percolation. To validate the model by simulation and analyze the properties of the model and of its data-migration algorithms, the paper gives the structural relations of the percolation model and a formal method for describing percolated data migration. Simulation results show that the model offers a clear advantage in improving processor memory-access hit rates.

12.
Efficiency of Cache Mechanism for Network Processors
With the explosion of network bandwidth and the ever-changing requirements of diverse network-based applications, the traditional processing architectures, i.e., the general-purpose processor (GPP) and application-specific integrated circuits (ASICs), cannot provide sufficient flexibility and high performance at the same time. Thus, the network processor (NP) has emerged as an alternative that meets these dual demands of today's network processing. An NP combines embedded multi-threaded cores with a rich memory hierarchy that application developers can customize to different networking circumstances. In today's NP architectures, multithreading prevails over the cache mechanism, which has achieved great success in GPPs at hiding memory access latency. This paper focuses on the efficiency of the cache mechanism in an NP. Theoretical timing models of packet processing are established to evaluate cache efficiency, and experiments are performed on real-life network backbone traces. Testing results show that the cache mechanism can yield a throughput improvement of nearly 70%. Accordingly, the cache mechanism is still efficient and irreplaceable in network processing, despite the existence of multithreading.
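The paper builds full timing models of packet processing; as a hedged, simplified stand-in (the symbols are ours, not the paper's), the per-packet service time and throughput with a cache can be written as

$$ T_{\text{pkt}} \;=\; T_{\text{compute}} \;+\; N_{\text{mem}}\,\bigl(t_{\text{hit}} + m\,t_{\text{miss}}\bigr), \qquad \text{Throughput} \;\approx\; \frac{1}{T_{\text{pkt}}} $$

where $N_{\text{mem}}$ is the number of memory references per packet and $m$ the miss ratio. A cache that lowers $m$ shrinks $T_{\text{pkt}}$ directly, which is the route by which the reported near-70% throughput gain arises, whereas multithreading instead hides $t_{\text{miss}}$ behind other packets' computation.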

13.
Large-scale key-value stores are widely used in many Web-based systems to store huge amounts of data as (key, value) pairs. To reduce the latency of accessing such (key, value) pairs, an in-memory cache system is usually deployed between the front-end Web system and the back-end database system. In practice, a cache system may consist of a number of server nodes, and fault tolerance is a critical feature for maintaining latency Service-Level Agreements (SLAs). In this paper, we present the design, implementation, analysis, and evaluation of R-Memcached, a reliable in-memory key-value cache system built on top of the popular Memcached software. R-Memcached exploits coding techniques to achieve reliability and can tolerate up to two node failures. Our experimental results show that R-Memcached maintains very good latency and throughput performance even during node failures.
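The abstract says R-Memcached uses coding to tolerate up to two node failures but does not give the scheme; a standard way to reach that bound is Reed-Solomon-style P/Q parity over GF(2^8), sketched below. This is our illustrative assumption, not R-Memcached's published code.

```c
#include <stdint.h>
#include <stddef.h>

/* Multiply in GF(2^8) with the polynomial x^8+x^4+x^3+x^2+1 (0x11d). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry) a ^= 0x1d;
        b >>= 1;
    }
    return p;
}

/* Encode k equal-length data buffers into P (plain XOR) and Q (weighted)
 * parity.  Any two lost buffers -- data or parity -- can then be
 * recovered by solving a 2x2 linear system over GF(2^8), which is
 * exactly what tolerating two node failures requires (recovery is
 * omitted for brevity). */
void encode_pq(const uint8_t *data[], int k, size_t len,
               uint8_t *p, uint8_t *q) {
    for (size_t i = 0; i < len; i++) {
        uint8_t pv = 0, qv = 0, coef = 1;   /* coef walks g^j with g = 2 */
        for (int j = 0; j < k; j++) {
            pv ^= data[j][i];
            qv ^= gf_mul(coef, data[j][i]);
            coef = gf_mul(coef, 2);
        }
        p[i] = pv;
        q[i] = qv;
    }
}
```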

14.
The performance of scalable shared-memory multiprocessors suffers from three types of latency: memory latency, latency caused by inter-process synchronization, and latency caused by instructions that take multiple cycles to produce results. To tolerate all three, we propose coupling coarse-grained multithreading, a superscalar processor, and a reconfigurable device, overlapping the long-latency operations of one thread of computation with the execution of other threads. The superscalar principle tolerates instruction latency by issuing several instructions simultaneously, and a DPGA is coupled with the processor to reduce the context-switching overhead.

15.
Today's multicore platforms mostly adopt shared-cache architectures, but cache conflicts among tasks running on different cores make it very difficult to compute a program's worst-case execution time (WCET). This paper therefore proposes using page coloring to eliminate inter-core conflicts in shared multicore caches. The advantage of this approach is that existing single-core WCET analysis techniques can then bound program execution times on multicores. A memory-management system supporting page-coloring partitioning was implemented in Linux and evaluated with standard benchmark suites. Experimental results show that with this memory-management policy, program execution times on the same multicore platform become predictable.
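The partitioning above hinges on one address calculation: in a physically indexed cache, pages whose physical page numbers map to different cache "colors" can never conflict, so giving each core a disjoint color set isolates it from the others. A minimal sketch with generic geometry (not the paper's platform constants):

```c
#include <stdint.h>

#define PAGE_SIZE   4096u
#define CACHE_SIZE  (2u * 1024 * 1024)   /* shared L2, assumed 2 MB */
#define WAYS        8u
#define CORES       4u

/* One way spans CACHE_SIZE/WAYS bytes of sets; the number of page-sized
 * slices in that span is the number of colors (64 with these numbers). */
#define NUM_COLORS  (CACHE_SIZE / WAYS / PAGE_SIZE)

unsigned page_color(uint64_t phys_addr) {
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}

/* The page allocator then serves core `core` only physical pages whose
 * color falls in that core's slice, e.g. an even split of colors: */
int color_belongs_to(unsigned color, unsigned core) {
    unsigned per_core = NUM_COLORS / CORES;
    return color / per_core == core;
}
```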

16.
In a shared-memory massively parallel multiprocessor system, the throughput of shared-memory accesses is substantial, while the access latency grows in proportion to the number of stages of the multistage interconnection network. To prevent the loss of processing efficiency caused by this latency, the whole system is treated as a pipelined processing architecture and combined with cache memories. This paper presents simulated performance-evaluation results for such a massively parallel multiprocessor system.

17.
Component overclocking is an effective approach to speeding up the components of a system to realize higher program performance; it includes processor overclocking and memory overclocking. However, overclocking unavoidably increases power consumption. Our goal is to optimally improve the performance of scientific computing applications without increasing the total power consumption of a processor-memory system. We built a processor-memory energy-efficiency model for multicore-based systems that coordinates the performance and power of the processor and memory. Our model exploits performance-boost opportunities for a processor-memory system by adopting processor overclocking, processor Dynamic Voltage and Frequency Scaling (DVFS), memory active-ratio adjustment, and memory overclocking, according to the needs of different scientific applications. The model also provides a total-power control method based on the same four factors. We propose a processor-memory Coordination-based holistic Energy-Efficient (CEE) algorithm that improves performance without increasing total power consumption. Experimental results show an average performance improvement of 9.3% across all 14 benchmarks while total power consumption does not increase; the maximal improvement, 13.1%, comes from the dedup benchmark. Our experiments validate the effectiveness of the holistic energy-efficient model and technique.
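In the spirit of the paper's coordination model (our notation, a simplification of theirs), the CEE problem can be read as maximizing performance over the processor frequency $f_p$, memory frequency $f_m$, and memory active ratio $r$ under the baseline power budget:

$$ \max_{f_p,\,f_m,\,r} \ \mathrm{Perf}(f_p, f_m, r) \quad \text{s.t.} \quad P_{\mathrm{cpu}}(f_p) + P_{\mathrm{mem}}(f_m, r) \le P_{\mathrm{baseline}} $$

Overclocking raises one term of the power sum only when DVFS or a lower memory active ratio frees headroom in the other, which is why the four knobs must be coordinated rather than tuned independently.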

18.
This paper presents a novel hierarchy cache architecture for optimizing I/O performance. The main idea of the hierarchy cache is to use a few megabytes of RAM and a pagefile to form a two-level cache architecture. The pagefile is equivalent to the cache disk in DCD (Disk Caching Disk); it outperforms the data disks because data are accessed in different units and in different ways. Small writes are first collected in the RAM cache, and the data are later transferred to the pagefile in large writes. When the system is idle, data are destaged from the pagefile to the data disks. Performance tests show that the hierarchy cache improves I/O performance dramatically for small writes: a mail server using the hierarchy cache driver handles transactions about 2.2 times faster than a normal mail server. The hierarchy cache is implemented as a filter driver, so it is transparent to the Windows 2000/Windows XP operating systems.
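A hedged sketch of the two-level write path described above: small writes accumulate in a RAM buffer, are flushed to the pagefile in one large sequential write when the buffer fills, and are destaged to the data disks when the system goes idle. The buffer size and the back-end hooks are invented for illustration, not taken from the driver.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RAM_BUF (4u * 1024 * 1024)   /* level-1 RAM cache size (assumed) */

static uint8_t ram[RAM_BUF];
static size_t  used;

/* Hypothetical back-end hooks standing in for real disk I/O. */
static void pagefile_append(const void *buf, size_t len) { (void)buf; (void)len; }
static void datadisk_destage_from_pagefile(void)         { }

/* Small writes land in RAM; one large sequential write goes to the
 * pagefile only when the buffer fills, which is why small-write latency
 * improves so dramatically.  Assumes len <= RAM_BUF. */
void hc_write(const void *buf, size_t len) {
    if (used + len > RAM_BUF) {       /* flush as a single big write */
        pagefile_append(ram, used);
        used = 0;
    }
    memcpy(ram + used, buf, len);
    used += len;
}

/* Called when the system is idle: push data down to the data disks. */
void hc_on_idle(void) {
    if (used) { pagefile_append(ram, used); used = 0; }
    datadisk_destage_from_pagefile();
}
```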

19.
Very Long Instruction Word (VLIW) architectures are commonly used in application-specific domains because of their parallelism and low-power characteristics. Recently, parameterization of such architectures has allowed runtime adaptation of the issue-width to match the inherent Instruction-Level Parallelism (ILP) of an application. In one implementation of this approach, an issue-width switch dynamically triggers reconfiguration of the data cache at runtime. In this paper, the relationship between cache resizing and issue-width is investigated in detail. We have observed that the cache requirement does not always correlate with the issue-width of the VLIW processor. To better coordinate cache resizing with the changing issue-width, we present a novel feedback mechanism that "blocks" low-yield cache resizings when the issue-width changes. In this manner, the feedback cache mechanism works in coordination with issue-width changes, leading to a noticeable improvement in cache performance. Experiments show an average of 10% energy savings and a 2.3% decline in cache misses compared with the cache without the feedback mechanism. The feedback mechanism is therefore shown to ensure that more benefit is obtained from dynamic and frequent reconfiguration.
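A minimal sketch of the feedback idea: when the issue-width switches, the controller compares the miss rate last observed at the candidate cache configuration against the current one and blocks ("low-yield") reconfigurations whose predicted benefit falls below a threshold. The table layout, units, and threshold are our assumptions, not the paper's design.

```c
#include <stdint.h>

#define WIDTHS 3   /* e.g. 2-, 4-, and 8-issue modes (assumed) */

/* Miss rate (per mille) last observed while running with each width's
 * preferred data-cache configuration; 0 means "no history yet". */
static uint32_t miss_rate[WIDTHS];
static const uint32_t THRESH = 5;   /* min. per-mille gain to resize */

/* Returns the cache configuration (indexed by issue-width) to use after
 * a switch; the resize is blocked when history predicts little gain. */
int on_width_switch(int old_w, int new_w, uint32_t cur_miss_rate) {
    miss_rate[old_w] = cur_miss_rate;      /* feedback: record the yield */
    uint32_t predicted = miss_rate[new_w];
    if (predicted != 0 &&
        predicted + THRESH > cur_miss_rate)   /* gain below threshold... */
        return old_w;                         /* ...keep current config  */
    return new_w;                             /* reconfigure the cache   */
}
```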

