Similar Articles
20 similar articles found.
1.
Abstract:
A level-2 cache (L2 Cache) design for multi-core processors is proposed to handle memory access requests efficiently. An optimized directory protocol maintains data coherence with the level-1 cache (L1 Cache), and an on-chip directory maintains coherence among the L2 Caches and between them and the level-3 cache (L3 Cache). Within the L2 Cache, a MESIF-based cache coherence protocol is proposed, and a short-pipeline design that returns fetched data as early as possible is implemented; correlation-chain and remote-chain mechanisms resolve the deadlocks caused by snoop responses; pipeline-based sleep and wake-up techniques reduce leakage power, and fine-grained clock gating reduces dynamic power. Back-end design results show that the optimized L2 Cache meets its 2 GHz frequency target and has been successfully applied in a 16-core processor chip.
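The abstract does not spell out the protocol's transitions; below is a minimal, hypothetical C sketch of the distinguishing idea of a MESIF-style protocol, namely that a single Forward (F) owner answers snoop reads so plain sharers never send redundant responses. The state names and handler are illustrative, not the paper's design.

```c
/* Minimal sketch of MESIF-style snoop handling (illustrative only; the
 * paper's actual protocol, pipelines, and chain mechanisms are not shown). */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED, FORWARD } mesif_state_t;

/* How one cache line reacts to a remote read snoop. Only M, E, and F
 * supply data; the supplier demotes to S and the requester becomes the
 * new F owner, so at most one sharer ever answers. */
static mesif_state_t on_remote_read(mesif_state_t s, int *supply_data) {
    *supply_data = 0;
    switch (s) {
    case MODIFIED:   /* write back first, then share */
    case EXCLUSIVE:
    case FORWARD:
        *supply_data = 1;
        return SHARED;      /* requester takes over the F role */
    case SHARED:
    case INVALID:
    default:
        return s;           /* silent: the F owner answers instead */
    }
}

int main(void) {
    int supplies;
    mesif_state_t next = on_remote_read(FORWARD, &supplies);
    printf("F -> state %d, supplies data: %d\n", next, supplies);
    return 0;
}
```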

2.
To address the slow processing speed of embedded single-core processors and the limits on raising their clock frequency, a two-core embedded processor (TEP) model is proposed. For the processor's runtime dependence on memory and the memory-allocation problem, a scheme that simulates a distributed memory structure on top of a non-uniform memory organization is proposed; for concurrent accesses by the cores to shared data memory, an arbitration mechanism in the slave unit is given, realizing access to the shared resources; and for the large data volumes and high communication overhead between cores in multimedia applications, a transmission scheme that separates messages from data is proposed. The system was implemented and verified on an FPGA platform, and test results show that the TEP system achieves a large speedup with modest resource consumption and communication overhead.

3.
A design-space exploration of pre-execution (runahead) mechanisms for in-order processors is carried out, and the optimization effect of pre-execution is analyzed quantitatively as cache capacity and memory latency vary. Experimental results show that, for in-order processors, both saving and reusing the valid results produced during pre-execution and forwarding data between pre-executed memory instructions effectively improve processor performance, and the former also effectively reduces energy overhead. Combining the two improves the performance of the baseline processor by 24.07% on average while increasing energy consumption by only 4.93%. It is further found that pre-execution still brings a considerable performance gain even with large cache capacities, and that as memory latency increases, the advantage of pre-execution in improving both the performance and the energy efficiency of in-order processors becomes more pronounced.
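As a rough illustration of "saving and reusing valid pre-executed results", the hypothetical C sketch below keeps runahead results in a small PC-indexed buffer and consults it when normal execution replays the same instructions; all structures and sizes are assumptions, not the paper's microarchitecture.

```c
/* Illustrative sketch: results computed during runahead are kept in a
 * small PC-indexed buffer and reused on replay after the miss returns. */
#include <stdint.h>

#define RA_BUF_ENTRIES 64

typedef struct {
    uint64_t pc;
    uint64_t result;
    int      valid;   /* computed from non-poisoned operands */
} ra_entry_t;

static ra_entry_t ra_buf[RA_BUF_ENTRIES];

/* During runahead: record a result if its operands were valid. */
static void ra_record(uint64_t pc, uint64_t result, int operands_valid) {
    ra_entry_t *e = &ra_buf[pc % RA_BUF_ENTRIES];
    e->pc = pc;
    e->result = result;
    e->valid = operands_valid;
}

/* During replay: reuse a matching pre-executed result, skipping a
 * second execution of the instruction. Returns 1 on reuse. */
static int ra_reuse(uint64_t pc, uint64_t *result) {
    ra_entry_t *e = &ra_buf[pc % RA_BUF_ENTRIES];
    if (e->valid && e->pc == pc) { *result = e->result; return 1; }
    return 0;
}
```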

5.
For many-core processors, a dynamic reconfiguration technique based on a compute-resource partitioning mechanism is proposed. Centered on virtual compute clusters, it comprises hardware-supported dynamically reconfigurable sub-network partitioning, a dynamically reconfigurable cache coherence protocol, and an online dynamic compute-resource scheduling algorithm, and it extends the system-level multi-core simulation platform gem5 accordingly. Measured results confirm the effectiveness of dynamic reconfiguration in many-core processors: it raises resource utilization and realizes both the dynamically reconfigurable cache coherence protocol and a sub-network partitioning mechanism in which each partition is covered by a single rectangular physical sub-network.

6.
Solving systems of linear equations is widely used in scientific and engineering computing. Based on the capacities of the shared L2 cache and the private L1 caches of a multi-core computer, the augmented matrix of the linear system is partitioned by rows and distributed sensibly across the cache levels, and the cores compute the matrix rows in parallel using multiple threads; a thread-level parallel algorithm for solving an n-order linear system on multi-core computers is thus given. Experimental results show that, compared with the original Gauss-Seidel ... (abstract truncated in the source).
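The abstract is cut off before the experimental comparison, but the row-partitioned parallel scheme it describes can be sketched. Below is a hedged C/OpenMP sketch of one sweep in which rows are split among threads and each row update reads the previous iterate, the standard way a row partition is made safely parallel; the paper's exact Gauss-Seidel variant may differ.

```c
/* One sweep over an n x n system A x = b with rows split among threads.
 * Each update reads x_old (Jacobi across rows), which is what makes the
 * row partition data-race free. */
#include <math.h>
#include <omp.h>

static double sweep(int n, const double *A, const double *b,
                    const double *x_old, double *x_new) {
    double max_diff = 0.0;
    #pragma omp parallel for reduction(max : max_diff)
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < n; j++)
            if (j != i) s -= A[i * n + j] * x_old[j];
        x_new[i] = s / A[i * n + i];
        double d = fabs(x_new[i] - x_old[i]);
        if (d > max_diff) max_diff = d;
    }
    return max_diff;   /* iterate sweeps until this falls below a tolerance */
}
```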

7.
Workload analysis is key preliminary work for designing the last-level cache of a chip multiprocessor. The memory behavior of a set of memory-intensive multithreaded RMS (recognition-mining-synthesis) workloads is analyzed, covering working-set size, data-sharing behavior, and spatial locality; the last-level cache design space is studied, and the cache architecture of future chip multiprocessors is discussed. Experimental results show that a large-capacity DRAM cache helps satisfy the demand these large working sets place on cache capacity: a 128 MB DRAM cache reduces the average L1 cache miss latency by 18% compared with not using one. A shared cache design performs better than a private one: an 8 MB shared cache improves cache performance by 25% over private caches of the same total capacity. A stride-based hardware data prefetching mechanism improves performance by a further 25%. For memory-intensive RMS workloads, a cache subsystem consisting of a 128 MB DRAM cache, an 8 MB on-chip SRAM cache, and an 8-entry stream prefetcher is therefore recommended.
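As context for the stride-based hardware prefetching credited with the 25% gain, here is a minimal C sketch of a PC-indexed stride prefetcher; the table size matches the suggested 8-entry prefetcher, while issue_prefetch() and the confidence rule are hypothetical.

```c
/* Minimal sketch of a PC-indexed stride prefetcher (illustrative). */
#include <stdint.h>

#define TABLE_ENTRIES 8   /* matches the suggested 8-entry prefetcher */

typedef struct {
    uint64_t pc;
    uint64_t last_addr;
    int64_t  stride;
    int      confident;   /* same non-zero stride seen twice in a row */
} stride_entry_t;

static stride_entry_t table_[TABLE_ENTRIES];

extern void issue_prefetch(uint64_t addr);  /* hypothetical cache hook */

static void on_load(uint64_t pc, uint64_t addr) {
    stride_entry_t *e = &table_[pc % TABLE_ENTRIES];
    if (e->pc == pc) {
        int64_t s = (int64_t)(addr - e->last_addr);
        e->confident = (s != 0 && s == e->stride);
        e->stride = s;
        if (e->confident)                  /* stable stride: fetch ahead */
            issue_prefetch(addr + (uint64_t)s);
    } else {                               /* new PC takes over the entry */
        e->pc = pc; e->stride = 0; e->confident = 0;
    }
    e->last_addr = addr;
}
```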

8.
A hierarchical chip multiprocessor tightly couples several cores into "cluster nodes", which supports locality of memory access and on-chip communication well and effectively alleviates the communication overhead of data exchange among on-chip cores. By building a detailed simulator of a hierarchical chip multiprocessor and applying a random task model, this paper studies how cluster-node size affects system performance. The simulations show that, for a given system scale, good system performance requires a careful trade-off between the number of cluster nodes and their size (the number of cores per node).

9.
Heterogeneous multi-core memory access techniques for the Cell Broadband Engine Architecture
The Cell Broadband Engine Architecture (CBEA) multi-core high-performance processor requires software to manage its hierarchical memory explicitly, which hurts the architecture's programmability and performance. A heterogeneous multi-core memory access technique based on the CBEA is proposed. Memory accesses are divided into bulk accesses and on-demand accesses: the on-chip access overhead of bulk-access computations is reduced by placing data buffers appropriately; the off-chip overhead of on-demand accesses is reduced with a software-managed cache supporting coarse-grained accesses and with data prefetching; and programmability is improved by packaging the techniques as a memory access interface library. Experimental results show that the interface library outperforms ALF and CellSs by 30%-50% for bulk accesses; for on-demand accesses, the software-managed cache outperforms the CBE Software Development Kit by 20%-30%, and 4-way data prefetching outperforms single-buffer caching by about 50%.
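A software-managed cache with coarse-grained lines, as used for the on-demand accesses, can be sketched as follows; dma_get() is a hypothetical placeholder for the platform's DMA primitive, not an actual Cell SDK call, and the geometry is illustrative.

```c
/* Sketch of a direct-mapped, software-managed cache with coarse-grained
 * lines held in local store. */
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 1024          /* coarse-grained line */
#define NUM_LINES  32

static uint8_t  lines[NUM_LINES][LINE_BYTES];
static uint64_t tags[NUM_LINES];
static int      valid[NUM_LINES];

extern void dma_get(void *local, uint64_t remote, size_t n); /* hypothetical */

/* Return a local-store pointer for a global address, fetching the whole
 * line on a miss so later nearby accesses hit locally. */
static void *swcache_get(uint64_t gaddr) {
    uint64_t tag  = gaddr / LINE_BYTES;
    unsigned slot = (unsigned)(tag % NUM_LINES);
    if (!valid[slot] || tags[slot] != tag) {
        dma_get(lines[slot], tag * LINE_BYTES, LINE_BYTES);
        tags[slot] = tag;
        valid[slot] = 1;
    }
    return &lines[slot][gaddr % LINE_BYTES];
}
```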

10.
Advances in VLSI technology and application demand have made multi-core technology the mainstream approach to microprocessor design. As a space-time device, the multi-core processor should take the supercomputer as its design reference, and its mainstream architecture will ultimately converge on "small cores, large arrays, hierarchical organization". Using a Xilinx Virtex5-330T FPGA device, a prototype embedded multi-core processor chip integrating 16 cores with a hierarchical architecture was designed and implemented, running at 90 MHz. Exploiting the hierarchical architecture, flexible on-chip interconnect, multiple synchronization mechanisms, and a suitable parallel programming model, the processor successfully ran a real-time video fade-in/fade-out blending application (320x240, 30 frames/s). On this architecture, coarse-grained and fine-grained parallel programming models were compared: the fine-grained model requires slightly more complex multi-core synchronization, but it hides the application's serial portions well and achieves a speedup of 6.97 on the fade-in/fade-out application.
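The fade-in/fade-out kernel itself is a per-pixel cross-blend that splits naturally across cores; the following is a hedged C sketch of one core's share under a simple row partition (frame size from the paper; the synchronization a fine-grained model needs is omitted).

```c
/* Sketch of the fade (cross-blend) kernel with rows partitioned across
 * the 16 cores; illustrative, not the paper's actual code. */
#include <stdint.h>

#define W 320
#define H 240
#define CORES 16

/* Blend one core's slice of rows: out = a*alpha + b*(1-alpha),
 * with alpha in [0, 256] fixed point. */
static void fade_slice(int core_id, const uint8_t *a, const uint8_t *b,
                       uint8_t *out, int alpha) {
    int rows  = H / CORES;            /* 15 rows per core here */
    int start = core_id * rows * W;
    for (int i = start; i < start + rows * W; i++)
        out[i] = (uint8_t)((a[i] * alpha + b[i] * (256 - alpha)) >> 8);
}
```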

11.
N. J. Emery & N. S. Clayton, Nature, 2001, 414(6862): 443-446
Social life has costs associated with competition for resources such as food. Food storing may reduce this competition as the food can be collected quickly and hidden elsewhere; however, it is a risky strategy because caches can be pilfered by others. Scrub jays (Aphelocoma coerulescens) remember 'what', 'where' and 'when' they cached. Like other corvids, they remember where conspecifics have cached, pilfering them when given the opportunity, but may also adjust their own caching strategies to minimize potential pilfering. To test this, jays were allowed to cache either in private (when the other bird's view was obscured) or while a conspecific was watching, and then recover their caches in private. Here we show that jays with prior experience of pilfering another bird's caches subsequently re-cached food in new cache sites during recovery trials, but only when they had been observed caching. Jays without pilfering experience did not, even though they had observed other jays caching. Our results suggest that jays relate information about their previous experience as a pilferer to the possibility of future stealing by another bird, and modify their caching strategy accordingly.

12.
To address the performance degradation of Web servers caused by disk I/O blocking under large data workloads, an application-controlled caching (ACC) method is proposed. Its core: a buffer-tracking module tracks the state of the kernel's file buffers according to the application's file accesses, while a buffer-control module performs buffer replacement and prefetching so that the file buffer keeps enough free space. The server can thus control the file buffers from user space, accurately determine whether a file is buffered, and schedule requests accordingly, raising the I/O parallelism of the processor and the disk. The server can also adopt caching and read-ahead policies suited to its own characteristics, raising the buffer hit rate. As an example, ACC was implemented in the Flash server using the "pyramid selection" buffer algorithm. Experiments show that a Flash server using ACC performs much better under large data workloads: even when the workload slightly exceeds physical memory, server throughput improves by about 24.4%, and when the workload is 2-3 times physical memory, throughput improves 3-4 fold.
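The abstract's key capability, knowing from user space whether a file is in the kernel buffer cache, can be approximated on Linux with the standard mincore() call; the sketch below is a generic illustration of that query, not the paper's ACC modules or its "pyramid selection" algorithm.

```c
/* Query what fraction of a file is resident in the page cache. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Returns the cached fraction in [0,1], or -1.0 on error. */
static double cached_fraction(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1.0;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return -1.0; }

    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED) return -1.0;

    long page = sysconf(_SC_PAGESIZE);
    size_t pages = ((size_t)st.st_size + (size_t)page - 1) / (size_t)page;
    unsigned char vec[4096];            /* handles files up to 16 MB here */
    if (pages > sizeof vec || mincore(map, (size_t)st.st_size, vec) < 0) {
        munmap(map, (size_t)st.st_size);
        return -1.0;
    }
    size_t resident = 0;
    for (size_t i = 0; i < pages; i++)
        resident += vec[i] & 1;         /* low bit = page is resident */
    munmap(map, (size_t)st.st_size);
    return (double)resident / (double)pages;
}

int main(int argc, char **argv) {
    if (argc > 1) printf("%.2f cached\n", cached_fraction(argv[1]));
    return 0;
}
```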

13.
Taking the data-access patterns of object storage systems into account, the caches of the clients and of the metadata server are designed jointly to form a cooperative caching scheme for the storage system. The scheme treats the client and metadata-server caches as a whole so as to raise cache utilization; a cache admission policy selects the data transfer mode appropriately, reducing the communication volume of data transfers; and the caching policy is adjusted dynamically according to object size, access cost, and network load, improving the storage system's quality of service. Experiments show that the cooperative caching scheme adapts well to different workloads and effectively improves the system's I/O performance.

14.
The Web cluster has become a popular network server architecture because of its scalability and cost effectiveness, and the caches configured in its servers can significantly increase performance. In this paper, we discuss suitable configuration strategies for caching dynamic content, based on our experimental results. Since the system itself already supports caching static Web pages (for example, through the operating system's memory cache and the disk's own cache), in some experiments we adopt a pattern that caches only dynamic Web pages, so as to enlarge the cache space. We introduce three different replacement algorithms into our cache proxy module to test the practical effect of caching dynamic pages under different conditions, chiefly analyzing the influence of generation time and access frequency on caching dynamic Web pages. Detailed experimental results and the main conclusions are provided.

15.
Efficiency of Cache Mechanism for Network Processors
With the explosion of network bandwidth and the ever-changing requirements of diverse network-based applications, the traditional processing architectures, i.e., the general-purpose processor (GPP) and application-specific integrated circuits (ASIC), cannot provide sufficient flexibility and high performance at the same time. Thus, the network processor (NP) has emerged as an alternative that meets these dual demands of today's network processing. The NP combines embedded multi-threaded cores with a rich memory hierarchy that can adapt to different networking circumstances when customized by application developers. In today's NP architectures, multithreading prevails over the cache mechanism, which has achieved great success in GPPs at hiding memory access latencies. This paper focuses on the efficiency of the cache mechanism in an NP. Theoretical timing models of packet processing are established for evaluating cache efficiency, and experiments are performed on real-life network backbone traces. The results show that a throughput improvement of nearly 70% can be gained with assistance from the cache mechanism. Accordingly, the cache mechanism is still efficient and irreplaceable in network processing, despite the existence of multithreading.

16.
To address the I/O bottleneck in networked storage, a distributed I/O caching mechanism for networked storage is designed that optimizes I/O performance through a two-level scheme of local and remote caches: the local cache holds read/write information for the local disks, while the remote cache coordinates the local caches of remote machines. A corresponding data-block update algorithm and cache-consistency policy are designed for this mechanism, effectively guaranteeing the performance of the I/O cache.
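The two-level lookup path implied by the abstract can be sketched as follows; every function the sketch calls is a hypothetical placeholder for the paper's local-cache, remote-cache, and update mechanisms.

```c
/* Sketch of a two-level lookup: local cache first, then a peer's cache,
 * and only then the disk. */
#include <stdint.h>

typedef uint64_t block_id_t;

extern int  local_cache_get(block_id_t b, void *buf);   /* 1 = hit */
extern int  remote_cache_get(block_id_t b, void *buf);  /* 1 = hit on a peer */
extern void disk_read(block_id_t b, void *buf);
extern void local_cache_insert(block_id_t b, const void *buf);

static void read_block(block_id_t b, void *buf) {
    if (local_cache_get(b, buf))
        return;                       /* level 1: local memory */
    if (!remote_cache_get(b, buf))    /* level 2: a remote machine's cache */
        disk_read(b, buf);            /* miss everywhere: go to disk */
    local_cache_insert(b, buf);       /* promote for future accesses */
}
```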

17.
Today's multi-core platforms mostly adopt a shared-cache architecture, but cache conflicts between tasks running on different cores make computing a program's worst-case execution time (WCET) very difficult. A method is therefore proposed that uses page coloring to eliminate shared-cache access conflicts between cores. Its advantage is that existing single-core WCET analysis techniques can then be used to bound program execution time on a multi-core. A memory management system supporting page-coloring partitioning was implemented in Linux and tested with a general-purpose benchmark suite. The results show that with this memory management policy, program execution times on the same multi-core platform become predictable.
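The arithmetic behind page coloring is simple enough to show concretely: the cache set-index bits that lie above the page offset define a page's "color", and cores confined to disjoint colors cannot conflict in the shared cache. The cache geometry in the sketch is an illustrative example, not the platform used in the paper.

```c
/* Page-color computation for an example shared cache. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define CACHE_SIZE  (2u * 1024 * 1024)  /* e.g. a 2 MB shared cache */
#define WAYS        8u

/* colors = (per-way footprint) / (page size) */
#define NUM_COLORS  (CACHE_SIZE / WAYS / PAGE_SIZE)   /* = 64 here */

static unsigned page_color(uint64_t phys_addr) {
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}

int main(void) {
    /* Pages of different colors can never evict each other's lines. */
    printf("colors=%u, color(0x10000)=%u, color(0x11000)=%u\n",
           NUM_COLORS, page_color(0x10000), page_color(0x11000));
    return 0;
}
```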

18.
Aiming at the fact that traditional cache replacement strategies lack pertinence to semantic caches in extensible markup language (XML) algebra query processing, a replacement strategy based on the semantic cache contribution value is proposed. First, pattern-matching rules between XML algebra queries and semantic caches are given. Second, a method for calculating the semantic cache contribution value is proposed. Experiments on XML documents of four different sizes show that this strategy supports the XML algebra query environment and achieves better time efficiency than both least frequently used (LFU) and least recently used (LRU) replacement.
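The paper's contribution-value formula is not given here, so the C sketch below only illustrates the shape of such a policy: score every cached query result and evict the minimum. The weighting (hits per byte, decayed by recency) is an assumption.

```c
/* Sketch of contribution-value-based eviction; the scoring formula is a
 * hypothetical example, not the paper's. */
#include <stddef.h>
#include <time.h>

typedef struct {
    double hits;          /* how often this cached result answered a query */
    time_t last_used;
    size_t bytes;
} cache_entry_t;

static double contribution(const cache_entry_t *e, time_t now) {
    double age = (double)(now - e->last_used) + 1.0;
    return e->hits / (age * (double)e->bytes);   /* assumed weighting */
}

/* Pick the entry with the lowest contribution value as the victim. */
static size_t pick_victim(const cache_entry_t *entries, size_t n) {
    time_t now = time(NULL);
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
        if (contribution(&entries[i], now) < contribution(&entries[victim], now))
            victim = i;
    return victim;
}
```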

19.
Very Long Instruction Word (VLIW) architectures are commonly used in application-specific domains because of their parallelism and low-power characteristics. Recently, parameterization of such architectures has allowed runtime adaptation of the issue-width to match the inherent Instruction-Level Parallelism (ILP) of an application. In one implementation of this approach, an issue-width switch dynamically triggers reconfiguration of the data cache at runtime. In this paper, the relationship between cache resizing and issue-width is investigated in detail. We observe that the cache requirement does not always correlate with the issue-width of the VLIW processor. To better coordinate cache resizing with the changing issue-width, we present a novel feedback mechanism that "blocks" low-yield cache resizings when the issue-width changes. In this manner, the feedback cache mechanism works in concert with the issue-width changes, which leads to a noticeable improvement in cache performance. Experiments show an average of 10% energy savings and a 2.3% reduction in cache misses compared with a cache without the feedback mechanism. The feedback mechanism is thus shown to be capable of extracting more benefit from dynamic and frequent reconfiguration.
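The feedback gate can be sketched abstractly: record how much the last resize at each issue-width actually helped, and skip ("block") resizes whose recorded yield is low. Everything below, including the threshold, the hooks, and the placement of the measurement, is a hypothetical illustration rather than the paper's mechanism.

```c
/* Sketch of yield-gated cache resizing on issue-width switches. */
#define ISSUE_WIDTHS 4          /* e.g. 1-, 2-, 4-, 8-issue modes */
#define MIN_YIELD    0.02       /* assumed: resize must cut misses >= 2% */

/* Optimistic start so each width's resize is tried at least once. */
static double yield[ISSUE_WIDTHS] = { 1.0, 1.0, 1.0, 1.0 };

extern void   resize_dcache_for(int width);   /* hypothetical hook */
extern double measure_miss_rate(void);        /* hypothetical counter read */

static void on_issue_width_switch(int width) {
    if (yield[width] >= MIN_YIELD) {          /* feedback gate */
        double before = measure_miss_rate();
        resize_dcache_for(width);
        /* Schematic: in practice the yield would be measured over an
         * interval after the resize, not immediately. */
        yield[width] = before - measure_miss_rate();
    }
    /* else: blocked, keep the current cache configuration */
}
```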

20.
To reduce the dynamic energy of the on-chip multi-core interconnect in embedded systems, an energy-saving method for the multi-core crossbar based on frequently exchanged values is proposed. Exploiting the value locality of on-chip multi-core interconnect traffic, a frequently exchanged value cache (FEVC) is designed, which effectively reduces the dynamic energy of the on-chip crossbar interconnect by reducing both the traffic and the number of bit flips on the interconnect links. The number of values kept in the FEVC was determined experimentally to obtain the best energy savings. Experimental results show that, compared with the original system, using the FEVC alone with only 4 stored data values saves 13% of the energy, and combining it with a bus-invert coding algorithm raises the savings to 20%.
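The sender-side idea of an FEVC can be sketched as follows: when a value to be transferred hits a small table kept consistent at both ends of the link, only a short index crosses the crossbar. The encoding format and table policy are assumptions; only the 4-entry size comes from the abstract.

```c
/* Sketch of sender-side FEVC encoding. */
#include <stdint.h>

#define FEVC_ENTRIES 4   /* the abstract's best result stores 4 values */

typedef struct {
    uint32_t value[FEVC_ENTRIES];
    int      valid[FEVC_ENTRIES];
} fevc_t;

/* Returns 1 and writes a 2-bit index if the value is cached; otherwise
 * returns 0, and the caller transmits the full 32-bit value while both
 * ends install it into their tables. */
static int fevc_encode(fevc_t *c, uint32_t v, uint8_t *index_out) {
    for (int i = 0; i < FEVC_ENTRIES; i++) {
        if (c->valid[i] && c->value[i] == v) {
            *index_out = (uint8_t)i;   /* 2 bits cross the link, not 32 */
            return 1;
        }
    }
    return 0;
}
```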
