【论文阅读】SyncPerf: Categorizing, Detecting, and Diagnosing Synchronization Performance Bugs

2023-03-09 11:32:47 浏览数 (1)

本次是初步写论文记录,以翻译为主,后续会更改为只讲述核心思想。

欢迎访问 Github :https://github.com/MercuryLc/paper_reading

SyncPerf: Categorizing, Detecting, and Diagnosing Synchronization Performance Bugs

本论文讲解了如何检测、诊断锁的争用带来的性能问题。

Abstract

Despite the obvious importance, performance issues related to synchronization primitives are still lacking adequate attention. No literature extensively investigates categories, root causes, and fixing strategies of such performance issues. Existing work primarily focuses on one type of problems, while ignoring other important categories. Moreover, they leave the burden of identifying root causes to programmers.

尽管具有明显的重要性,但与同步原语相关的性能问题仍然缺乏足够的关注。没有文献广泛调查此类性能问题的类别、根本原因和解决策略。 现有的工作主要集中在一种类型的问题上,而忽略了其他重要的类别。 此外,它们将确定根本原因的负担留给了程序员。

This paper first conducts an extensive study of categories, root causes, and fixing strategies of performance issues related to explicit synchronization primitives. Based on this study, we develop two tools to identify root causes of a range of performance issues. Compare with existing work, our proposal, SyncPerf, has three unique advantages. First, SyncPerf’s detection is very lightweight, with 2.3% performance overhead on average.

本文首先对与显式同步原语相关的性能问题的类别、根本原因和修复策略进行了广泛的研究。 基于这项研究,我们开发了两种工具来确定一系列性能问题的根本原因。 与现有工作相比,我们的提案 SyncPerf 具有三个独特的优势。 首先,SyncPerf 的检测非常轻量级,平均有 2.3% 的性能开销。

Second, SyncPerf integrates information based on callsites, lock variables, and types of threads. Such integration helps identify more latent problems. Last but not least, when multiple root causes generate the same behavior, SyncPerf provides a second analysis tool that collects detailed accesses inside critical sections and helps identify possible root causes. SyncPerf discovers many unknown but significant synchronization performance issues. Fixing them provides a performance gain anywhere from 2.5% to 42%. Low overhead, better coverage, and informative reports make SyncPerf an effective tool to find synchronization performance bugs in the production environment.

其次,SyncPerf 集成了基于调用点、锁变量和线程类型的信息。 这种集成有助于识别更多潜在问题。 最后但同样重要的是,当多个根本原因产生相同的行为时,SyncPerf 提供了第二个分析工具,该工具收集关键部分内的详细访问并帮助识别可能的根本原因。 SyncPerf 发现了许多未知但重要的同步性能问题。 修复它们可以提高 2.5% 到 42% 的性能。 低开销、更好的覆盖范围和信息丰富的报告使 SyncPerf 成为在生产环境中查找同步性能错误的有效工具。

1. Introduction

Designing efficient multithreaded programs while maintaining their correctness is not an easy task. Performance issues of multithreaded programs, despite being the primary cause of more than 22% synchronization fixes of server programs [23], get less attention than they deserve. There are many prior works [8, 19, 21, 23, 28, 41, 43] in this domain, but none of them systematically investigates performance issues related to different types of synchronization primitives. Most existing works cannot identify root causes, and provide helpful fixing strategies.

设计高效的多线程程序同时保持其正确性并非易事。 多线程程序的性能问题,尽管是超过 22% 的服务器程序同步修复的主要原因 [23],但得到的关注却比他们应得的要少。 在这个领域有许多先前的工作 [8, 19, 21, 23, 28, 41, 43],但没有一个系统地研究与不同类型的同步原语相关的性能问题。 大多数现有作品无法确定根本原因,并提供有用的修复策略。

This paper studies various categories, root causes, and fixing strategies of performance issues related to different synchronization primitives such as locks, conditional variables, and barriers. Lock/wait-free techniques and other mechanisms (such as transactional memory [20]) are not covered in this paper. The study divides synchronization related performance issues into five categories: improper primitives, improper granularity, over-synchronization, asymmetric contention, and load imbalance (shown in Table 1). The first four categories are related to various locks, whereas the last one is related to other synchronizations such as conditional variables and barriers.

本文研究了与锁、条件变量和屏障等不同同步原语相关的性能问题的各种类别、根本原因和修复策略。 本文不涉及无锁/无等待技术和其他机制(如事务内存 [20])。 该研究将与同步相关的性能问题分为五类:不适当的原语、不适当的粒度、过度同步、不对称争用和负载不平衡(如表 1 所示)。 前四类与各种锁有关,而最后一类与条件变量和屏障等其他同步有关。

The study shows that the same symptom can be caused by multiple root causes. For example, high contention of locks can occur due to too many data items under the same lock, too-large critical sections, over-synchronization, or asymmetric lock contention (more details in Section 2). Without knowing the root cause, it is difficult for programmers to fix these bugs effectively. The study also shows that different categories of problems may have different symptoms and thus, different solutions. Finally, the study presents some ideas for identifying and fixing these performance issues. The study not only helps users to identify and fix synchronization performance issues, but also enables future research in this domain.

研究表明,同一症状可能由多种根本原因引起。 例如,由于同一锁下的数据项过多、临界区过大、过度同步或非对称锁争用(第 2 节中的更多详细信息),可能会发生锁的高争用。 在不知道根本原因的情况下,程序员很难有效地修复这些错误。 该研究还表明,不同类别的问题可能有不同的症状,因此可能有不同的解决方案。 最后,该研究提出了一些用于识别和解决这些性能问题的想法。 该研究不仅可以帮助用户识别和修复同步性能问题,还可以促进该领域的未来研究。

Prior work [8, 19, 21, 23, 28, 41, 43] focuses excessively on locks that are both acquired frequently and highly contended. Our first observation is that performance problems can also occur with locks that are not excessively acquired or highly contended. This is shown in Figure 1. Existing work focuses on quadrant 2 or Q2. Locks of Q2 can definitely cause performance issues but they are not the only culprits.

先前的工作 [8, 19, 21, 23, 28, 41, 43] 过分关注频繁获取且竞争激烈的锁。 我们的第一个观察是,性能问题也可能发生在没有过度获取或高度竞争的锁上。 这如图 1 所示。现有工作集中在象限 2 或 Q2。 Q2 的锁肯定会导致性能问题,但它们并不是唯一的罪魁祸首。

SyncPerf finds potential problems with the other two quadrants: (i) locks that are not acquired many times may slow down a program if the critical sections are large and potentially introduce high contention and/or a long waiting time (Q1); (ii) locks that are acquired excessively may cause significant performance problems, even if they are barely contended (Q4). Intuitively, locks of Q3 (lowly contended and not acquired many times) will not cause performance problems. Our second observation is that it is not always sufficient to identify root causes of a problem based on the behavior of a single synchronization. For example, for asymmetric contention where different locks are protecting similar data with different contention rates, we have to analyze the behavior of all those locks that typically have the same initialization and acquisition sites. By checking all of those locks together, we can notice that some locks may have higher contention and acquisition than others.

SyncPerf 发现其他两个象限的潜在问题:(i)如果关键部分很大并且可能引入高争用和/或长等待时间(Q1),则未多次获取的锁可能会减慢程序速度; (ii) 过度获取的锁可能会导致严重的性能问题,即使它们几乎没有竞争(Q4)。 直观上,Q3 的锁(低竞争且未多次获取)不会导致性能问题。 我们的第二个观察结果是,根据单个同步的行为来确定问题的根本原因并不总是足够的。 例如,对于不同的锁以不同的争用率保护相似数据的非对称争用,我们必须分析所有通常具有相同初始化和获取站点的锁的行为。 通过检查所有这些锁,我们可以注意到某些锁可能比其他锁具有更高的争用和获取。

  • 在对调用栈中锁的申请中,可以分析得到申请锁的占用率。

Driven by these two intuitive but novel observations, we develop SyncPerf that not only reports the callsites of performance issues, but also helps diagnose root causes and suggests possible fixes for a range of performance issues related to synchronization primitives. Most existing works [8, 28, 41] just report the callsites of the performance issues (mostly high contention locks), while leaving the burden of analyzing root causes (and finding possible fixes) to programmers. The only work similar to ours was proposed by Yu et al. [43]. However, SyncPerf excels it by having a better detection ability (thanks to the novel observations), a broader scope, and much lower overhead (Section 6).

在这两个直观但新颖的观察结果的驱动下,我们开发了 SyncPerf,它不仅报告性能问题的调用点,还有助于诊断根本原因,并针对与同步原语相关的一系列性能问题提出可能的修复建议。 大多数现有工作 [8, 28, 41] 只报告性能问题的调用点(主要是高争用锁),而将分析根本原因(并找到可能的修复)的负担留给程序员。 Yu等人提出了唯一与我们类似的工作。 [43]。 然而,SyncPerf 的优势在于具有更好的检测能力(感谢新的观察)、更广泛的范围和更低的开销(第 6 节)。

SyncPerf starts by monitoring the execution of an application and collecting information about explicit synchronization primitives. More specifically, it collects (i) for a lock, how many times it is acquired, how many times it is found to be contended, and how long a thread waits for the lock, (ii) for a try-lock, how many times it is called and how many times it fails because of contention, and finally (iii) for load imbalance, how long different threads execute, and how long they are waiting for synchronizations. SyncPerf also collects callsites for each synchronization operation and thread creation function to help pinpoint the actual problems. After this, SyncPerf integrates and checks the collected information to identify root causes: (i) it checks behavior of all locks with the same callsites to identify asymmetric contention issue, (ii) it computes and compares waiting time of different threads to identify load imbalance issue, and (iii) it checks individual as well as collective (based on callsites) information of locks (i.e., the number of acquisitions and number of times they are contended) to identify other performance issues. This integration is very important, and helps uncover more performance issues. SyncPerf is able to find more performance issues than any prior work (Table 2). For some of the problems, such as asymmetric contention, and load imbalance, SyncPerf’s detection tool automatically reports root causes. It also presents an optimal task assignment to solve load imbalance problems. For other problems, SyncPerf provides sufficient information as well as an informal guideline to diagnose them manually. SyncPerf also provides an additional optional tool (that programmers can use offline) to help the diagnosis process.

SyncPerf 首先监视应用程序的执行并收集有关显式同步原语的信息。更具体地说,它收集(i)一个锁,它被获取多少次,发现它被竞争多少次,以及线程等待锁的时间,(ii)一个try-lock,有多少它被调用的次数以及由于争用而失败的次数,最后(iii)负载不平衡,不同线程执行多长时间,以及它们等待同步多长时间。

SyncPerf 还收集每个同步操作和线程创建函数的调用点,以帮助查明实际问题。在此之后,SyncPerf 集成并检查收集的信息以识别根本原因:(i) 它检查具有相同调用点的所有锁的行为以识别不对称争用问题,(ii) 它计算并比较不同线程的等待时间以识别负载不平衡问题,以及 (iii) 它检查锁的单个和集体(基于调用站点)信息(即,获取次数和争用次数)以识别其他性能问题。这种集成非常重要,有助于发现更多的性能问题。

SyncPerf 能够发现比之前任何工作都更多的性能问题(表 2)。对于一些问题,例如不对称争用和负载不平衡,SyncPerf 的检测工具会自动报告根本原因。它还提出了解决负载不平衡问题的最佳任务分配。对于其他问题,SyncPerf 提供了足够的信息以及手动诊断它们的非正式指南。 SyncPerf 还提供了一个额外的可选工具(程序员可以离线使用)来帮助诊断过程。

contribution

This paper provides a taxonomy of categories, root causes, and fixing strategies of performance bugs related to explicit synchronization primitives. The taxonomy is useful not only to identify and fix synchronization performance problems but also to enable future research in this field.

本文提供了与显式同步原语相关的性能错误的类别、根本原因和修复策略的分类。 该分类法不仅有助于识别和修复同步性能问题,而且有助于该领域的未来研究。

SyncPerf uses an intuitive observation that performance problems may occur even when locks are not frequently acquired or highly contended. There is no existing work that actually uses this observation. Due to this observation, SyncPerf finds many previously unknown performance issues in widely used applications.

SyncPerf 使用直观的观察结果,即即使不经常获取或高度争用锁,也可能出现性能问题。 没有实际使用这种观察的现有工作。 由于这一观察,SyncPerf 在广泛使用的应用程序中发现了许多以前未知的性能问题。

SyncPerf makes a novel observation that it is hard to detect problems such as asymmetric contention and load imbalance by observing the behavior of a single synchronization. To solve this problem, SyncPerf proposes to integrate information based on callsites of lock acquisitions (and initializations), lock variables, and types of threads. This integration also contributes to the detection of some unknown issues.

SyncPerf 提出了一个新颖的观察,即通过观察单个同步的行为很难检测到不对称争用和负载不平衡等问题。 为了解决这个问题,SyncPerf 建议整合基于锁获取(和初始化)的调用点、锁变量和线程类型的信息。 这种集成还有助于检测一些未知问题。

Finally, SyncPerf provides two tools that help diagnose root causes of performance bugs. The first one is a detection tool that can report susceptible callsites and synchronization variables with potential performance issues, and identify some root causes such as asymmetric contention and load imbalance. This tool has extremely low overhead (only 2.3%, on average). The tool achieves such low overhead even without using the sampling mechanism. The low overhead makes the tool a good candidate for the deployment environment. When multiple root causes may lead to the same behavior and thus, cannot be diagnosed easily, SyncPerf provides a heavyweight diagnosis tool that collects detailed accesses inside susceptible critical sections to ease the diagnosis process. Both of these tools are software-only tools that do not require any modification or recompilation of applications, and custom operating system or hardware support.

最后,SyncPerf 提供了两个工具来帮助诊断性能错误的根本原因。 第一个是一种检测工具,可以报告具有潜在性能问题的易受攻击的调用点和同步变量,并确定一些根本原因,例如不对称争用和负载不平衡。 该工具的开销极低(平均仅为 2.3%)。 即使不使用采样机制,该工具也能实现如此低的开销。 低开销使该工具成为部署环境的理想选择。 当多个根本原因可能导致相同的行为并因此无法轻松诊断时,SyncPerf 提供了一个重量级的诊断工具,该工具收集了易受攻击的关键部分内的详细访问,以简化诊断过程。 这两种工具都是纯软件工具,不需要对应用程序进行任何修改或重新编译,也不需要自定义操作系统或硬件支持。

2. Overview

2.1 Categorization

2.1.1 Improper Primitvies

Programmers may use a variety of synchronization primitives (e.g., atomic instructions, spin locks, try-locks, read- /write locks, mutex locks etc.) to protect shared accesses. These primitives impose different runtime overhead, increasing from atomic instructions to mutex locks. The spin lock of pthread library, for example, incurs 50% less overhead than the mutex lock when there is no contention. However, during high contention, the spin lock may waste CPU cycles unnecessarily [1, 30].

程序员可以使用各种同步原语(例如,原子指令、自旋锁、尝试锁、读/写锁、互斥锁等)来保护共享访问。 这些原语施加了不同的运行时开销,从原子指令增加到互斥锁。 例如,pthread 库的自旋锁在没有争用时比互斥锁产生的开销少 50%。 然而,在高竞争期间,自旋锁可能会不必要地浪费 CPU 周期 [1, 30]。


调用的锁是同一种琐时,如何分辨调用的锁保护的内容是否是同一块呢?

比如 A 和 B 都有申请 lock ,计算机如何判断 A 和 B 申请的 lock 不是同一块区域呢?


Different synchronization primitives have different use cases. Atomic instructions are best suited to perform simple integer operations (e.g., read-modify-write, addition, subtraction, exchange etc.) on shared variables [9, 34]. Spin locks are effective for small critical sections that have very few instructions but cannot be finished using a single atomic instruction [1, 30]. Read/write locks are useful for readmostly critical sections [26, 32]. Try-locks allow a program to pursue an alternative path when locks are not available [38]. Finally, mutex locks are used when the critical sections contain waiting operations (e.g., conditional wait) and have multiple shared accesses. Any deviation from the preferred use cases may result in performance issues.

不同的同步原语有不同的用例。 原子指令最适合对共享变量执行简单的整数操作(例如,读-修改-写、加法、减法、交换等)[9, 34]。 自旋锁对于指令很少但无法使用单个原子指令完成的小型临界区非常有效 [1, 30]。 读/写锁对于只读关键部分很有用 [26, 32]。 当锁不可用时,尝试锁允许程序寻求替代路径[38]。 最后,当关键部分包含等待操作(例如,条件等待)并具有多个共享访问时,使用互斥锁。 与首选用例的任何偏差都可能导致性能问题。

Improper primitives (usually in Q2 and Q4) typically cause extensive try-lock failures or extensive lock acquisitions, but low to moderate contention. Extensive try-lock failures, where a try-lock fails immediately because the lock is held by another thread, indicate that we should use a blocking method that combines conditional variables with mutexes to avoid continuous trial. Extensive lock acquisitions may incur significant performance degradation even without high contention. The issue of improper primitives is ignored by existing work [8, 19, 23, 28, 41]. However, its importance can be seen from facesim application of PARSEC [3] where changing mutex locks to atomic instructions boosts performance by up to 30.7% (Table 2).

不正确的原语(通常在 Q2 和 Q4 中)通常会导致大量的 try-lock 失败或大量的锁获取,但争用程度较低到中等。 广泛的 try-lock 失败,其中 try-lock 由于锁被另一个线程持有而立即失败,这表明我们应该使用将条件变量与互斥锁相结合的阻塞方法,以避免连续试用。 即使没有高争用,广泛的锁获取也可能导致性能显着下降。 现有工作[8,19,23,28,41]忽略了不正确原语的问题。 然而,它的重要性可以从 PARSEC [3] 的 facesim 应用中看出,其中将互斥锁更改为原子指令可将性能提高多达 30.7%(表 2)。

2.1.2 Improper Granularity

Significant performance degradation may occur when locks are not used with a proper granularity. There are several cases listed as follows.

  1. If a lock protects too many data items (e.g., an entire hash table, as in the memcached-II bug of Table 2), the lock may introduce a lot of contention. Splitting a coarsegrained lock into multiple fine-grained locks helps improve performance.
  2. If a lock protects a large critical section with many instructions, it may cause high contention and thus, a significant slowdown. canneal of PARSEC, for example, has a critical section that includes a random number generator. Only few instructions inside the critical section access the shared data. Although the number of acquisitions is only 15, performance is boosted by around 4% when we move the random generator outside the critical section.
  3. If a critical section has very few instructions, then the overhead of lock acquisitions and releases may exceed the overhead of actual computations inside. In that case, the program can suffer from performance degradation [14]. One possible solution is to merge multiple locks into a single coarse-grained one.

当未以适当的粒度使用锁时,可能会出现显着的性能下降。有以下几种情况。

  1. 如果锁保护了太多的数据项(例如,整个哈希表,如表 2 的 memcached-II 错误),锁可能会引入很多争用。将粗粒度锁拆分为多个细粒度锁有助于提高性能。
  2. 如果一个锁用许多指令保护一个大的临界区,它可能会导致高争用,从而显着减速。例如,PARSEC 的 canneal 有一个关键部分,其中包括一个随机数生成器。临界区中只有很少的指令访问共享数据。虽然采集次数只有 15 次,但当我们将随机生成器移到临界区之外时,性能提升了约 4%。
  3. 如果临界区的指令很少,那么锁的获取和释放的开销可能会超过内部实际计算的开销。在这种情况下,程序可能会遭受性能下降 [14]。一种可能的解决方案是将多个锁合并为一个粗粒度锁。

Identification: Locks in the first two cases may incur significant contention. However, without knowing the memory accesses inside the critical section, it is hard to identify this type of problems manually. Therefore, SyncPerf provides an additional diagnosis tool that tracks all memory accesses protected by a specific lock. Programmers can use the tool offline after some potential problems have been identified by the detection tool. With the collected information, we can easily differentiate between the first two cases as described in Table 1. It is relatively hard to identify the third case.

识别:前两种情况下的锁可能会引起严重的争用。 但是,在不知道临界区内部的内存访问的情况下,很难手动识别此类问题。 因此,SyncPerf 提供了一个额外的诊断工具,可以跟踪受特定锁保护的所有内存访问。 在检测工具识别出一些潜在问题后,程序员可以离线使用该工具。 通过收集到的信息,我们可以轻松区分前两种情况,如表 1 所述。识别第三种情况相对困难。

2.1.3 Over-synchronization

过度同步

Over-synchronization indicates a situation where a synchronization becomes unnecessary because the computations do not require any protection or they are already protected by other synchronizations. This term is borrowed from existing work [23]. There are the following cases.

过度同步表示不需要同步的情况,因为计算不需要任何保护,或者它们已经受到其他同步的保护。 这个术语是从现有的工作中借用的[23]。 有以下几种情况。

1.A lock is unnecessary if a critical section only accesses the local data, but not the shared data. 2. A lock is unnecessary if the protected computations are already atomic. 3. A lock is unnecessary if another lock already protects the computations. MySQL-5.1 is known to have such a problem [7, 23], which utilizes the random() routine to determine the spin waiting time inside a mutex. Unfortunately, this routine has an internal lock that unnecessarily serialize every thread invoking this random() routine. The problem has been fixed by using a different random number generator that does not have any internal lock for the fastmutex.

1.如果临界区只访问本地数据,不访问共享数据,则不需要加锁。 2. 如果受保护的计算已经是原子的,则不需要锁定。 3. 如果另一个锁已经保护了计算,则不需要锁。

MySQL-5.1 已知有这样一个问题 [7, 23],它利用 random() 例程来确定互斥锁内的自旋等待时间。 不幸的是,这个例程有一个内部锁,它不必要地序列化每个调用这个 random() 例程的线程。 该问题已通过使用不同的随机数生成器解决,该生成器没有任何用于快速互斥锁的内部锁。

Identification: Over-synchronization problems can cause a significant slowdown when there are extensive lock acquisitions. This situation is similar to the first two categories of improper granularity issue. Therefore, our diagnosis tool (described in Section 3.2) may help analyze this situation. After a problem is identified, unnecessary locks can be removed to improve performance. However, removing locks may introduce correctness issue, and has to be done cautiously.

识别:当有大量锁获取时,过度同步问题可能会导致显着减速。 这种情况类似于前两类粒度不当的问题。 因此,我们的诊断工具(在第 3.2 节中描述)可能有助于分析这种情况。 确定问题后,可以删除不必要的锁以提高性能。 但是,删除锁可能会引入正确性问题,必须谨慎操作。

2.1.4 Asymmetric Contention

Asymmetric contention occurs when some locks have significantly more contention than others that protect similar data. This category is derived from “asymmetric lock” [10]. For instance, a hash table implementation may use bucket-wise locks. If the hash function fails to distribute the accesses uniformly, some buckets will be accessed more frequently than the others. Consequently, locks of those buckets will have more contention than the others. Coz [10] finds such a problem in dedup. Changing the hash function improves performance by around 12%.

当某些锁的争用明显多于保护类似数据的其他锁时,就会发生非对称争用。 该类别源自“非对称锁”[10]。 例如,哈希表实现可以使用桶式锁。 如果哈希函数不能均匀地分配访问,一些桶将比其他桶更频繁地被访问。 因此,这些桶的锁将比其他桶有更多的争用。 Coz [10] 在 dedup 中发现了这样一个问题。 更改散列函数可将性能提高约 12%。


非对称竞争原因在于锁的争用不能均匀,但什么情况下会出现使用 Hash 来散列锁呢?


Identification: To identify this type of problems, SyncPerf collects the number of lock acquisitions, how many times each lock is found to be unavailable, and their callsites. If multiple locks are protecting similar data (typically identified by the same callsites of lock acquisitions and releases), SyncPerf checks the lock contention rate and the number of acquisitions of these locks. When an asymmetric contention rate is found (e.g., when the highest contention rate is 2 × or more than the lowest one), SyncPerf reports an asymmetric contention problem. Asymmetric contention problem is reported automatically without any manual effort. Programmers, then, can fix the problem by evenly distributing the contention. Unlike SyncPerf, Coz relies on manual inspection to identify this type of problems.

识别:为了识别这类问题,SyncPerf 会收集锁的获取次数、每个锁被发现不可用的次数以及它们的调用点。 如果多个锁正在保护相似的数据(通常由相同的锁获取和释放调用点标识),SyncPerf 会检查锁争用率和这些锁的获取次数。 当发现非对称争用率时(例如,当最高争用率是最低争用率的 2 倍或更多时),SyncPerf 报告非对称争用问题。 非对称争用问题会自动报告,无需任何手动操作。 然后,程序员可以通过平均分配争用来解决问题。 与 SyncPerf 不同,Coz 依靠人工检查来识别此类问题。

2.1.5 Load Imbalance

A thread can wait due to synchronizations such as mutex locks, conditional variables, barriers, semaphores etc. A parent thread can also wait when it tries to join with the children threads. If a group of threads (i.e., threads with the same thread function) is found to have a waiting period much longer than that of other groups of threads, this may indicate a performance issue caused by load imbalance [12, 25, 33, 40].

由于互斥锁、条件变量、屏障、信号量等同步,线程可以等待。父线程在尝试加入子线程时也可以等待。 如果发现一组线程(即具有相同线程功能的线程)的等待时间比其他组线程的等待时间长得多,这可能表明负载不平衡导致的性能问题 [12,25,33,40 ]。

Identification: To identify load imbalance problems, it collects the execution and waiting time of different threads by intercepting thread creations and synchronization functions. If the waiting time or computation time of different threads are substantially different (e.g., outside a certain range, say 20%), the program can be identified as having a load imbalance problem.

识别:为了识别负载不平衡问题,它通过拦截线程创建和同步函数来收集不同线程的执行和等待时间。 如果不同线程的等待时间或计算时间有很大差异(例如,超出一定范围,例如 20%),则可以将程序识别为存在负载不平衡问题。

Finding an optimal task assignment: SyncPerf can suggest an optimal task assignment for load imbalance problems after the identification. It calculates the computation time of every thread by subtracting all waiting time (on conditional variables, mutex locks, and barriers) from their execution time. It then computes the total computation time of different groups of threads according to their thread functions, where threads executing the same function belong to the same group. In the end, SyncPerf suggests an optimal task distribution – each group of threads will be assigned an optimal number of threads that is proportional to the total workload of that type. Section 4.4.5 presents some examples showing how SyncPerf can suggest an optimal configuration for different types of threads to fix the load imbalance problems.

寻找最佳任务分配:SyncPerf 可以在识别后针对负载不平衡问题提出最佳任务分配。 它通过从执行时间中减去所有等待时间(条件变量、互斥锁和障碍)来计算每个线程的计算时间。 然后根据线程函数计算不同组线程的总计算时间,其中执行相同函数的线程属于同一组。 最后,SyncPerf 建议了一个最佳的任务分配——每组线程将被分配一个与该类型的总工作量成正比的最佳线程数。 第 4.4.5 节提供了一些示例,展示了 SyncPerf 如何为不同类型的线程建议最佳配置以解决负载不平衡问题。

2.2 Workflow

The high level workflow of SyncPerf is shown as Figure 2. For mutex locks, SyncPerf reports locks inside 3 quadrants (Q1, Q2, and Q4 of Figure 1), while skipping Q3 locks that do not cause performance issues. Additionally, it reports try-lock failure rates and whether there is a load imbalance problem among different types of threads. For the load imbalance problem, SyncPerf not only reports the root cause but also suggests an optimal configuration for different types of threads. This is done without any manual intervention. Programmers can use the suggested distribution to fix the load imbalance problem. If there is an asymmetric contention problem among similar locks, the tool automatically identifies the root cause. However, it is up to the programmer to develop a possible fix.

SyncPerf 的高级工作流程如图 2 所示。对于互斥锁,SyncPerf 报告 3 个象限内的锁(图 1 的 Q1、Q2 和 Q4),同时跳过不会导致性能问题的 Q3 锁。 此外,它还报告尝试锁定失败率以及不同类型线程之间是否存在负载不平衡问题。 对于负载不平衡问题,SyncPerf 不仅会报告根本原因,还会针对不同类型的线程提出最佳配置建议。 这是在没有任何人工干预的情况下完成的。 程序员可以使用建议的分布来解决负载不平衡问题。 如果类似锁之间存在不对称争用问题,该工具会自动识别根本原因。 但是,由程序员开发可能的修复程序。

After getting the behavior of locks in 3 quadrants, if the reported code segments are simple, programmers can easily inspect them manually to determine which category a problem belongs to and take corresponding actions. This can be as simple as consulting Table 1. For complex situation, our additional diagnosis tool can collect detailed information for critical sections that reported by our detection tool, in order to help programmers determine the particular type of performance issues. Again, Table 1 can be used as an informal guideline during the categorization process. After determining the type of performance bugs, Table 1 can guide programmers to develop a fix for the bug. Some of the fixing strategies (e.g., fixing of over-synchronization problem) might require programmers to carefully consider correctness issues.

得到三象限锁的行为后,如果上报的代码段很简单,程序员可以很容易地手动检查,判断问题属于哪一类,并采取相应的行动。 这可以像参考表 1 一样简单。对于复杂的情况,我们的附加诊断工具可以收集我们的检测工具报告的关键部分的详细信息,以帮助程序员确定特定类型的性能问题。 同样,表 1 可用作分类过程中的非正式指南。 在确定性能错误的类型后,表 1 可以指导程序员针对该错误开发修复程序。 一些修复策略(例如,修复过度同步问题)可能需要程序员仔细考虑正确性问题。

Implementation Details

SyncPerf provides two tools to assist programmers in identifying bugs and fixing them: a detection tool and a diagnosis tool. By combining these two tools, SyncPerf not only answers “what” and “where” questions, but also “why” and “how to fix” (partially) questions for most synchronization related performance bugs.

SyncPerf 提供了两种工具来帮助程序员识别和修复错误:检测工具和诊断工具。 通过结合这两个工具,SyncPerf 不仅回答了“什么”和“在哪里”的问题,而且还回答了大多数与同步相关的性能错误的“为什么”和“如何修复”(部分)问题。

The detection tool uses a lightweight profiling scheme to detect synchronizations with potential performance issues. It can also diagnose the root causes for asymmetric contention, extensive try-lock failures, and load imbalance problems without any manual intervention. The detection tool achieves a lower performance overhead than existing tools (even without using the sampling mechanism) [41]. Details of this tool are presented in Section 3.1. The diagnosis tool is based on Pin [29], a binary instrumentation tool. The diagnosis tool monitors memory accesses inside specific critical sections to help identify root causes of problems with the same behavior. This heavyweight diagnosis tool is only employed when the detection tool reports some potential problems that cannot be diagnosed easily. It utilizes prior knowledge of the particular problems that are reported by the detection tool, and thus, instruments memory accesses inside the relevant critical sections only. Its overhead is about 6 × lower than the existing work that instruments all memory accesses [43].

检测工具使用轻量级分析方案来检测具有潜在性能问题的同步。它还可以诊断不对称争用、大量尝试锁定失败和负载不平衡问题的根本原因,而无需任何人工干预。检测工具实现了比现有工具更低的性能开销(即使不使用采样机制)[41]。该工具的详细信息在第 3.1 节中介绍。诊断工具基于 Pin [29],一种二进制检测工具。诊断工具监视特定关键部分内的内存访问,以帮助确定具有相同行为的问题的根本原因。这种重量级的诊断工具仅在检测工具报告一些不易诊断的潜在问题时才使用。它利用检测工具报告的特定问题的先验知识,因此,仅在相关关键部分内检测内存访问。它的开销比现有的检测所有内存访问的工作低约 6 倍 [43]。

3.1 Detection Tool

The challenge of SyncPerf is to collect data effificiently and analyze them effectively.

SyncPerf面临的挑战是有效地收集数据和分析它们。

3.1.1 Collecting Data Efiiciently

To collect the data, SyncPerf intercepts pthread’s different types of explicit synchronization primitives, such as mutex locks, try-locks, conditional variables, barriers, and thread creation and exit functions, where the actual implementation is borrowed from the pthread library. This is similar to existing work [41]. However, SyncPerf outperforms them with a lower overhead and better detection ability.

为了收集数据,SyncPerf 截获 pthread 的不同类型的显式同步原语,例如互斥锁、try-locks、条件变量、屏障以及线程创建和退出函数,其中实际实现是从 pthread 库中借用的。 这类似于现有的工作[41]。 但是,SyncPerf 以更低的开销和更好的检测能力胜过它们。


下面这一实现很重要。后续需要对比源码是如何进行操作的。

RDTSC timer


SyncPerf intercepts pthread create function calls and passes a custom function to the actual pthread create function. This custom function calls the actual start routine function, and collects timestamps of thread starting and exiting using RDTSC timer [22]. The timestamps are saved into a thread wrapper as shown in Figure 3(b).

SyncPerf 拦截 pthread create 函数调用并将自定义函数传递给实际的 pthread create 函数。 此自定义函数调用实际的启动例程函数,并使用 RDTSC 计时器 [22] 收集线程启动和退出的时间戳。 时间戳保存到线程包装器中,如图 3(b) 所示。

SyncPerf utilizes the following mechanisms to achieve the extremely low overhead.

SyncPerf 利用以下机制来实现极低的开销。

Indirection and per-thread data: To collect data for mutex locks, a possible approach (used by existing work [41]) is to store the actual profiling data for each mutex lock in a global hash table. Upon every mutex invocation, we can lookup the hash table to find the pointer to the actual data, and then update it correspondingly. However, this approach introduces significant overhead due to the hash table lookup (and possible lock protection) on every synchronization operation, and the possible cache coherence messages to update the shared data (true/false sharing effect) [16, 21]. This is especially problematic when there is a significant number of acquisitions.

间接和每线程数据:为了收集互斥锁的数据,一种可能的方法(现有工作 [41] 使用)是将每个互斥锁的实际分析数据存储在全局哈希表中。 在每次互斥体调用时,我们可以查找哈希表以找到指向实际数据的指针,然后相应地更新它。 然而,由于每次同步操作上的哈希表查找(和可能的锁保护),以及更新共享数据的可能缓存一致性消息(真/假共享效应)[16, 21],这种方法引入了显着的开销。 当有大量 acquisitions 时,这尤其成问题。


Instead, SyncPerf uses a level of indirection to avoid the lookup overhead, and a per-thread data structure to avoid the cache coherence traffic. The data structure is shown in Figure 3(a).

For every mutex, SyncPerf allocates a shadow mutex t object and uses the first word of the original mutex t object as a pointer to this shadow object. The shadow mutex structure contains a real mutex t object, an index for this mutex object, and some other data. The index is initialized during the initialization of the mutex, or during the first lock acquisition if the mutex is not explicitly initialized. This index is used to find an entry in the global Mutex Data Table, where each thread has a thread-wise entry. When a thread operates on a mutex lock, say Li, SyncPerf obtains the shadow mutex t object by checking the first word of the original mutex t object, and then finds its corresponding thread-wise entry using the index value. After that, the lock related data can be stored in its thread-wise entry, without generating any cache coherence message. Furthermore, SyncPerf prevents the false sharing effect by carefully keeping read-mostly data in shadow mutex t object and padding them properly [4, 27], while the actual profiling data (that keeps changing) is stored in thread-wise entries. The thread-wise data is collected and integrated in the reporting phase (Section 3.1.2).

相反,SyncPerf 使用间接级别来避免查找开销,并使用每线程数据结构来避免缓存一致性流量。数据结构如图 3(a) 所示。对于每个 mutex,SyncPerf 分配一个 shadow_mutex_t 对象,并使用原始互斥体 t 对象的第一个字作为指向该影子对象的指针。影子互斥体结构包含一个真正的互斥体对象、这个互斥体对象的索引和一些其他数据。索引在互斥锁的初始化期间被初始化,或者如果互斥锁没有显式初始化,则在第一次锁获取期间被初始化。该索引用于在全局互斥数据表中查找条目,其中每个线程都有一个线程级条目。

当一个线程对一个互斥锁进行操作时,比如 Li,SyncPerf 通过检查原始互斥锁 t 对象的第一个字来获得影子互斥锁 t 对象,然后使用索引值找到其对应的逐线程条目。之后,与锁相关的数据可以存储在其线程条目中,而无需生成任何缓存一致性消息。此外,SyncPerf 通过小心地将只读数据保存在 shadow mutex t 对象中并正确填充它们来防止错误共享效应 [4, 27],而实际的分析数据(不断变化)存储在线程条目中。在报告阶段收集和集成线程数据(第 3.1.2 节)。


Fast collection of callsites: SyncPerf collects callsite information of every synchronization operation to provide exact source code location of performance bugs. It is crucial to minimize the overhead of collecting callsites, especially when there is a large number of synchronization operations.

SyncPerf makes three design choices to reduce the overhead. First, SyncPerf avoids the use of backtrace API of glibc, which is extremely slow due to its heavyweight instruction analysis. Instead of using backtrace, SyncPerf analyzes frame pointers to obtain call stacks efficiently. However, this can impose a limitation that SyncPerf cannot collect callsite information for programs without frame pointers.

Second, SyncPerf collects call stacks up to the depth of 5. We limit the depth because deeper stacks introduce more overhead without any significant benefit.

调用点的快速收集:SyncPerf 收集每个同步操作的调用点信息,以提供性能错误的确切源代码位置。最小化收集调用点的开销至关重要,尤其是在有大量同步操作时。

SyncPerf 做了三个设计选择来减少开销。首先,SyncPerf 避免了使用 glibc 的 backtrace API,由于其重量级的指令分析,该 API 非常慢。 SyncPerf 不使用回溯,**而是分析帧指针以有效地获取调用堆栈。**但是,这可能会造成 SyncPerf 无法为没有帧指针的程序收集调用站点信息的限制。

其次,SyncPerf 收集深度为 5 的调用堆栈。我们限制了深度,因为更深的堆栈会引入更多的开销,而没有任何显着的好处。

Third, SyncPerf avoids collecting already-existing callsites. Obtaining the callsite of a synchronization and comparing it against all existing callsites one by one (to determine whether this is a new one) may incur substantial overhead. Alternatively, SyncPerf utilizes the combination of the lock address and the offset between the stack pointer (rsp register) and the top of the current thread’s stack to identify the call stack. When different threads invoke a synchronization operation at the same statement, the combination of the lock address and stack offset are likely to be the same. If a combination is the same as that of one of the existing callsites, SyncPerf does not collect callsite information. This method can significantly reduce the overhead of callsite collection and comparison.

第三,SyncPerf 避免收集已经存在的调用点。获取同步的调用点并将其与所有现有调用点一一进行比较(以确定这是否是新的)可能会产生大量开销。或者,SyncPerf 利用锁定地址和堆栈指针(rsp 寄存器)与当前线程堆栈顶部之间的偏移量的组合来识别调用堆栈。当不同线程在同一条语句中调用同步操作时,锁地址和堆栈偏移量的组合很可能是相同的。如果组合与现有调用点之一的组合相同,则 SyncPerf 不会收集调用点信息。这种方法可以显着减少调用点收集和比较的开销。

Other mechanisms: To further reduce the runtime overhead, SyncPerf avoids any overhead due to memory allocation by preallocating the Mutex Data Table and a pool of shadow mutex objects. This is done during the program initialization phase. SyncPerf assumes a predefined but adjustable maximum number of threads and mutex objects for this purpose. Also, SyncPerf puts data collection code outside a critical section as much as possible to avoid expanding the critical section. This avoids unnecessary serialization of threads.

其他机制:为了进一步减少运行时开销,SyncPerf 通过预先分配 Mutex 数据表和影子互斥对象池来避免由于内存分配而产生的任何开销。 这是在程序初始化阶段完成的。 SyncPerf 为此目的假定了一个预定义但可调整的最大线程数和互斥对象数。 此外,SyncPerf 尽可能将数据收集代码放在临界区之外,以避免扩展临界区。 这避免了不必要的线程序列化。

Because of these careful design choices, SyncPerf imposes very low runtime overhead (2.3%, on average). Even for an application such as fluidanimate that acquires 40K locks per millisecond, SyncPerf imposes only 19% runtime overhead. Due to its low overhead, SyncPerf’s detection tool can be used in production runs.

由于这些精心的设计选择,SyncPerf 的运行时开销非常低(平均为 2.3%)。 即使对于像流体动画这样每毫秒获取 40K 锁的应用程序,SyncPerf 也只产生 19% 的运行时开销。 由于开销低,SyncPerf 的检测工具可用于生产运行。

3.1.2 Analyzing and Reporting Problems

SyncPerf reports problems when a program is about to exit or it receives a special signal like SIGUSER2. SyncPerf performs two steps to generate a report.

SyncPerf 在程序即将退出或收到特殊信号(如 SIGUSER2)时报告问题。 SyncPerf 执行两个步骤来生成报告。

First, it combines all thread-wise data of a particular synchronization together to check the number of lock acquisitions, lock contentions, and try-lock failures. It reports potential problems if any synchronization variable shows the behavior listed in Section 2.

首先,它将特定同步的所有线程数据组合在一起,以检查锁获取、锁争用和尝试锁失败的次数。 如果任何同步变量显示第 2 节中列出的行为,它会报告潜在问题。

Second, SyncPerf integrates information of different synchronization variables and threads together in order to discover more potential problems. (1) The behavior of locks with the same callsites are compared with each other: if some locks have significantly more contention than others, then there is a problem of asymmetric contention (Section 2.1.4). (2) Even if one particular lock is not acquiredmany times, the total number of acquisitions of locks with the same callsite can be significant and thus, cause a severe performance issue. (3) SyncPerf integrates information of different threads together to identify load imbalance problems. When one type of threads (with the same thread function) have “disproportionate waiting time”, it is considered to be a strong indicator for the load imbalance issue (Section 2.1.5). The integration of information helps find more potential problems.

其次,SyncPerf 将不同同步变量和线程的信息整合在一起,以便发现更多潜在的问题。 (1) 比较具有相同调用点的锁的行为:如果某些锁的争用明显多于其他锁,则存在非对称争用问题(第 2.1.4 节)。 (2) 即使一个特定的锁没有被多次获取,相同调用点的锁的获取总数可能很大,因此会导致严重的性能问题。 (3) SyncPerf 将不同线程的信息整合在一起,以识别负载不平衡问题。 当一种类型的线程(具有相同的线程功能)具有“不成比例的等待时间”时,它被认为是负载不平衡问题的一个强有力的指标(第 2.1.5 节)。 信息整合有助于发现更多潜在问题。

3.2 Diagnosis Tool

The same behavior (e.g., lock contention) may be caused by different root causes, such as asymmetric contention, improper granularity, or over-synchronization. Therefore, SyncPerf provides a heavyweight diagnosis tool to help identify root causes of such problems. This heavyweight diagnosis tool is optional and not meant for production runs. Only when some potential problems are detected but they are hard to be diagnosed manually, this diagnosis tool may provide further information (e.g., memory accesses inside critical sections) that include: how many instructions are executed on average inside each critical section; how many of these instructions access shared and non-shared locations; how many different memory locations are accessed inside a critical section; and how many instructions are read or write accesses.

相同的行为(例如,锁争用)可能是由不同的根本原因引起的,例如不对称争用、不适当的粒度或过度同步。 因此,SyncPerf 提供了一个重量级的诊断工具来帮助确定此类问题的根本原因。 这个重量级的诊断工具是可选的,不适用于生产运行。 只有在检测到一些潜在问题但难以手动诊断时,该诊断工具才会提供更多信息(例如,临界区内部的内存访问),包括:每个临界区平均执行了多少条指令; 这些指令中有多少访问共享和非共享位置; 在临界区中访问了多少个不同的内存位置; 以及有多少指令是读或写访问。

SyncPerf’s diagnosis tool is based on a binary instrumentation framework, Pin [29]. It takes a list of problematic locks (along with their callsites) as the input, which is generated from the detection tool’s report. When a lock function is encountered, it checks whether the lock is one of the problematic ones. If so, it keeps counting the instructions, and monitoring the memory accesses inside. The tool also maintains a hash table to keep track of memory locations inside critical sections. The hash table helps find out how many data items have been accessed inside a critical section. This information help identify the situation where a lock protects too many data items, or too many instructions that are accessing non-shared data inside a critical section. Like the detection tool, the diagnosis tool maintains thread-wise and lock-wise counters for each synchronization. It also integrates information together in the end.

SyncPerf 的诊断工具基于二进制检测框架 Pin [29]。 它需要一个有问题的锁列表(连同它们的调用点)作为输入,它是从检测工具的报告中生成的。 当遇到锁函数时,它会检查锁是否是有问题的锁之一。 如果是这样,它会继续计算指令,并监视内部的内存访问。 该工具还维护一个哈希表来跟踪关键部分内的内存位置。 哈希表有助于找出在临界区中访问了多少数据项。 此信息有助于识别锁保护过多数据项或访问关键部分内的非共享数据的指令过多的情况。 与检测工具一样,诊断工具为每个同步维护线程和锁定计数器。 它最终还将信息集成在一起。

4. Evaluation

This section will answer the following questions:

  • Usage Example: What are the outputs of SyncPerf’s tools? How we can utilize the report to identify root causes? (Section 4.2)
  • Bug Detection Ability: Can SyncPerf detect real performance bugs related to synchronizations? (Section 4.3 and 4.4)
  • Performance Overhead: What is the performance overhead of SyncPerf’s detection and diagnosis tools? (Section 4.5)
  • Memory Overhead: What is the memory overhead of the detection tool? (Section 4.6)

使用示例:SyncPerf 工具的输出是什么? 我们如何利用该报告来确定根本原因? (第 4.2 节)

漏洞检测能力:SyncPerf 能否检测到与同步相关的真实性能漏洞? (第 4.3 和 4.4 节)

性能开销:SyncPerf 的检测和诊断工具的性能开销是多少? (第 4.5 节)

内存开销:检测工具的内存开销是多少? (第 4.6 节)

4.1 Experimental Setup

We performed experiments on a 16-core idle machine, with two-socket Intel(R) Xeon(R) CPU E5-2640 processors and 256GB of memory. It has 256KB L1, 2MB L2, and 20M L3 cache. The experiments were performed on the unchanged Ubuntu 14.10 operating system. We used GCC-4.9.1 with -O2, -g and -fno-omit-frame-pointer flags to compile all applications. SyncPerf utilizes the following parameters for the detection: contention rate larger than 10% is considered to be high, and the number of lock acquisition larger than 1000 per second is considered to be high. These thresholds are empirically determined. The parameters can be easily adjusted during the compilation of the detection tool. Section 4.3 evaluates false positives when using these parameters.

我们在一台 16 核空闲机器上进行了实验,该机器具有两路 Intel(R) Xeon(R) CPU E5-2640 处理器和 256GB 内存。 它具有 256KB L1、2MB L2 和 20M L3 缓存。 实验在未更改的 Ubuntu 14.10 操作系统上进行。 我们使用带有 -O2、-g 和 -fno-omit-frame-pointer 标志的 GCC-4.9.1 来编译所有应用程序。 SyncPerf 使用以下参数进行检测:大于 10% 的争用率被认为是高的,每秒大于 1000 的锁获取次数被认为是高的。 这些阈值是凭经验确定的。 在编译检测工具的过程中可以很容易地调整参数。 4.3 节评估使用这些参数时的误报。

Evaluated Applications: We used a well-tuned benchmark suite, PARSEC [3], with native inputs. PARSEC applications have complexity comparable with real applications (see Table 3). We also evaluated three widely used real world applications: Apache, MySQL, and Memcached. We ran Apache-2.4 server program with the ab client that is distributed with the source code. We tested MySQL-5.6.27 using the sysbench client and the mysql-test. For memcached, we evaluated two different versions – memcached- -1.4.4, and memcached-2.4.24, which are all exercised using the memslap benchmark. Data presented in the paper is for memcached-2.4.24 unless otherwise mentioned.

评估的应用程序:我们使用了经过良好调整的基准套件 PARSEC [3],并带有本地输入。 PARSEC 应用程序具有与实际应用程序相当的复杂性(参见表 3)。 我们还评估了三个广泛使用的现实世界应用程序:Apache、MySQL 和 Memcached。 我们使用随源代码分发的 ab 客户端运行 Apache-2.4 服务器程序。 我们使用 sysbench 客户端和 mysql-test 测试了 MySQL-5.6.27。 对于 memcached,我们评估了两个不同的版本——memcached--1.4.4 和 memcached-2.4.24,它们都使用 memslap 基准测试。 除非另有说明,本文中提供的数据适用于 memcached-2.4.24。

4.2 Usage Examples

SyncPerf provides two tools that help identify the root causes of problems. This section shows a usage example for application canneal of PARSEC.Figure 4 shows an example report generated by SyncPerf’s detection tool. For locks, it reports the results of three quadrants as shown in Figure 1. For each lock, SyncPerf reports source code information. For canneal, SyncPerf only reports one lock with high contention rate and low acquisition frequency in rng.h file. The corresponding code is shown in Figure 6. It is not very easy to understand this case. Therefore, we can resort to SyncPerf’s diagnosis tool.

SyncPerf 提供了两个工具来帮助确定问题的根本原因。 本节展示了 PARSEC 应用程序通道的使用示例。图 4 显示了 SyncPerf 的检测工具生成的示例报告。 对于锁,它报告三个象限的结果,如图 1 所示。对于每个锁,SyncPerf 报告源代码信息。 对于 canneal,SyncPerf 只在 rng.h 文件中报告一个竞争率高、获取频率低的锁。 对应的代码如图6所示。这个案例不是很容易理解。 因此,我们可以求助于 SyncPerf 的诊断工具。

The diagnosis tool takes the reported locks from a specified file in the same directory, mostly call stacks of corresponding locks, as the input. An example of the report is shown in Figure 5. For canneal application, SyncPerf’s diagnosis tool reports that only less than 1% instructions access the shared memory. Further consultation of the source code indicates that seed is the only shared access inside the critical sections. However, canneal currently puts the whole random generator inside the critical section, as described in Section 4.4.3. Moving the random generator out of the critical section improves the performance of this application by 4%.

诊断工具从同一目录下的指定文件中获取报告的锁,主要是对应锁的调用堆栈,作为输入。 报告示例如图 5 所示。对于 canneal 应用程序,SyncPerf 的诊断工具报告只有不到 1% 的指令访问共享内存。 进一步查阅源代码表明,种子是关键部分内唯一的共享访问。 然而,canneal 目前将整个随机生成器置于临界区,如 4.4.3 节所述。 将随机生成器移出临界区可使该应用程序的性能提高 4%。

4.3 Effectiveness

SyncPerf is effective in detecting synchronization related performance bugs. The results are shown in Table 2. SyncPerf detected nine performance bugs in PARSEC and six performance bugs in real world applications. Among the 15 performance bugs, seven were previously undiscovered, including three in large real applications such as MySQL and memcached. We have notified programmers of all of these new performance bugs. The MySQL-I bug does not exist any more because the corresponding functions have been removed in a later version (MySQL-5.7). Remaining bugs are still under review.

SyncPerf 可有效检测与同步相关的性能错误。 结果如表 2 所示。SyncPerf 在 PARSEC 中检测到 9 个性能错误,在实际应用程序中检测到 6 个性能错误。 在这 15 个性能 bug 中,有 7 个是以前未被发现的,其中 3 个是在大型实际应用程序中发现的,例如 MySQL 和 memcached。 我们已将所有这些新的性能错误通知程序员。 MySQL-I 的 bug 已经不存在了,因为相应的功能在之后的版本(MySQL-5.7)中已经被移除了。 剩余的错误仍在审查中。

False Positives: We evaluated false positives of SyncPerf, using the threshold for contention rate and acquisition frequency of 10% and 1000 per second respectively. SyncPerf has no false positives for 12 programs (Table 2) of PARSEC and Memcached application. SyncPerf reports two potential performance problems in Apache. We have fixed one of them, with around 8% performance improvement. SyncPerf reports another one with high acquisition frequency (1252 per second) and low contention rate (4.5%). This is related to one big mutex of the queue. Fixing this problem requires significant changes in code. Therefore, we did not solve this problem. For MySQL, SyncPerf reports three potential performance bugs. Two of them have been fixed with performance improvement of 19% and 11% respectively. Another bug is related to keycache'scache lock. This lock has high acquisition frequency (1916 per second) and low contention rate (0.0%). We tried to use spin lock as a replacement for the mutex lock, but we did not achieve any performance improvement. Therefore, this could be a potential false positive of SyncPerf. Thus, SyncPerf reports only two potential false positives at most.

误报:我们评估了 SyncPerf 的误报,分别使用 10% 和每秒 1000 次的争用率阈值和采集频率。 SyncPerf 对 PARSEC 和 Memcached 应用程序的 12 个程序(表 2)没有误报。 SyncPerf 报告了 Apache 中的两个潜在性能问题。我们已经修复了其中一个,性能提升了大约 8%。 SyncPerf 报告另一个具有高采集频率(每秒 1252 次)和低争用率 (4.5%) 的报告。这与队列的一个大互斥锁有关。解决此问题需要对代码进行重大更改。因此,我们没有解决这个问题。对于 MySQL,SyncPerf 报告了三个潜在的性能错误。其中两个已修复,性能分别提高了 19% 和 11%。另一个错误与 keycache'scache 锁有关。该锁具有高获取频率(每秒 1916 次)和低争用率(0.0%)。我们尝试使用自旋锁作为互斥锁的替代品,但没有取得任何性能提升。因此,这可能是 SyncPerf 的潜在误报。因此,SyncPerf 最多只报告两个潜在的误报。

False Negatives: It is difficult to assess whether SyncPerf has any false negative or not since there is no oracle that provides a complete list of all performance bugs in the evaluated applications. One option would be to experiment with known performance bugs. Our results indicate that SyncPerf detects all known performance bugs from the evaluated applications.

False Negatives:很难评估 SyncPerf 是否有任何 false negative,因为没有预言机可以提供评估应用程序中所有性能错误的完整列表。 一种选择是尝试已知的性能错误。 我们的结果表明 SyncPerf 从评估的应用程序中检测到所有已知的性能错误。

4.4 Case Studies

This section provides more details about the detected performance bugs.

4.4.1 Extensive Acquisitions and High Contention

Existing tools [8, 19, 23, 28, 41, 43] mainly focus on performance bugs with this external symptom. However, only 4 out of 15 detected bugs have this symptom and they belong to three different categories as described below.

现有工具 [8, 19, 23, 28, 41, 43] 主要关注具有这种外部症状的性能错误。 但是,检测到的 15 个错误中只有 4 个具有此症状,它们属于如下所述的三个不同类别。

Asymmetric Contention: dedup is a compression program with data de-duplication algorithm. It has extensive lock acquisitions (23531 per second) and a high contention rate (13.6%) in an array of locks (encoder.c:1051). These locks protect different buckets of a hash table. SyncPerf detects these locks with asymmetric contention problems: these locks (with the same callsite) have different number of lock acquisitions, ranging from 3 to 8586; the one with the most acquisitions has a contention rate of 13.6%, while others have less than 1% contention rate. This bug is detected by Coz, but that requires expertise to identify the root cause [10]. Instead, SyncPerf can automatically identify this bug, without resorting to manual expertise. By changing the hash function to reduce hash collision using the prosed fix by the Coz paper, the performance is improved by 12.1%.

Asymmetric Contention: dedup 是一个带有重复数据删除算法的压缩程序。 它具有大量锁获取(每秒 23531 次)和锁数组(encoder.c:1051)中的高争用率(13.6%)。 这些锁保护哈希表的不同存储桶。 SyncPerf 检测这些具有非对称争用问题的锁:这些锁(具有相同的调用站点)具有不同的锁获取次数,范围从 3 到 8586; 收购最多的竞争率为 13.6%,而其他的竞争率低于 1%。 Coz 检测到此错误,但这需要专业知识来确定根本原因 [10]。 相反,SyncPerf 可以自动识别此错误,而无需求助于人工专业知识。 通过使用 Coz 论文的散列修复更改哈希函数以减少哈希冲突,性能提高了 12.1%。

Improper Granularity: Memcached-1.4.4 has a known performance bug caused by improper granularity of locks. It uses a single cache lock to protect an entire hash table [13]. When we used memslap to generate 10000 get and set requests to exercise Memcached (with 16 threads), SyncPerf detects 71405 lock acquisitions per second and a high contention rate (45.8%). The diagnosis tool finds that a single lock protects over 9 million different shared locations. Clearly, this lock is too coarse-grained. Changing the global cache lock to an array of item lock as appeared in Memcached-2.4.24 improves the throughput by 16.3%. This bug is shown as memcached-II in Table 2.

不适当的粒度:Memcached-1.4.4 有一个由不适当的锁粒度引起的已知性能错误。 它使用单个缓存锁来保护整个哈希表 [13]。 当我们使用 memslap 生成 10000 个 get 和 set 请求来行使 Memcached(16 个线程)时,SyncPerf 检测到每秒 71405 个锁获取和高争用率(45.8%)。 诊断工具发现一个锁可以保护超过 900 万个不同的共享位置。 显然,这个锁的粒度太粗了。 将全局缓存锁更改为 Memcached-2.4.24 中出现的项锁数组可将吞吐量提高 16.3%。 此错误在表 2 中显示为 memcached-II。

MySQL, a popular database server program, has a similar problem (MySQL-II in Table 2) [2]. When the input table data is not using the default character set of the server or latin1, MySQL calls get internal charset() function. SyncPerf detects extensive lock acquisitions (146299 per second) and a high contention rate (38.5%). Furthermore, SyncPerf’s diagnosis tool reports that a single mutex lock protects 512 different shared variables, with 16384 bytes in total. By replacing the lock with an array of locks with one lock per charset [2], the throughput of MySQL is improved by 10.9%.

MySQL 是一种流行的数据库服务器程序,也有类似的问题(表 2 中的 MySQL-II)[2]。 当输入的表数据没有使用服务器的默认字符集或 latin1 时,MySQL 调用 get internal charset() 函数。 SyncPerf 检测到大量锁获取(每秒 146299)和高争用率(38.5%)。 此外,SyncPerf 的诊断工具报告说,单个互斥锁可以保护 512 个不同的共享变量,总共 16384 个字节。 通过将锁替换为每个字符集一个锁的锁数组 [2],MySQL 的吞吐量提高了 10.9%。

SyncPerf reports a new performance bug (MySQL-I) in end_thr_alarm function of MySQL. SyncPerf reports extensive lock acquisitions (723K per second) and a high contention rate (25.5%) for mutex LOCK alarm. The critical section has unnecessary conditional waits inside, possibly caused by code evolution. Programmers might have restructured the code logic, but forgot to remove these unnecessary waits. Removing the conditional wait improves performance of MySQL by 18.9%. We have reported this problem to programmers of MySQL and they replied that the corresponding code has been removed in MySQL-5.7.

SyncPerf 在 MySQL 的 end_thr_alarm 函数中报告了一个新的性能错误 (MySQL-I)。 SyncPerf 报告大量锁获取(每秒 723K)和互斥锁警报的高争用率(25.5%)。 临界区内部有不必要的条件等待,可能是代码演化造成的。 程序员可能已经重构了代码逻辑,但却忘记了移除这些不必要的等待。 删除条件等待将 MySQL 的性能提高了 18.9%。 我们已经向 MySQL 的程序员报告了这个问题,他们回复说在 MySQL-5.7 中已经删除了相应的代码。

4.4.2 Extensive Acquisitions but Low Contention

These locks are in Q4 of Figure 1 and are practically ignored by existing tools. As shown in Table 2, 5 out of 15 performance bugs fall into this category. They are new performance bugs detected by SyncPerf.

这些锁在图 1 的第 4 象限,并且实际上被现有工具忽略了。 如表 2 所示,15 个性能错误中有 5 个属于此类别。 它们是 SyncPerf 检测到的新性能错误。

Improper Primitives: facesim is a PARSEC application that simulates the motion of human faces. SyncPerf detects that one type of locks (ones with the same callsite) has 15288 acquisitions per second but the contention rate is very low (4.6%). We replaced mutex locks and conditional variables with atomic instructions, and that improved the performance by 31%. A code snippet of fix is shown in Figure 7.

Improper Primitives:facesim 是一个 PARSEC 应用程序,可以模拟人脸的运动。 SyncPerf 检测到一种类型的锁(具有相同调用站点的锁)每秒有 15288 次获取,但争用率非常低 (4.6%)。 我们用原子指令替换了互斥锁和条件变量,性能提高了 31%。 修复代码片段如图 7 所示。

5. Limitations and Future Work

SyncPerf has some limitations: First, SyncPerf cannot identify performance bugs due to ad hoc synchronizations [42], atomic instructions or transactional memory [20]. Currently, it only focuses on performance problems related to explicit synchronization primitives. More specifically, the current implementation only supports POSIX APIs related to synchronizations, and is only verified on the Linux. However, the same idea can be easily applied to other threading libraries. Second, SyncPerf cannot check contention of internal locks inside the glibc library. This can be fixed if the implementation is embedded inside the glibc library. Finally, when there is no frame pointer inside a program’s binary, SyncPerf may need to use backtrace to acquire callsite information or the program may requires the recompilation. The first method may incur more overhead for its detection tool. In future, we would like to extend our tools to overcome some of these limitations. In addition, we would like to include a graphical interface so that some visual representation of results can be provided.

SyncPerf 有一些限制:

首先,SyncPerf 无法识别由于临时同步 [42]、原子指令或事务内存 [20] 而导致的性能错误。目前,它只关注与显式同步原语相关的性能问题。更具体地说,当前的实现仅支持与同步相关的 POSIX API,并且仅在 Linux 上进行了验证。然而,同样的想法可以很容易地应用于其他线程库。

其次,**SyncPerf 无法检查 glibc 库中的内部锁争用情况。**如果实现嵌入在 glibc 库中,则可以修复此问题。

最后**,当程序的二进制文件中没有帧指针时,SyncPerf 可能需要使用回溯来获取调用点信息,或者程序可能需要重新编译。**第一种方法可能会为其检测工具带来更多开销。

将来,我们希望扩展我们的工具来克服其中的一些限制。此外,我们希望包括一个图形界面,以便可以提供一些结果的可视化表示。

0 人点赞