Better perf for Android using SchedTune and SCHED_DEADLINE

2022-05-13 18:56:02

SCHED_FIFO in Android (today)
● Used for some latency-sensitive tasks
  ○ SurfaceFlinger (3-8 ms every 16 ms, RT priority 98)
  ○ Audio (<1 ms every 3-5 ms, low RT priority)
  ○ schedfreq kthread(s) (sporadic and unbounded, RT priority 50)
  ○ others
● Other latency-sensitive tasks that are NOT SCHED_FIFO
  ○ UI thread (where app code resides; handles most animation and input events)
  ○ Render thread (generates the actual OpenGL commands used to draw the UI)
  ○ not SCHED_FIFO because
    ■ load-balancing CPU selection is naive
    ■ RT throttling is too strict
    ■ risk that these tasks can DoS CPUs

SCHED_FIFO (and beyond?)
● Use SCHED_FIFO for UI and Render threads
  ○ Userspace support already in N-DR (to be released in AOSP in the December timeframe)
  ○ EAS-integrated RT CPU selection in flight (to be part of the MR2 release)
  ○ Results: ~10% (90th percentile), ~12% (95th) and ~23% (99th) improvements in perf/Watt on jank benchmarks
● TEMP_FIFO
  ○ demote to CFS instead of throttling (RT throttling)

SCHED_DEADLINE (instead of SCHED_FIFO?)
✓ Long-term ambition is to provide better QoS using SCHED_DEADLINE
  https://linuxplumbersconf.org/2015/ocw//system/presentations/3063/original/lelli_slides.pdf
✓ If prototyping results are positive, mainline adoption of the required modifications should be easier to achieve (compared with modifying SCHED_FIFO)
✗ Missing features
  ○ https://github.com/jlelli/sched-deadline/wiki/TODOs
    ■ reclaiming (short-term flexibility)
    ■ integration with schedutil
    ■ cgroup-based scheduling
    ■ demotion to CFS
● The guinea pig for next steps will probably be SurfaceFlinger (16 ms period, 3-8 ms runtime)

SchedTune in a Nutshell
● Enables the collection of task-related information from informed runtimes
  ○ using a localized tuning interface to balance energy efficiency vs. performance boost
  ○ extending Sched{Freq,Util} for OPP selection and EAS for task placement
● OPP selection: running at a higher/lower OPP
  ○ makes a CPU appear artificially more (or less) utilized than it actually is
  ○ depending on which tasks are currently active on that CPU
● Task placement: biasing CPU selection in the wake-up path
  ○ based on an evaluation of the power-vs-performance trade-off
  ○ using a performance-index definition which helps answer: how much power are we willing to spend to get a certain speedup in a task's time-to-completion?
● Uses cgroups to provide both global and per-task boosting
  ○ simple yet effective support for task classification
  ○ allows more advanced use cases where the boost value is tuned at run time, e.g. replacing the powersave/performance governors, or supporting touch boosting

SchedTune Discussion Points
● Is the cgroups interface a viable solution for mainline integration?
  ○ cgroups v2 discussions about a per-process (instead of per-task) interface?
  ○ Are the implied overheads (e.g. for moving tasks) acceptable?
● How can we improve the definition of SchedTune's performance index?
  ○ How much is task performance affected by a given scheduling decision?
  ○ How can we factor in all the potential slow-down threats? e.g. co-scheduling, higher-priority tasks, blocked utilization, interrupt pressure, etc.
● Is negative boosting useful? Can we prove it useful and improve support for it?
  ○ Where/when is it useful to artificially lower the perceived utilization of a task? Identify use cases, e.g. background tasks, memory-bound tasks

Performance Boosting: What Does it Mean?
● Speed up the time-to-completion of a task activation
  ○ by running on a higher-capacity CPU (i.e. OPP)
    ■ i.e. small tasks on big cores and/or using higher OPPs
● To achieve this goal we need:
  ○ A) a boosting strategy
    ■ evaluate how much "CPU bandwidth" a task requires
  ○ B) a CPU-selection biasing mechanism
    ■ select a cluster/CPU which (can) provide that bandwidth
    ■ evaluate whether the energy-performance trade-off is acceptable
  ○ C) an OPP-selection biasing mechanism
    ■ configure the selected CPU to provide (at least) that bandwidth
    ■ ...but possibly only while a boosted task is RUNNABLE on that CPU
  ○ ...and do all of this with no noticeable overhead

Patches Availability and List Discussions
● The initial full stack has been split into two series
  ○ 1) Non-EAS-dependent bits
    ■ OPP selection biasing
    ■ global boosting strategy
    ■ cgroups-based per-task boosting support
    Posted on LKML as RFCv1 [1] and RFCv2 [2]
  ○ 2) EAS-dependent bits
    ■ CPU selection biasing
    ■ energy-model filtering
    Available on AOSP and LSK for kernels 3.18 and 4.4 [3,4]

[1] https://lkml.org/lkml/2015/9/15/679
[2] http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1259645.html
[3] https://android.googlesource.com/kernel/common/ /android-3.18
[4] https://android.googlesource.com/kernel/common/ /android-4.4

Boosting Strategy: Bandwidth Margin Computation
● Task utilization defines the task's required CPU bandwidth
  ○ to boost a task we need to inflate this requirement by adding a "margin"
  ○ many different strategies/policies can be defined
● Main goals
  ○ a well-defined meaning from user space
    ■ 0% boost: run at the minimum required capacity (maximum energy efficiency)
    ■ 100% boost: run at the maximum possible speed (minimum time-to-completion)
    ■ 50%? ==> "something" exactly in between the previous two
  ○ easy integration with SchedFreq and EAS
    ■ by working on top of already-used signals
    ■ thus providing a different "view" of the SEs'/RQs' utilization signals

Signal Proportional Compensation (SPC)
● The boost value is converted into an additional margin
  ○ which is computed to compensate up to maximum performance
    ■ i.e. the boost margin is a function of the current and max utilization:

      margin = boost_pct × (max_capacity − cur_capacity), with boost_pct ∈ [0, 1]

SchedTune Performance Index
● Based on the composition of two metrics:

  Perf_idx = SpeedUp_idx − Delay_idx

● SpeedUp index: how much faster can the task run?

  SpeedUp_idx = SUI = cpu_boosted_capacity − task_util

● Delay index: how much can the task be slowed down?

  Delay_idx = DLI = 1024 × cpu_util / (task_util + cpu_util)
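A numeric sketch of these indices, taking the Delay index denominator as (task_util + cpu_util) and treating `cpu_util` as the utilization contributed by other tasks on the candidate CPU; the zero-denominator guard and the example values are illustrative assumptions:

```c
/* Sketch of the SchedTune performance index composition.
 * All signals are on the kernel's 0..1024 capacity scale. */
#define SCHED_CAPACITY_SCALE 1024

/* How much faster can the task run on this (boosted) CPU? */
long speedup_idx(long cpu_boosted_capacity, long task_util)
{
        return cpu_boosted_capacity - task_util;
}

/* How much can the task be slowed down by the CPU's other load? */
long delay_idx(long cpu_util, long task_util)
{
        long total = task_util + cpu_util;
        if (total == 0)
                return 0; /* idle CPU, no contention */
        return SCHED_CAPACITY_SCALE * cpu_util / total;
}

long perf_idx(long cpu_boosted_capacity, long task_util, long cpu_util)
{
        return speedup_idx(cpu_boosted_capacity, task_util)
             - delay_idx(cpu_util, task_util);
}
```

For example, a task with utilization 256 considering a CPU with boosted capacity 800 and 512 units of other load gets SUI = 544 and DLI = 1024*512/768 = 682, i.e. a negative Perf_idx of -138: the expected contention outweighs the capacity headroom, biasing the wake-up path away from that CPU.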
