This article was originally published by Ampere Computing.

You’re running your application on a new cloud instance or a server (or SUT, a system under test) and you notice there’s a performance issue. Or you want to ensure you’re getting the best performance, given the system resources at your disposal. This document discusses some basic questions you should ask and ways to answer them.

Prerequisites: Know Your VM or Server

Before you start troubleshooting or embarking on a performance analysis exercise, you need to be aware of the system resources at your disposal. System-level performance typically boils down to four components and how they interact with one another: CPU, memory, network, and disk. Also refer to Brendan Gregg’s excellent article Linux Performance Analysis in 60,000 Milliseconds for a great start to quickly evaluating performance issues.

This article explains how to dig deeper to understand performance issues.

Determine CPU Type

Run the $lscpu command to display the CPU type, CPU frequency, number of cores, and other CPU-related information:

ampere@colo1:~$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          160
On-line CPU(s) list:             0-159
Thread(s) per core:              1
Core(s) per socket:              80
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
CPU max MHz:                     3000.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        50.00
L1d cache:                       10 MiB
L1i cache:                       10 MiB
L2 cache:                        160 MiB
NUMA node0 CPU(s):               0-79
NUMA node1 CPU(s):               80-159
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; CSV2, BHB
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid
                                 asimdrdm lrcpc dcpop asimddp ssbs
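If you want to check the topology programmatically, the lscpu fields can be parsed into a dictionary. The following is a minimal sketch, not part of the original article: the helper names parse_lscpu and topology_is_consistent are hypothetical, and the sample text only repeats a few values from the output shown here.

```python
def parse_lscpu(text):
    """Parse `lscpu` output into a dict of field name -> value string."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Lines without a colon (e.g. the wrapped Flags line) are skipped.
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def topology_is_consistent(fields):
    """Check that CPU(s) == sockets * cores per socket * threads per core."""
    expected = (int(fields["Socket(s)"])
                * int(fields["Core(s) per socket"])
                * int(fields["Thread(s) per core"]))
    return int(fields["CPU(s)"]) == expected


# A few lines copied from the lscpu output in this article.
SAMPLE_LSCPU = """\
CPU(s):                          160
Thread(s) per core:              1
Core(s) per socket:              80
Socket(s):                       2
NUMA node(s):                    2
"""

if __name__ == "__main__":
    info = parse_lscpu(SAMPLE_LSCPU)
    print("CPUs:", info["CPU(s)"], "consistent:", topology_is_consistent(info))
```

On a live system the text could come from subprocess.run(["lscpu"], capture_output=True, text=True).stdout. A mismatch between CPU(s) and the product usually means some cores are offline or hidden from the VM.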

Determine Memory Configuration

Run the $free command to see the total amount of physical and swap memory, including a breakdown of memory usage. Run the Multichase benchmark to determine the latency, memory bandwidth, and load-latency of the instance/SUT:

ampere@colo1:~$ free
              total        used        free      shared  buff/cache   available
Mem:      130256992     3422844   120742736        4208     6091412   125852984
Swap:       8388604           0     8388604
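To turn this output into a single number you can track, you can compute the share of memory reported as available. This is a small sketch under the assumption that the output has the column layout shown here (recent procps versions); parse_free and available_percent are hypothetical helper names.

```python
def parse_free(text):
    """Return the Mem: row of `free` output as a dict of column -> KiB."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    for ln in lines[1:]:
        parts = ln.split()
        if parts[0] == "Mem:":
            return dict(zip(header, map(int, parts[1:])))
    raise ValueError("no Mem: row found")


def available_percent(mem):
    """Share of total memory that `free` reports as available."""
    return 100.0 * mem["available"] / mem["total"]


# Values copied from the `free` output in this article.
SAMPLE_FREE = """\
              total        used        free      shared  buff/cache   available
Mem:      130256992     3422844   120742736        4208     6091412   125852984
Swap:       8388604           0     8388604
"""

if __name__ == "__main__":
    mem = parse_free(SAMPLE_FREE)
    print(f"available: {available_percent(mem):.1f}% of {mem['total']} KiB")
```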

Assess Network Capability

Run the $ethtool command to see the hardware settings of the NIC. It can also be used to control network device driver and hardware settings. If you’re running the workload in a client-server model, it’s a good idea to know the bandwidth and latency between the client and the server. A simple iperf3 test is sufficient to determine the bandwidth, and a simple ping test gives you the latency. In a client-server setup it’s also advisable to keep the number of network hops to a minimum; traceroute is a network diagnostic command that displays the route and measures transit delays of packets across the network. The following shows the driver information that ethtool reports for one interface:

ampere@colo1:~$ ethtool -i enp1s0np0
driver: mlx5_core
version: 5.7-1.0.2
firmware-version: 16.32.1010 (RCP0000000001)
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Understand Storage Infrastructure

It’s essential to know the disk capabilities before you start running workloads. Knowing the throughput and latency of your disk and filesystems will help you plan and architect the workload effectively. Flexible I/O (or “fio”) is the tool of choice to determine these values.

Now On to the Top 10 Questions

1. Are my CPUs being used efficiently?

One of the major components of the Total Cost of Ownership is the CPU. It’s therefore worth finding out how efficiently CPUs are being used. Idle CPUs typically mean there are external dependencies, like waiting on disk or network accesses. It’s always a good idea to monitor CPU utilization and to check whether core utilization is uniform.

A sample output from the $top -1 command is pictured below.

2. Are my CPUs running at the highest frequencies possible?

Modern CPUs use p-states to scale the frequency and voltage at which they run, reducing the power consumption of the CPU when higher frequencies are not needed. This is called Dynamic Voltage and Frequency Scaling (DVFS) and is managed by the OS. In Linux, p-states are managed by the CPUFreq subsystem, which uses different algorithms (called governors) to determine the frequency at which the CPU runs. Generally, for performance-sensitive applications, it’s a good idea to ensure that the performance governor is used, and the following command uses the cpupower utility to achieve that. Keep in mind that the frequency at which a CPU should run is workload dependent:

cpupower frequency-set --governor performance

To check the frequency of the CPU while running your application, run the following command:

ampere@colo1:~$ cpupower frequency-info
analyzing CPU 0:
  driver: cppc_cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 1000 MHz - 3.00 GHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 1000 MHz and 3.00 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 1000 MHz (asserted by call to kernel)
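cpupower reports one CPU at a time. To confirm that every core uses the same governor, you can read the CPUFreq sysfs files directly. The sketch below assumes the standard /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor layout; governor_counts and all_use are hypothetical helper names, and the root path is a parameter so the function can be pointed at a test directory.

```python
from collections import Counter
from pathlib import Path


def governor_counts(cpu_root="/sys/devices/system/cpu"):
    """Count how many CPUs use each cpufreq governor."""
    counts = Counter()
    # cpu[0-9]* matches cpu0, cpu1, ... but not cpufreq or cpuidle.
    for path in Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[path.read_text().strip()] += 1
    return counts


def all_use(counts, governor):
    """True when at least one CPU was found and all use `governor`."""
    return bool(counts) and set(counts) == {governor}


if __name__ == "__main__":
    counts = governor_counts()
    print(dict(counts), "all performance:", all_use(counts, "performance"))
```

On the system shown above, where the governor is ondemand, the check for performance would return False until the cpupower frequency-set command has been applied.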


3. How much time am I spending in my application versus kernel time?

It’s sometimes necessary to find out what percentage of the CPU’s time is consumed in user space versus privileged time (i.e., kernel space). High kernel time might be justified for a certain class of workloads (network-bound workloads, for example) but can also be an indication of a problem.

The Linux tool top can be used to find out the user vs. kernel time consumption as shown below.

  • Mpstat. This command examines statistics per CPU and checks for individual hot/busy CPUs. It is a multiprocessor statistics tool and can report statistics per CPU (-P option).
  • CPU: Logical CPU ID, or all for summary
  • %usr: User time, excluding %nice
  • %nice: User time for processes with a niced priority
  • %sys: System time
  • %iowait: I/O wait
  • %irq: Hardware interrupt CPU usage
  • %soft: Software interrupt CPU usage
  • %steal: Time spent servicing other tenants
  • %guest: CPU time spent in guest virtual machines
  • %gnice: CPU time to run a niced guest
  • %idle: Idle

To identify CPU usage per CPU and show the user-time/kernel-time ratio, %usr, %sys, and %idle are the key values. These key values can also help identify “hot” CPUs, which may be caused by single-threaded applications or interrupt mapping.
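Per-CPU percentages of this kind are derived from the cumulative tick counters in /proc/stat, sampled twice. The following sketch illustrates the arithmetic on two hand-written snapshots; it is a simplification and not how mpstat itself is implemented (for example, it does not separate guest time from user time). The helper names and the sample counters are hypothetical.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text):
    """Map 'cpuN' -> dict of cumulative tick counters from /proc/stat text."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr".
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            values = map(int, parts[1:1 + len(FIELDS)])
            cpus[parts[0]] = dict(zip(FIELDS, values))
    return cpus


def usage_percent(before, after):
    """Percent of elapsed ticks per category between two snapshots."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * v / total for k, v in delta.items()}


# Made-up snapshots for illustration only.
SNAP_1 = "cpu  200 0 100 1700 0 0 0 0 0 0\ncpu0 100 0 50 850 0 0 0 0 0 0\n"
SNAP_2 = "cpu  320 0 140 1840 0 0 0 0 0 0\ncpu0 160 0 70 920 0 0 0 0 0 0\n"

if __name__ == "__main__":
    pct = usage_percent(parse_proc_stat(SNAP_1)["cpu0"],
                        parse_proc_stat(SNAP_2)["cpu0"])
    print({k: round(v, 1) for k, v in pct.items() if v})
```

On a live system the two snapshots would be read from /proc/stat a fixed interval apart.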

4. Do I have enough memory for my application?

When you are managing a server, you might have to install a new application, or you might notice that an application has started to slow down. For managing your system resources and understanding your installed system memory and the memory usage of the system, the $free command is a valuable tool. $vmstat is also a valuable tool for monitoring memory usage and for checking whether the system is actively swapping memory.

  • Free. The Linux free command shows memory and swap statistics.

    The output shows the total, used, and free memory of the system. An important column is the available value, which estimates the memory available to an application without the need to swap. It also accounts for memory that can’t be reclaimed immediately.

  • Vmstat. This command provides a high-level view of system memory health, including currently free memory and paging statistics.

    The $vmstat command shows active memory being swapped out (paging).

The $vmstat command prints a summary of the current status. The columns are in kilobytes by default and are:

  • swpd: Amount of swapped-out memory
  • free: Free available memory
  • buff: Memory in the buffer cache
  • cache: Memory in the page cache
  • si: Memory swapped in (paging)
  • so: Memory swapped out (paging)

If si and so are non-zero, the system is under memory pressure and is swapping memory to the swap device.
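The si/so check can be automated by parsing vmstat output captured with an interval (for example, vmstat 1 5). The sketch below uses made-up sample rows and hypothetical helper names. It skips the first data row by default because vmstat's first report gives averages since boot rather than current activity.

```python
def swap_activity(vmstat_text):
    """Return a list of (si, so) pairs, one per vmstat data row."""
    header = None
    rows = []
    for line in vmstat_text.splitlines():
        parts = line.split()
        if "si" in parts and "so" in parts:
            header = parts  # column-name line
        elif header and parts and parts[0].isdigit():
            row = dict(zip(header, parts))
            rows.append((int(row["si"]), int(row["so"])))
    return rows


def is_swapping(rows, skip_first=True):
    """True if any sample after the since-boot row shows swap traffic."""
    samples = rows[1:] if skip_first and len(rows) > 1 else rows
    return any(si > 0 or so > 0 for si, so in samples)


# Made-up vmstat output for illustration only.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0  10240 512000  20480 900000    3    5   120   340  900 1500 20  5 70  5  0
 3  1  20480 256000  20480 880000  150  420   800  2100 1200 2200 25 10 40 25  0
"""

if __name__ == "__main__":
    rows = swap_activity(SAMPLE_VMSTAT)
    print(rows, "swapping:", is_swapping(rows))
```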

5. Am I getting the right amount of memory bandwidth?

To know the right amount of memory bandwidth, first get the “Max Memory Bandwidth” value of your system. The “Max Memory Bandwidth” value can be calculated from:

  • Base DRAM clock frequency
  • Number of data transfers per clock: two, in the case of “double data rate” (DDR*) memory
  • Memory bus (interface) width: for example, DDR3 is 64 bits wide (also referred to as line)
  • Number of interfaces: modern personal computers typically use two memory interfaces (dual-channel mode) for an effective 128-bit bus width
  • Max Memory Bandwidth = Base DRAM clock frequency * Number of data transfers per clock * Memory bus width * Number of interfaces

This value represents the theoretical maximum bandwidth of the system, also known as the “burst rate”. You can now run benchmarks like Multichase or Bandwidth against the system and verify the values.

Note: burst rates may not be sustainable, and the values achieved might be a bit less than calculated.
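As a worked example of the formula above, the sketch below computes the theoretical peak for a dual-channel DDR3 configuration. The 800 MHz base clock (DDR3-1600) is an assumed example value, not a figure from this article, and the function name is hypothetical. The bus width is converted from bits to bytes so the result comes out in GB/s.

```python
def max_memory_bandwidth_gbs(base_clock_mhz, transfers_per_clock,
                             bus_width_bits, interfaces):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    bytes_per_second = (base_clock_mhz * 1e6 * transfers_per_clock
                        * bytes_per_transfer * interfaces)
    return bytes_per_second / 1e9


if __name__ == "__main__":
    # Assumed example: 800 MHz clock, DDR (2 transfers), 64-bit bus, 2 channels.
    peak = max_memory_bandwidth_gbs(800, 2, 64, 2)
    print(f"theoretical peak: {peak:.1f} GB/s")
```

Comparing a measured Multichase result against this number gives a quick sense of how close the system gets to its theoretical ceiling.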

6. Is my workload using all my CPUs in a balanced way?

When running workloads on your server, as part of performance tuning or troubleshooting, you may want to know on which CPU core a particular process is currently scheduled and collect performance statistics for the process running on that core. The first step is to find the process running on the CPU core. This can be done using htop. The CPU value is not shown in the default htop display. To get the CPU core value, launch $htop from the command line, press the F2 key, go to “Columns”, and add “Processor” under “Available Columns”. The currently used CPU ID of each process will appear under the “CPU” column.

  • How to configure $htop to show CPU/core:

  • $htop command showing cores 4-6 maxed out (htop core count starts from “1” instead of “0”):

  • $mpstat command for selected cores to examine statistics:

Once you have identified the CPU core, you can run the $mpstat command to examine statistics per CPU and check for individual hot/busy CPUs. This is a multiprocessor statistics tool and can report statistics per CPU (or core). For more information on $mpstat, see the “How much time am I spending in my application versus kernel time?” section above.
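Once you have a busy percentage per CPU (for example, 100 minus the %idle column from mpstat), a couple of lines are enough to flag hot cores and quantify imbalance. This is an illustrative sketch: the 90% threshold, the helper names, and the sample values are assumptions, not recommendations from the article.

```python
from statistics import mean


def hot_cpus(busy_percent, threshold=90.0):
    """Return CPU ids whose busy percentage is at or above the threshold."""
    return [cpu for cpu, busy in sorted(busy_percent.items())
            if busy >= threshold]


def imbalance(busy_percent):
    """Ratio of the busiest CPU to the mean; 1.0 means perfectly uniform."""
    avg = mean(busy_percent.values())
    return max(busy_percent.values()) / avg if avg else 0.0


# Made-up per-CPU busy percentages (CPU id -> % busy).
SAMPLE_BUSY = {0: 5.0, 1: 4.0, 2: 6.0, 3: 98.0, 4: 99.0, 5: 97.0, 6: 3.0, 7: 8.0}

if __name__ == "__main__":
    print("hot CPUs:", hot_cpus(SAMPLE_BUSY))
    print("imbalance ratio:", round(imbalance(SAMPLE_BUSY), 3))
```

A ratio well above 1.0 with only a few hot CPUs suggests the workload is not spreading across cores, for instance because of a single-threaded component or pinned interrupts.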

7. Is my network a bottleneck for my application?

Network bottlenecking can happen even before you saturate other resources on the server. This issue is found when a workload is run in a client-server model. The first thing you need to do is determine what your network looks like. The latency and bandwidth between the client and the server are especially important. iperf3, ping, and traceroute are simple tools that can help you determine the limits of your network. Once you have determined the limits of your network, tools like $dstat and $nicstat help you monitor network utilization and identify any bottlenecks in your system caused by networking.

  • Dstat. This command is used to monitor system resources, including CPU stats, disk stats, network stats, paging stats, and system stats. For monitoring network utilization, use the -n option.

    The command will give the throughput for packets received and sent by the system.

  • Nicstat. This command prints network interface statistics, including throughput and utilization.

The columns include:

  • Int: Interface name
  • %util: The maximum utilization
  • Sat: Value reflecting interface saturation statistics
  • Values prefixed with “r” = read/receive
  • Values prefixed with “w” = write/transmit
  • KB/s: Kilobytes per second
  • Pk/s: Packets per second
  • Avs: Average packet size in bytes
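Throughput and utilization figures like these can be approximated from the byte counters in /proc/net/dev, sampled twice. The sketch below assumes a full-duplex link and takes utilization as the busier of the two directions; that is an assumption about how to interpret "maximum utilization", not a description of nicstat's internals. The counters are made up and the helper names are hypothetical.

```python
def parse_net_dev(text):
    """Map interface name -> rx/tx byte and packet counters."""
    stats = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Data lines have 8 receive fields followed by 8 transmit fields.
        if sep and len(fields) >= 16:
            stats[name.strip()] = {
                "rx_bytes": int(fields[0]), "rx_packets": int(fields[1]),
                "tx_bytes": int(fields[8]), "tx_packets": int(fields[9]),
            }
    return stats


def utilization_percent(before, after, interval_s, link_mbps):
    """Utilization of the busier direction as a percent of link speed."""
    rx_bits = (after["rx_bytes"] - before["rx_bytes"]) * 8
    tx_bits = (after["tx_bytes"] - before["tx_bytes"]) * 8
    capacity_bits = link_mbps * 1e6 * interval_s
    return 100.0 * max(rx_bits, tx_bits) / capacity_bits


HEADER = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
# Made-up counters taken one second apart.
NET_1 = HEADER + "enp1s0np0: 1000000 1000 0 0 0 0 0 0 2000000 1500 0 0 0 0 0 0\n"
NET_2 = HEADER + "enp1s0np0: 126000000 90000 0 0 0 0 0 0 502000000 350000 0 0 0 0 0 0\n"

if __name__ == "__main__":
    a = parse_net_dev(NET_1)["enp1s0np0"]
    b = parse_net_dev(NET_2)["enp1s0np0"]
    print(f"{utilization_percent(a, b, 1.0, 10000):.1f}% of a 10 Gb/s link")
```

The link speed passed in should match what an iperf3 run or the NIC's reported speed shows for your setup.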

8. Is my disk a bottleneck?

Like the network, the disk can also be the reason for a poorly performing application. When it comes to measuring disk performance, we look at the following indicators:

  • Utilization
  • Saturation
  • IOPS (Input/Output Operations Per Second)
  • Throughput
  • Response time

A good rule is that when you are selecting a server/instance for an application, you should first benchmark the I/O performance of the disk so that you know the peak value or “ceiling” of the disk performance and can determine whether it meets the needs of the application. Flexible I/O is the tool of choice to determine these values.

Once the application is running, you can use $iostat and $dstat to monitor disk resource utilization in real time.

The iostat command shows per-disk I/O statistics, providing metrics for workload characterization, utilization, and saturation.

The first output line shows a summary of the system, including the kernel version, host name, date, architecture, and CPU count. The second line shows a summary for the CPUs since boot time.

For each disk device shown in the subsequent rows, it shows the basic details in the columns:

  • tps: Transactions per second
  • kB_read/s: Kilobytes read per second
  • kB_wrtn/s: Kilobytes written per second
  • kB_read: Total kilobytes read
  • kB_wrtn: Total kilobytes written
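Per-device rates like the columns above can be approximated from two samples of /proc/diskstats. The sketch below is a simplification rather than iostat's actual implementation (it ignores discard and flush requests), and the device name, counters, and helper names are made up for illustration. It also derives a utilization figure from the time the device spent doing I/O.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Map device name -> selected cumulative counters."""
    disks = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 14:
            disks[f[2]] = {
                "reads": int(f[3]), "sectors_read": int(f[5]),
                "writes": int(f[7]), "sectors_written": int(f[9]),
                "io_ms": int(f[12]),  # milliseconds spent doing I/O
            }
    return disks


def disk_rates(before, after, interval_s):
    """Per-second rates and utilization between two snapshots of one disk."""
    d = {k: after[k] - before[k] for k in before}
    return {
        "tps": (d["reads"] + d["writes"]) / interval_s,
        "kB_read/s": d["sectors_read"] * SECTOR_BYTES / 1024 / interval_s,
        "kB_wrtn/s": d["sectors_written"] * SECTOR_BYTES / 1024 / interval_s,
        "util_percent": 100.0 * d["io_ms"] / (interval_s * 1000),
    }


# Made-up snapshots taken one second apart.
DISK_1 = "259 0 nvme0n1 1000 0 80000 500 2000 0 160000 900 0 1200 1400\n"
DISK_2 = "259 0 nvme0n1 1100 0 88000 550 2500 0 200000 1100 0 1700 1950\n"

if __name__ == "__main__":
    rates = disk_rates(parse_diskstats(DISK_1)["nvme0n1"],
                       parse_diskstats(DISK_2)["nvme0n1"], 1.0)
    print(rates)
```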

The dstat command is used to monitor system resources, including CPU stats, disk stats, network stats, paging stats, and system stats. For monitoring disk utilization, use the -d option. The option shows the disk read (read) and write (writ) throughput.

The image below demonstrates a write-intensive workload.

9. Am I paying a NUMA penalty?

Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.

On a NUMA system, the greater the distance between the processor and its memory bank, the slower the processor’s access to that memory bank. For performance-sensitive applications, the OS should allocate memory from the closest possible memory bank. To monitor the memory allocation of the system or a process in real time, $numastat is a great tool to use.

The numastat command provides statistics for non-uniform memory access (NUMA) systems. These are typically systems with multiple CPU sockets.

Linux tries to allocate memory on the nearest NUMA node, and $numastat shows the current statistics of the memory allocation.

  • numa_hit: Memory allocation on the intended NUMA node
  • numa_miss: Shows local allocation that should have been elsewhere
  • numa_foreign: Shows remote allocation that should have been local
  • other_node: Memory allocation on this node while the process was running elsewhere

Both numa_miss and numa_foreign show memory allocations that were not on the preferred NUMA node. Ideally, the values of numa_miss and numa_foreign should be kept to a minimum, as higher values result in poor memory I/O performance.

The $numastat -p <process-id> command can also be used to see the NUMA distribution of a process.
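The same counters are exposed per node in sysfs, which makes it easy to track the miss ratio over time. The sketch below assumes the /sys/devices/system/node/nodeN/numastat layout with one "name value" pair per line. The counters are made up, the helper names are hypothetical, and what counts as an acceptable ratio depends on the workload.

```python
from pathlib import Path


def parse_numastat(text):
    """Parse 'name value' lines from a node's numastat file into a dict."""
    return {name: int(value)
            for name, value in (ln.split() for ln in text.splitlines()
                                if ln.strip())}


def miss_ratio(counters):
    """Fraction of allocations on this node that were numa_miss."""
    total = counters["numa_hit"] + counters["numa_miss"]
    return counters["numa_miss"] / total if total else 0.0


def read_all_nodes(node_root="/sys/devices/system/node"):
    """Map node directory name -> parsed counters for every NUMA node."""
    return {p.parent.name: parse_numastat(p.read_text())
            for p in Path(node_root).glob("node[0-9]*/numastat")}


# Made-up counters for illustration only.
SAMPLE_NODE = """\
numa_hit 9000
numa_miss 1000
numa_foreign 250
interleave_hit 10
local_node 8900
other_node 1100
"""

if __name__ == "__main__":
    counters = parse_numastat(SAMPLE_NODE)
    print(f"miss ratio: {miss_ratio(counters):.2%}")
```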

10. What’s my CPU doing when I’m running my application?

When running an application on your system/instance, you will be interested in knowing what the application is doing and which CPU resources it uses. $pidstat is a command-line tool that can monitor every individual process running on the system.

pidstat will break down the top CPU consumers into user time and system time.

This Linux tool prints CPU usage by process or thread, including user and system time. This command can also report I/O statistics of a process (-d option).

  • UID: The real user identification number of the task being monitored
  • PID: The identification number of the task being monitored
  • %usr: Percentage of CPU used by the task while executing at the user level (application), with or without nice priority
  • %system: Percentage of CPU used by the task while executing at the system level (kernel)
  • %wait: Percentage of CPU spent by the task while waiting to run
  • %CPU: Total percentage of CPU time used by the task
  • CPU: Processor/core number to which the task is attached

$pidstat -p can also be run to gather data on a particular process.
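The user/system split for a single process comes from the utime and stime fields of /proc/<pid>/stat, which count clock ticks. The following sketch shows the calculation on two made-up snapshots; it illustrates the arithmetic and is not pidstat's implementation. The helper names are hypothetical, and the tick rate is passed explicitly in the example (on a live system it is os.sysconf("SC_CLK_TCK")).

```python
import os


def cpu_ticks(stat_text):
    """Return (utime, stime) in clock ticks from /proc/<pid>/stat text."""
    # The command name is in parentheses and may contain spaces, so split
    # after the last ')'. The remaining fields start at field 3 (state);
    # utime is field 14 and stime is field 15.
    rest = stat_text.rsplit(")", 1)[1].split()
    return int(rest[11]), int(rest[12])


def cpu_percent(before, after, interval_s, ticks_per_s=None):
    """Percent of one CPU used in user and system mode over the interval."""
    if ticks_per_s is None:
        ticks_per_s = os.sysconf("SC_CLK_TCK")
    usr = 100.0 * (after[0] - before[0]) / (ticks_per_s * interval_s)
    system = 100.0 * (after[1] - before[1]) / (ticks_per_s * interval_s)
    return {"%usr": usr, "%system": system, "%CPU": usr + system}


# Made-up snapshots of the same process taken one second apart.
STAT_1 = "1234 (my app) S 1 1234 1234 0 -1 4194304 500 0 0 0 250 50 0 0 20 0 4 0 100"
STAT_2 = "1234 (my app) S 1 1234 1234 0 -1 4194304 520 0 0 0 330 70 0 0 20 0 4 0 100"

if __name__ == "__main__":
    print(cpu_percent(cpu_ticks(STAT_1), cpu_ticks(STAT_2), 1.0, ticks_per_s=100))
```

Because the value is relative to one CPU, a multi-threaded process can exceed 100%.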
