This article was originally published by Ampere Computing.

This paper describes how to effectively use GNU Compiler Collection (GCC) options to help optimize application performance on Ampere Processors.

When attempting to optimize an application, it is essential to measure whether a potential optimization actually improves performance. This includes compiler options. Using advanced compiler options may result in better runtime performance, potentially at the cost of increased compile time, more difficult debugging, and often increased binary size. Why compiler options affect performance is beyond the scope of this paper, although the short answer is that code generation, modern processor architectures, and how they interact are very complicated. Another important point is that different processors may benefit from different compiler options because of differences in computer architecture and the specific microarchitecture. Repeated experimentation with optimizations is key to performance success.

How to measure an application’s performance to determine the limiting factors, as well as optimization strategies, has already been covered in previously published articles. The paper The First 10 Questions to Answer While Running on Ampere Altra-Based Instances describes what performance data to collect to understand the entire system’s performance. A Performance Analysis Methodology for Optimizing Ampere Altra Family Processors explains how to optimize effectively and efficiently using a data-driven approach.

This paper first summarizes the most common GCC options with a description of how these options affect applications. The discussion then presents case studies using GCC options to improve the performance of VP9 video encoding software and the MySQL database on Ampere Processors. Similar strategies have been used effectively to optimize additional software running on Ampere Processors.

GCC Recommendations

The GCC compiler provides many options that can improve application performance. See the GCC website for details. To generate code that takes advantage of all the performance features available in Ampere Processors, use the gcc -mcpu option.

To use the gcc -mcpu option, either set the CPU model explicitly or tell GCC to use the CPU model of the machine that GCC is running on via -mcpu=native. Note that on legacy x86-based systems, gcc -mcpu is a deprecated synonym for -mtune, whereas gcc -mcpu is fully supported on Arm-based systems. See Arm’s guide to Compiler flags across architectures: -march, -mtune, and -mcpu for details.

In summary, whenever possible, use only -mcpu and avoid -march and -mtune when compiling for Arm. Below is a case study highlighting performance gains from setting the gcc -mcpu option with VP9 video encoding software.

Setting the -mcpu option:

  • -mcpu=ampere1: Generate code that will run on AmpereOne Processors. AmpereOne is the next generation of Cloud Native Processors from Ampere, extending the family of high-performance processors to new industry-leading core counts. Note that this may generate code that will not run on Ampere Altra and Altra Max Processors. This option was initially available in GCC version 12.1 and later, then backported to GCC 10.5 and GCC 11.3.

  • -mcpu=neoverse-n1: Generate code that will run on Ampere Altra, Ampere Altra Max, and AmpereOne Processors. While using this option for code that will run on AmpereOne is supported, it will potentially not take advantage of all the new performance features available. Note that GCC version 9.1 or higher is required to enable CPU-specific tunings for Ampere Altra and Ampere Altra Max Processors.

  • -mcpu=native: Generate code with the CPU model set based on the CPU that GCC is running on. Note that GCC version 9.1 or higher is required to enable CPU-specific tunings for Ampere Altra and Ampere Altra Max Processors.

Using -mcpu=native is potentially easier, although it has a potential drawback if the executable, shared library, or object file is used on a different system. If the build was done on an AmpereOne Processor, the code may not run on an Ampere Altra or Altra Max Processor because the generated code may include Armv8.6+ instructions supported on AmpereOne Processors. If the build was done on an Ampere Altra or Altra Max Processor, GCC will not take advantage of the latest performance enhancements available on AmpereOne Processors. This is a general issue when building code to take advantage of performance features for any architecture.
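The portability trade-off described above can be sketched as a small shell helper that picks one -mcpu value for a set of deployment targets. This is only an illustration: the function name pick_mcpu and the target labels (altra, altra-max, ampereone) are hypothetical, and the rule it applies is simply that neoverse-n1 code runs on all three processors while ampere1 code may not run on Altra or Altra Max.

```shell
# pick_mcpu TARGET...: print a -mcpu value that runs on every listed target.
# The target labels are hypothetical names used only in this sketch.
pick_mcpu() {
  need_n1=0
  [ "$#" -gt 0 ] || { echo "usage: pick_mcpu TARGET..." >&2; return 2; }
  for target in "$@"; do
    case "$target" in
      altra|altra-max) need_n1=1 ;;   # these cannot run ampere1 code
      ampereone) ;;                   # runs both ampere1 and neoverse-n1 code
      *) echo "unknown target: $target" >&2; return 2 ;;
    esac
  done
  if [ "$need_n1" -eq 1 ]; then
    echo neoverse-n1
  else
    echo ampere1
  fi
}

pick_mcpu ampereone          # prints: ampere1
pick_mcpu altra ampereone    # prints: neoverse-n1
```

A build script could then pass the result to the compiler, for example gcc -mcpu="$(pick_mcpu altra ampereone)" -O2, instead of relying on -mcpu=native on whichever machine happens to run the build.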

The following table lists which GCC versions support the Ampere Processor -mcpu values.

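| Processor | -mcpu Value | GCC 9 | GCC 10 | GCC 11 | GCC 12 | GCC 13 |
| --- | --- | --- | --- | --- | --- | --- |
| Ampere Altra | neoverse-n1 | ≥ 9.1 | ALL | ALL | ALL | ALL |
| Ampere Altra Max | neoverse-n1 | ≥ 9.1 | ALL | ALL | ALL | ALL |
| AmpereOne | ampere1 | N/A | ≥ 10.5 | ≥ 11.3 | ≥ 12.1 | ALL |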
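The version requirements in the table can be checked from a build script. The following is a minimal sketch, assuming a sort that supports -V (as GNU sort does); the helper names version_ge and mcpu_supported are hypothetical and the rules are transcribed from the table, not taken from GCC itself.

```shell
# version_ge A B: succeed when version A >= version B (relies on sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# mcpu_supported MCPU GCC_VERSION: succeed when the table above lists support.
mcpu_supported() {
  mcpu=$1
  ver=$2
  major=${ver%%.*}
  case "$mcpu" in
    neoverse-n1) version_ge "$ver" 9.1 ;;
    ampere1)
      case "$major" in
        10) version_ge "$ver" 10.5 ;;   # backport to the GCC 10 series
        11) version_ge "$ver" 11.3 ;;   # backport to the GCC 11 series
        *)  version_ge "$ver" 12.1 ;;   # first release with ampere1
      esac ;;
    *) return 2 ;;                      # value not covered by the table
  esac
}

mcpu_supported ampere1 11.2.0 && echo yes || echo no   # prints: no
mcpu_supported ampere1 11.3.0 && echo yes || echo no   # prints: yes
```

In practice the version argument could come from gcc -dumpfullversion, so a script can fall back to -mcpu=neoverse-n1 when the installed compiler predates ampere1 support.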

Our recommendation is to use the gcc -mcpu option with the appropriate value described above (-mcpu=ampere1, -mcpu=neoverse-n1, or -mcpu=native) together with -O2 to establish a performance baseline, then explore additional optimization options and measure whether each one improves performance compared to the baseline.

Summary of common GCC options:

  • -mcpu Recommended when building on Ampere Processors to enable processor-specific tuning and optimizations. (See the “Setting the -mcpu option” section above for details.)

  • -Os Optimize to reduce code size, potentially useful if your application is limited by fetching instructions.

  • -O2 Considered the standard GCC optimization option and a good baseline to compare with other GCC options.

  • -O3 Adds further optimizations to generate more efficient code for loops; useful to try if your application performance is dominated by time spent in loops.

  • Profile Guided Optimization (PGO): -fprofile-generate and -fprofile-use. Generate profile data that the compiler will use to potentially make better decisions on optimizations such as inlining, loop optimizations, and default branches. This is considered an advanced optimization as it requires changes to the build system; see below.

  • Link-Time Optimization (LTO): -flto. Enable link-time optimizations, allowing the compiler to optimize across individual source files. This allows functions to be inlined across source files, among other compiler optimizations. This is also considered an advanced optimization and potentially requires changes to the build system. This option increases overall build time, which can be dramatic for large applications. It is possible to use LTO just on performance-critical source files to potentially decrease build times.
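As an illustration of applying LTO only to performance-critical sources, the sketch below compiles one placeholder file (hot.c) with -flto and another (main.c) without it, then links them together. The file names and contents are invented for this example, and the build is skipped if gcc is not on the PATH.

```shell
# Sketch: LTO on a performance-critical file only; all names are placeholders.
work=$(mktemp -d)
cat > "$work/hot.c" <<'EOF'
int scale(int x) { return 3 * x; }
EOF
cat > "$work/main.c" <<'EOF'
#include <stdio.h>
int scale(int x);
int main(void) { printf("%d\n", scale(14)); return 0; }
EOF

lto_out=""
if command -v gcc >/dev/null 2>&1; then
  gcc -O2 -flto -c "$work/hot.c" -o "$work/hot.o"     # object carrying LTO data
  gcc -O2 -c "$work/main.c" -o "$work/main.o"         # regular object
  gcc -O2 -flto "$work/hot.o" "$work/main.o" -o "$work/app"
  lto_out=$("$work/app")
  echo "$lto_out"                                     # prints: 42
else
  echo "gcc not found; skipping build" >&2
fi
rm -rf "$work"
```

Only hot.c pays the extra link-time cost here. In a real project the same split would normally be expressed in the build system rather than in hand-written commands, and -mcpu would be added alongside -O2 as recommended above.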

VP9 Video Encoding Case Study with gcc -mcpu

VP9 is a video coding format developed by Google. libvpx is the open-source reference software implementation for the VP8 and VP9 video codecs from Google and the Alliance for Open Media (AOMedia). libvpx provides a significant improvement in video compression over x264 at the expense of additional computation time. More information on VP9 and libvpx is available on Wikipedia.

In this case study, the VP9 build is configured to use the gcc -mcpu=native option to improve performance. As mentioned above, use the -mcpu option when compiling on Ampere Processors to enable CPU-specific tuning and optimizations. Initially libvpx was built using the default configuration and then rebuilt using -mcpu=native. To evaluate VP9 performance, a 1080P input video file, original_videos_Sports_1080P_Sports_1080P-0063.mkv from YouTube’s User Generated Content Dataset, was used. See Ampere’s ffmpeg tuning and build guide for details on how to build ffmpeg and various codecs, including VP9, for Ampere Processors.

Default libvpx Build:

$ git clone
$ cd libvpx/
$ export CFLAGS="-mcpu=native -DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdeclaration-after-statement -Wdisabled-optimization -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wuninitialized -Wunused -Wextra -Wundef -Wframe-larger-than=52000 -std=gnu89"
$ export CXXFLAGS="-mcpu=native -DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdisabled-optimization -Wextra-semi -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wmissing-declarations -Wuninitialized -Wunused -Wextra -Wno-psabi -Wc++14-extensions -Wc++17-extensions -Wc++20-extensions -std=gnu++11 -std=gnu++11"
$ ./configure
$ make verbose=1 
$ ./vpxenc --codec=vp9 --profile=0 --height=1080 --width=1920 --fps=25/1 --limit=100 -o output.mkv /home/joneill/Videos/original_videos_Sports_1080P_Sports_1080P-0063.mkv --target-bitrate=2073600 --good --passes=1 --threads=1 --debug

How to Optimize libvpx Build with -mcpu=native

$ # rebuild with -mcpu=native
$ make clean
$ export CFLAGS="-mcpu=native -DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdeclaration-after-statement -Wdisabled-optimization -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wuninitialized -Wunused -Wextra -Wundef -Wframe-larger-than=52000 -std=gnu89"
$ export CXXFLAGS="-mcpu=native -DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdisabled-optimization -Wextra-semi -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wmissing-declarations -Wuninitialized -Wunused -Wextra -Wno-psabi -Wc++14-extensions -Wc++17-extensions -Wc++20-extensions -std=gnu++11 -std=gnu++11"
$ ./configure 
$ make verbose=1 
# verify the build uses the sdot dot product instruction:
$ objdump -d vpxenc | grep sdot | wc -l
$ ./vpxenc --codec=vp9 --profile=0 --height=1080 --width=1920 --fps=25/1 --limit=100 -o output.mkv /home/joneill/Videos/original_videos_Sports_1080P_Sports_1080P-0063.mkv --target-bitrate=2073600 --good --passes=1 --threads=1 --debug

An investigation using Linux perf to measure CPU cycles showed that the functions taking the most time include vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon. The libvpx git repository shows these functions were optimized by Arm to use the Armv8.6-A USDOT (mixed-sign dot-product) instruction, which is supported by Ampere Processors.

The CPU cycles spent in vpx_convolve8_horiz_neon were reduced from 6.07E+11 to 2.52E+11 by using gcc -mcpu=native to enable the dot product optimization on an Ampere Altra processor, a reduction by a factor of 2.4x.

For vpx_convolve8_vert_neon, the CPU cycles were reduced from 2.46E+11 to 2.07E+11, a 16% reduction.
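The ratios quoted for the two functions follow directly from the cycle counts. As a quick check, the hypothetical helper cycle_change below uses awk to print the before/after ratio and the percentage reduction for a pair of counts.

```shell
# cycle_change BEFORE AFTER: print the ratio and the percent reduction.
cycle_change() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    printf "%.1fx %.0f%%\n", before / after, 100 * (before - after) / before
  }'
}

cycle_change 6.07E+11 2.52E+11   # vpx_convolve8_horiz_neon, prints: 2.4x 58%
cycle_change 2.46E+11 2.07E+11   # vpx_convolve8_vert_neon,  prints: 1.2x 16%
```

The same helper can be reused when comparing any candidate build against the -O2 baseline, as long as both measurements come from the same workload.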

Overall, using -mcpu=native to enable the dot product instruction sped up transcoding of the file original_videos_Sports_1080P_Sports_1080P-0063.mkv by 7% on an Ampere Altra processor by improving the application throughput. The following table shows data collected using the perf record and perf report utilities to measure CPU cycles and instructions retired.

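| Build Config | Symbol | Cycles (%) | Cycles | Instructions (%) | Instructions |
| --- | --- | --- | --- | --- | --- |
| Default Build | vpx_convolve8_horiz_neon | 8.72 | 6.07E+11 | 7.52 | 1.13E+12 |
| | Total Application | 100 | 6.97E+10 | 100 | 1.48E+11 |
| | Total Application | 100 | 6.48E+10 | 100 | 1.48E+11 |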

GCC Profile Guided Optimization

This section provides an overview of GCC’s Profile Guided Optimization (PGO) and a case study of optimizing MySQL with PGO. Profile Guided Optimization enables GCC to make better optimization decisions, including optimizing branches, reordering code blocks, inlining functions, and optimizing loops via loop unrolling, loop peeling, and vectorization. Using PGO requires modifying the build environment to do a three-part build.

  1. Build the application with profiling instrumentation enabled, gcc -fprofile-generate.
  2. Run the application on representative workloads to generate the profile data.
  3. Rebuild the application using the profile data, gcc -fprofile-use.
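The three steps can be tried on a toy program before changing a real build system. The sketch below is only an illustration: app.c and its contents are placeholders, the "representative workload" is a single run of the program, and the build is skipped if gcc is not on the PATH. Both compile commands are run from the same directory with the same output name so that step 3 finds the profile written in step 2.

```shell
# Sketch of the three-step PGO build on a placeholder program (app.c).
work=$(mktemp -d)
cat > "$work/app.c" <<'EOF'
#include <stdio.h>
int main(void) {
    long odd = 0;
    for (long i = 0; i < 1000000; i++)
        if (i % 2) odd++;          /* a branch for the profile to describe */
    printf("%ld\n", odd);
    return 0;
}
EOF

pgo_out=""
if command -v gcc >/dev/null 2>&1; then
  pgo_out=$(
    cd "$work" || exit 1
    gcc -O2 -fprofile-generate app.c -o app   # 1. instrumented build
    ./app >/dev/null                          # 2. training run writes *.gcda
    gcc -O2 -fprofile-use app.c -o app        # 3. rebuild using the profile
    ./app
  )
  echo "$pgo_out"                             # prints: 500000
else
  echo "gcc not found; skipping build" >&2
fi
rm -rf "$work"
```

For a real application the training run in step 2 should exercise the code paths that matter in production, since the compiler optimizes for whatever the profile describes.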

A challenge of using PGO is the extremely high performance overhead in step 2 above. Because an application built with gcc -fprofile-generate runs slowly, it may not be practical to run it on systems operating in a production environment. For more details, see the GCC manual’s Program Instrumentation Options section on building applications with run-time instrumentation, and the section Options That Control Optimization on rebuilding with the generated profile information.

As described in the GCC manual, -fprofile-update=atomic is recommended for multi-threaded applications, and can improve performance by collecting improved profile data.

When to Use PGO?

With PGO, GCC can better optimize applications because it has additional information, such as measurements of branches taken vs. not taken and loop trip counts. PGO is a useful optimization to try to see whether it improves performance. Performance signatures where PGO may help include applications with a significant percentage of branch mispredictions, which can be measured using the perf utility to read the CPU’s Performance Monitoring Unit (PMU) counter BR_MIS_PRED_RETIRED. Large numbers of branch mispredictions lead to a high percentage of front-end stalls, which can be measured by the STALL_FRONTEND PMU counter. Applications with a high L2 instruction cache miss rate may also benefit from PGO, possibly related to mispredicted branches. In summary, a large percentage of branch mispredictions, CPU front-end stalls, and L2 instruction cache misses are performance signatures where PGO can improve performance.
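To turn raw counter values into the percentage discussed above, a small helper is enough. This sketch assumes the counts were collected with something like the perf stat command shown in the comment; the lowercase event names and the use of a BR_RETIRED counter as the denominator are assumptions that should be checked against perf list on the target system, and mispredict_pct is a hypothetical name.

```shell
# Assumed collection command (verify event names with `perf list` first):
#   perf stat -e br_mis_pred_retired,br_retired,stall_frontend -- ./app

# mispredict_pct MISPREDICTS BRANCHES: print mispredicted branches as a
# percentage of all retired branches; fails if BRANCHES is not positive.
mispredict_pct() {
  awk -v m="$1" -v b="$2" 'BEGIN {
    if (b <= 0) exit 1
    printf "%.2f\n", 100 * m / b
  }'
}

mispredict_pct 25000 1000000   # prints: 2.50
```

Comparing this percentage between the default build and the PGO build, on the same workload, is one way to confirm that PGO is addressing the signature it was chosen for.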

MySQL Database GCC PGO Case Study

MySQL is the world’s most popular open-source database and, due to the large MySQL binary size, is an ideal candidate for GCC PGO. Without PGO information, it is impossible for GCC to correctly predict the many different code paths executed. Using PGO greatly reduces branch mispredictions, the L2 instruction cache miss rate, and CPU front-end stalls on the Ampere Altra Max Processor.

Summarizing how MySQL is optimized using GCC PGO:

  1. sysbench was used to evaluate MySQL performance.
  2. GCC PGO was trained using the MySQL MTR (mysql-test-run) test suite.
  3. sysbench’s oltp_point_select and oltp_read_only tests were used to measure the performance of the PGO build compared to the default build.
  4. The number of threads used was then varied from 1 to 1024, giving an average speedup of 29% for the oltp_point_select test and 20% for the oltp_read_only test on an Ampere Altra Max M128-30 processor.
  5. With 64 threads, PGO improved performance by 32% by improving MySQL’s throughput.

More details can be found on the Ampere Developer’s website in the MySQL Tuning Guide.


Optimizing applications requires experimenting with different strategies to determine what works best. This paper provides recommendations for different GCC compiler optimizations to generate high-performing applications running on Ampere Processors. It highlights using the -mcpu option as the easiest way to generate code that takes advantage of all the features supported by Ampere Cloud Native Processors. Two case studies, for the MySQL database and the VP9 video encoder, show the use of GCC options to optimize these applications where performance is critical.

Built for sustainable cloud computing, Ampere’s first Cloud Native Processors deliver predictable high performance, platform scalability, and power efficiency unprecedented in the industry.