C++ Performance Optimization
C++ provides fine-grained control over performance. Understanding optimization techniques and measurement tools is essential for writing efficient code.
Measurement First
Never optimize without measuring. Use profilers to identify bottlenecks:
Profiling Tools
- perf (Linux): CPU profiling
- Valgrind/Callgrind: Call graph analysis
- gprof: GNU profiler
- Intel VTune: Advanced profiling
- Tracy: Real-time profiling
# Compile with debug symbols
g++ -g -O2 program.cpp -o program
# Profile with perf
perf record ./program
perf report
# Profile with Valgrind
valgrind --tool=callgrind ./program
kcachegrind callgrind.out.*
Compiler Optimization Levels
-O0 # No optimization (default, fastest compile)
-O1 # Basic optimization
-O2 # Recommended for release builds
-O3 # Aggressive optimization (may increase code size)
-Os # Optimize for size
-Ofast # -O3 plus -ffast-math (non-conforming floating-point optimizations)
Enable link-time optimization:
g++ -O3 -flto program.cpp -o program
Memory Layout and Cache
Structure Packing
// Poor layout: 24 bytes (with padding)
struct Bad {
char a; // 1 byte + 7 padding
double b; // 8 bytes
char c; // 1 byte + 7 padding
};
// Good layout: 16 bytes
struct Good {
double b; // 8 bytes
char a; // 1 byte
char c; // 1 byte + 6 padding
};
static_assert(sizeof(Bad) == 24);  // On typical targets where double is 8-byte aligned
static_assert(sizeof(Good) == 16);
Cache-Friendly Data Structures
Prefer contiguous memory (vector) over linked structures:
// Cache-unfriendly
std::list<int> list; // Nodes scattered in memory
// Cache-friendly
std::vector<int> vec; // Contiguous memory
// Iterating the vector is typically far faster: contiguous elements stream through the cache,
// while list nodes are scattered and each step is a potential cache miss
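How much faster depends on element size, working-set size, and hardware, so measure on your own machine. A minimal, self-contained timing sketch (time_sum is a hypothetical helper, not a library function):

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

// Hypothetical helper: sums a container and prints the elapsed time
template <typename Container>
long long time_sum(const Container& c, const char* label) {
    auto start = std::chrono::steady_clock::now();
    long long sum = std::accumulate(c.begin(), c.end(), 0LL);
    auto end = std::chrono::steady_clock::now();
    std::cout << label << ": "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us (sum = " << sum << ")\n";
    return sum;
}

int main() {
    constexpr int n = 1'000'000;
    std::vector<int> vec(n);
    std::iota(vec.begin(), vec.end(), 0);
    std::list<int> lst(vec.begin(), vec.end());

    time_sum(vec, "vector"); // Contiguous: prefetch-friendly
    time_sum(lst, "list");   // Pointer chasing: cache misses dominate
}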
Structure of Arrays (SoA) vs Array of Structures (AoS)
// AoS: Poor cache utilization if you only need x coordinates
struct Point { float x, y, z; };
std::vector<Point> points;
for (const auto& p : points) {
process(p.x); // Loads y and z unnecessarily
}
// SoA: Better cache utilization
struct Points {
std::vector<float> x;
std::vector<float> y;
std::vector<float> z;
};
Points points;
for (float x_val : points.x) {
process(x_val); // Only loads x values
}
Move Semantics and RVO
Return Value Optimization (RVO)
Compiler automatically elides copies:
std::vector<int> create_vector() {
    return std::vector<int>(1000); // Prvalue return: copy elision is guaranteed since C++17
}
auto v = create_vector(); // Constructed directly in v, no copy or move
Named Return Value Optimization (NRVO)
std::string build_string() {
std::string result;
result += "Hello";
result += " World";
return result; // NRVO usually elides the copy; if not, result is moved
}
Explicit Move
std::vector<int> source = create_large_vector();
std::vector<int> dest = std::move(source); // Move, not copy
// source is left in a valid but unspecified state (empty in practice)
Avoid Unnecessary Copies
Pass by Reference
// Bad: copies vector
void process(std::vector<int> vec) { }
// Good: references vector
void process(const std::vector<int>& vec) { }
// Better: use std::span (C++20) for read-only access to any contiguous range
void process(std::span<const int> data) { }
Returning Large Objects
// Good: RVO applies
std::vector<int> compute() {
std::vector<int> result;
// ...
return result;
}
// Avoid output parameters; returning by value is just as cheap thanks to RVO and move semantics
void compute(std::vector<int>& out) { // Harder to use, no performance win
    // ...
}
Small String Optimization (SSO)
std::string optimizes for small strings:
std::string small = "Hello"; // Fits in the SSO buffer, no heap allocation (capacity is implementation-defined, commonly around 15 chars)
std::string large = "This is a very long string..."; // Heap allocated
Use string_view for temporary views:
void process(std::string_view sv) { // No copy
// ...
}
std::string str = "Hello";
process(str); // Efficient
process("World"); // No temporary string created
Reserve Capacity
Avoid reallocations:
// Bad: multiple reallocations
std::vector<int> vec;
for (int i = 0; i < 1000; ++i) {
vec.push_back(i); // May reallocate many times
}
// Good: single allocation
std::vector<int> vec;
vec.reserve(1000);
for (int i = 0; i < 1000; ++i) {
vec.push_back(i); // No reallocations
}
// Even better: direct initialization
std::vector<int> vec(1000);
std::iota(vec.begin(), vec.end(), 0);
emplace_back vs push_back
std::vector<std::string> vec;
// push_back: creates temporary, then moves
vec.push_back(std::string("Hello"));
// emplace_back: constructs in-place
vec.emplace_back("Hello"); // No temporary
// With complex types
std::vector<std::pair<int, std::string>> pairs;
pairs.push_back({1, "one"}); // Creates temporary pair
pairs.emplace_back(1, "one"); // Constructs directly
Inline Functions
Small functions benefit from inlining:
inline int square(int x) { return x * x; }
// Or mark it constexpr, which also enables compile-time evaluation
constexpr int square(int x) { return x * x; }
Modern compilers inline small functions automatically once optimizations are enabled; today the inline keyword mainly affects linkage rather than forcing the inlining decision.
Branch Prediction
Likely/Unlikely Hints (C++20)
int process(int x) {
if (x > 0) [[likely]] {
return x * 2;
} else {
return handle_error();
}
}
Avoid Branches in Tight Loops
// Bad: data-dependent branch inside a hot loop
for (int i = 0; i < n; ++i) {
    if (array[i] > 0) {
        result += array[i];
    }
}
// Better: branchless form the compiler can often vectorize
for (int i = 0; i < n; ++i) {
    result += array[i] * (array[i] > 0); // Predication instead of a branch
}
Algorithm Choice
std::vector<int> vec; // Assume this already holds many values
// O(n²): erase-remove runs n times and each pass scans the whole vector
for (int i = 0; i < n; ++i) {
    vec.erase(std::remove(vec.begin(), vec.end(), i), vec.end());
}
// O(n) on average: collect the values to drop in a hash set, then remove them in one pass
std::unordered_set<int> to_remove;
for (int i = 0; i < n; ++i) to_remove.insert(i);
vec.erase(std::remove_if(vec.begin(), vec.end(),
                         [&](int v) { return to_remove.count(v) > 0; }),
          vec.end());
Choose appropriate containers (a lookup comparison sketch follows this list):
- std::vector: Random access, cache-friendly
- std::deque: Fast insertion at both ends
- std::list: O(1) insertion/deletion at a known position (but slow traversal and poor cache locality)
- std::unordered_map: O(1) average lookup
- std::map: O(log n) lookup, ordered
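As a quick illustration of the lookup trade-off, a minimal sketch (the element count and keys are arbitrary) comparing std::map and std::unordered_map:

#include <chrono>
#include <iostream>
#include <map>
#include <unordered_map>

int main() {
    constexpr int n = 1'000'000;
    std::map<int, int> ordered;
    std::unordered_map<int, int> hashed;
    for (int i = 0; i < n; ++i) {
        ordered[i] = i;
        hashed[i] = i;
    }

    auto time_lookups = [n](const auto& m, const char* label) {
        auto start = std::chrono::steady_clock::now();
        long long hits = 0;
        for (int i = 0; i < n; ++i) {
            hits += m.count(i); // O(log n) per lookup for map, O(1) average for unordered_map
        }
        auto end = std::chrono::steady_clock::now();
        std::cout << label << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
                  << " ms (" << hits << " hits)\n";
    };

    time_lookups(ordered, "std::map");
    time_lookups(hashed, "std::unordered_map");
}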
Compile-Time Computation
Move work to compile-time:
// Runtime computation
int factorial_runtime(int n) {
int result = 1;
for (int i = 2; i <= n; ++i) {
result *= i;
}
return result;
}
// Compile-time computation
constexpr int factorial(int n) {
int result = 1;
for (int i = 2; i <= n; ++i) {
result *= i;
}
return result;
}
constexpr int fact5 = factorial(5); // Computed at compile-time
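Because the call is a constant expression, the result can even be verified at compile time:

static_assert(factorial(5) == 120); // Evaluated entirely by the compiler, zero runtime cost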
Parallel Algorithms (C++17)
#include <algorithm>
#include <execution>
std::vector<int> vec(1000000);
// Sequential
std::sort(vec.begin(), vec.end());
// Parallel
std::sort(std::execution::par, vec.begin(), vec.end());
// Parallel + vectorized
std::sort(std::execution::par_unseq, vec.begin(), vec.end());
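Note: with GCC's libstdc++, the parallel execution policies are implemented on top of Intel TBB, so TBB must be installed and linked. A sketch assuming g++ (parallel_sort.cpp is a placeholder filename):

g++ -O2 -std=c++17 parallel_sort.cpp -ltbb -o parallel_sort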
String Operations
// Slower: the intermediate result may reallocate and copy several times as it grows
std::string result = str1 + str2 + str3 + str4;
// Fast: single allocation
std::string result;
result.reserve(str1.size() + str2.size() + str3.size() + str4.size());
result += str1;
result += str2;
result += str3;
result += str4;
// Or use std::format (C++20)
auto result = std::format("{}{}{}{}", str1, str2, str3, str4);
Virtual Function Overhead
// Virtual call: runtime dispatch
struct Base {
virtual void process() { }
};
// Template: compile-time dispatch (faster)
template<typename T>
void process(T& obj) {
obj.process(); // No virtual call
}
// Consider CRTP for static polymorphism
template<typename Derived>
struct Base {
void interface() {
static_cast<Derived*>(this)->implementation();
}
};
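For completeness, a minimal sketch of how a class plugs into the CRTP base above (Widget is an illustrative name):

struct Widget : Base<Widget> {
    void implementation() {
        // Concrete behavior, resolved at compile time
    }
};

Widget w;
w.interface(); // Calls Widget::implementation() with no virtual dispatch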
Memory Pools
For many small allocations:
#include <memory_resource>
std::pmr::monotonic_buffer_resource pool;
std::pmr::vector<int> vec(&pool); // Uses pool allocator
// Allocations are bump-allocated from the pool: no per-allocation heap calls, and everything is released at once
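A slightly fuller sketch: backing the monotonic resource with a stack buffer keeps small allocations off the heap entirely (the 4 KiB buffer size is an arbitrary choice):

#include <cstddef>
#include <memory_resource>
#include <vector>

int main() {
    std::byte buffer[4096]; // Arbitrary stack buffer
    std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer));
    std::pmr::vector<int> vec(&pool); // Each push_back allocates from the buffer
    for (int i = 0; i < 100; ++i) {
        vec.push_back(i);
    }
    // If the buffer runs out, the resource falls back to its upstream (heap) resource;
    // all memory is released at once when pool is destroyed
}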
Benchmarking
Use Google Benchmark:
#include <benchmark/benchmark.h>
static void BM_VectorPush(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> vec;
        for (int i = 0; i < state.range(0); ++i) {
            vec.push_back(i);
        }
        benchmark::DoNotOptimize(vec.data()); // Keep the work from being optimized away
    }
}
BENCHMARK(BM_VectorPush)->Range(8, 8<<10);
BENCHMARK_MAIN();
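Google Benchmark is an external library; a typical build line with g++ (bench.cpp is a placeholder filename, and package paths may differ on your system):

g++ -O2 -std=c++17 bench.cpp -lbenchmark -lpthread -o bench
./bench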
Common Pitfalls
- Premature optimization: Profile first!
- Over-optimization: Hurting readability
- Ignoring algorithmic complexity: O(n²) vs O(n log n)
- Not using compiler optimizations: Always use -O2 or -O3
- Unnecessary abstractions: Virtual functions when not needed
Best Practices
- Measure, don’t guess: Use profilers
- Optimize hot paths: Focus on 10% of code that takes 90% of time
- Choose right algorithms: Big-O matters more than micro-optimizations
- Use modern C++ idioms: Move semantics, RVO, constexpr
- Enable compiler optimizations: -O2/-O3, LTO
- Cache-friendly data structures: Contiguous memory
- Avoid copies: References, move, in-place construction
Measuring the Effect of reserve
The following complete program times one million push_back calls with and without a prior reserve:
#include <iostream>
#include <vector>
#include <chrono>

int main() {
    const int n = 1000000;

    // Without reserve
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<int> vec1;
    for (int i = 0; i < n; ++i) {
        vec1.push_back(i);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration1 = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    // With reserve
    start = std::chrono::high_resolution_clock::now();
    std::vector<int> vec2;
    vec2.reserve(n);
    for (int i = 0; i < n; ++i) {
        vec2.push_back(i);
    }
    end = std::chrono::high_resolution_clock::now();
    auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "Without reserve: " << duration1.count() << " μs\n";
    std::cout << "With reserve: " << duration2.count() << " μs\n";
    std::cout << "Speedup: " << (double)duration1.count() / duration2.count() << "x\n";

    return 0;
}
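Compile the demo with optimizations enabled; at -O0 the timings mostly reflect debug-mode overhead (reserve_demo.cpp is a placeholder filename):

g++ -O2 -std=c++17 reserve_demo.cpp -o reserve_demo
./reserve_demo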