C++ Performance Optimization
C++ provides fine-grained control over performance. Understanding optimization techniques and measurement tools is essential for writing efficient code.
Measurement First
Never optimize without measuring. Use profilers to identify bottlenecks:
Profiling Tools
- perf (Linux): CPU profiling
- Valgrind/Callgrind: Call graph analysis
- gprof: GNU profiler
- Intel VTune: Advanced profiling
- Tracy: Real-time profiling
# Compile with debug symbols
g++ -g -O2 program.cpp -o program
# Profile with perf
perf record ./program
perf report
# Profile with Valgrind
valgrind --tool=callgrind ./program
kcachegrind callgrind.out.*
Compiler Optimization Levels
-O0 # No optimization (default, fastest compile)
-O1 # Basic optimization
-O2 # Recommended for release builds
-O3 # Aggressive optimization (may increase code size)
-Os # Optimize for size
-Ofast # -O3 plus -ffast-math (non-conforming floating-point optimizations)
Enable link-time optimization:
g++ -O3 -flto program.cpp -o program
Memory Layout and Cache
Structure Packing
// Poor layout: 24 bytes (with padding)
struct Bad {
char a; // 1 byte + 7 padding
double b; // 8 bytes
char c; // 1 byte + 7 padding
};
// Good layout: 16 bytes
struct Good {
double b; // 8 bytes
char a; // 1 byte
char c; // 1 byte + 6 padding
};
static_assert(sizeof(Bad) == 24);  // On typical targets where double is 8-byte aligned
static_assert(sizeof(Good) == 16);
Cache-Friendly Data Structures
Prefer contiguous memory (vector) over linked structures:
// Cache-unfriendly
std::list<int> list; // Nodes scattered in memory
// Cache-friendly
std::vector<int> vec; // Contiguous memory
// Iterating the vector is typically far faster: contiguous elements stream through the cache,
// while list nodes are scattered and each step is a potential cache miss
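How much faster depends on element size, working-set size, and hardware, so measure on your own machine. A minimal, self-contained timing sketch (time_sum is a hypothetical helper, not a library function):

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

// Hypothetical helper: sums a container and prints the elapsed time
template <typename Container>
long long time_sum(const Container& c, const char* label) {
    auto start = std::chrono::steady_clock::now();
    long long sum = std::accumulate(c.begin(), c.end(), 0LL);
    auto end = std::chrono::steady_clock::now();
    std::cout << label << ": "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us (sum = " << sum << ")\n";
    return sum;
}

int main() {
    constexpr int n = 1'000'000;
    std::vector<int> vec(n);
    std::iota(vec.begin(), vec.end(), 0);
    std::list<int> lst(vec.begin(), vec.end());

    time_sum(vec, "vector"); // Contiguous: prefetch-friendly
    time_sum(lst, "list");   // Pointer chasing: cache misses dominate
}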
Structure of Arrays (SoA) vs Array of Structures (AoS)
// AoS: Poor cache utilization if you only need x coordinates
struct Point { float x, y, z; };
std::vector<Point> points;
for (const auto& p : points) {
process(p.x); // Loads y and z unnecessarily
}
// SoA: Better cache utilization
struct Points {
std::vector<float> x;
std::vector<float> y;
std::vector<float> z;
};
Points points;
for (float x_val : points.x) {
process(x_val); // Only loads x values
}
Move Semantics and RVO
Return Value Optimization (RVO)
Compiler automatically elides copies:
std::vector<int> create_vector() {
    return std::vector<int>(1000); // Prvalue return: copy elision is guaranteed since C++17
}
auto v = create_vector(); // Constructed directly in v, no copy or move
Named Return Value Optimization (NRVO)
std::string build_string() {
std::string result;
result += "Hello";
result += " World";
return result; // NRVO usually elides the copy; if not, result is moved
}
Explicit Move
std::vector<int> source = create_large_vector();
std::vector<int> dest = std::move(source); // Move, not copy
// source is left in a valid but unspecified state (empty in practice)
Avoid Unnecessary Copies
Pass by Reference
// Bad: copies vector
void process(std::vector<int> vec) { }
// Good: references vector
void process(const std::vector<int>& vec) { }
// Better: use std::span (C++20) for read-only access to any contiguous range
void process(std::span<const int> data) { }
Returning Large Objects
// Good: RVO applies
std::vector<int> compute() {
std::vector<int> result;
// ...
return result;
}
// Avoid output parameters; returning by value is just as cheap thanks to RVO and move semantics
void compute(std::vector<int>& out) { // Harder to use, no performance win
    // ...
}
Small String Optimization (SSO)
std::string optimizes for small strings:
std::string small = "Hello"; // Fits in the SSO buffer, no heap allocation (capacity is implementation-defined, commonly around 15 chars)
std::string large = "This is a very long string..."; // Heap allocated
Use string_view for temporary views:
void process(std::string_view sv) { // No copy
// ...
}
std::string str = "Hello";
process(str); // Efficient
process("World"); // No temporary string created
Reserve Capacity
Avoid reallocations:
// Bad: multiple reallocations
std::vector<int> vec;
for (int i = 0; i < 1000; ++i) {
vec.push_back(i); // May reallocate many times
}
// Good: single allocation
std::vector<int> vec;
vec.reserve(1000);
for (int i = 0; i < 1000; ++i) {
vec.push_back(i); // No reallocations
}
// Even better: direct initialization
std::vector<int> vec(1000);
std::iota(vec.begin(), vec.end(), 0);
emplace_back vs push_back
std::vector<std::string> vec;
// push_back: creates temporary, then moves
vec.push_back(std::string("Hello"));
// emplace_back: constructs in-place
vec.emplace_back("Hello"); // No temporary
// With complex types
std::vector<std::pair<int, std::string>> pairs;
pairs.push_back({1, "one"}); // Creates temporary pair
pairs.emplace_back(1, "one"); // Constructs directly
Inline Functions
Small functions benefit from inlining:
inline int square(int x) { return x * x; }
// Or mark it constexpr, which also enables compile-time evaluation
constexpr int square(int x) { return x * x; }
Modern compilers inline small functions automatically once optimizations are enabled; today the inline keyword mainly affects linkage rather than forcing the inlining decision.
Branch Prediction
Likely/Unlikely Hints (C++20)
int process(int x) {
if (x > 0) [[likely]] {
return x * 2;
} else {
return handle_error();
}
}
Avoid Branches in Tight Loops
// Bad: data-dependent branch inside a hot loop
for (int i = 0; i < n; ++i) {
    if (array[i] > 0) {
        result += array[i];
    }
}
// Better: branchless form the compiler can often vectorize
for (int i = 0; i < n; ++i) {
    result += array[i] * (array[i] > 0); // Predication instead of a branch
}
Algorithm Choice
std::vector<int> vec; // Assume this already holds many values
// O(n²): erase-remove runs n times and each pass scans the whole vector
for (int i = 0; i < n; ++i) {
    vec.erase(std::remove(vec.begin(), vec.end(), i), vec.end());
}
// O(n) on average: collect the values to drop in a hash set, then remove them in one pass
std::unordered_set<int> to_remove;
for (int i = 0; i < n; ++i) to_remove.insert(i);
vec.erase(std::remove_if(vec.begin(), vec.end(),
                         [&](int v) { return to_remove.count(v) > 0; }),
          vec.end());
Choose appropriate containers (a lookup comparison sketch follows this list):
- std::vector: Random access, cache-friendly
- std::deque: Fast insertion at both ends
- std::list: O(1) insertion/deletion at a known position (but slow traversal and poor cache locality)
- std::unordered_map: O(1) average lookup
- std::map: O(log n) lookup, ordered
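As a quick illustration of the lookup trade-off, a minimal sketch (the element count and keys are arbitrary) comparing std::map and std::unordered_map:

#include <chrono>
#include <iostream>
#include <map>
#include <unordered_map>

int main() {
    constexpr int n = 1'000'000;
    std::map<int, int> ordered;
    std::unordered_map<int, int> hashed;
    for (int i = 0; i < n; ++i) {
        ordered[i] = i;
        hashed[i] = i;
    }

    auto time_lookups = [n](const auto& m, const char* label) {
        auto start = std::chrono::steady_clock::now();
        long long hits = 0;
        for (int i = 0; i < n; ++i) {
            hits += m.count(i); // O(log n) per lookup for map, O(1) average for unordered_map
        }
        auto end = std::chrono::steady_clock::now();
        std::cout << label << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
                  << " ms (" << hits << " hits)\n";
    };

    time_lookups(ordered, "std::map");
    time_lookups(hashed, "std::unordered_map");
}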
Compile-Time Computation
Move work to compile-time:
// Runtime computation
int factorial_runtime(int n) {
int result = 1;
for (int i = 2; i <= n; ++i) {
result *= i;
}
return result;
}
// Compile-time computation
constexpr int factorial(int n) {
int result = 1;
for (int i = 2; i <= n; ++i) {
result *= i;
}
return result;
}
constexpr int fact5 = factorial(5); // Computed at compile-time
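Because the call is a constant expression, the result can even be verified at compile time:

static_assert(factorial(5) == 120); // Evaluated entirely by the compiler, zero runtime cost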
Parallel Algorithms (C++17)
#include <algorithm>
#include <execution>
std::vector<int> vec(1000000);
// Sequential
std::sort(vec.begin(), vec.end());
// Parallel
std::sort(std::execution::par, vec.begin(), vec.end());
// Parallel + vectorized
std::sort(std::execution::par_unseq, vec.begin(), vec.end());
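Note: with GCC's libstdc++, the parallel execution policies are implemented on top of Intel TBB, so TBB must be installed and linked. A sketch assuming g++ (parallel_sort.cpp is a placeholder filename):

g++ -O2 -std=c++17 parallel_sort.cpp -ltbb -o parallel_sort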
String Operations
// Slower: the intermediate result may reallocate and copy several times as it grows
std::string result = str1 + str2 + str3 + str4;
// Fast: single allocation
std::string result;
result.reserve(str1.size() + str2.size() + str3.size() + str4.size());
result += str1;
result += str2;
result += str3;
result += str4;
// Or use std::format (C++20)
auto result = std::format("{}{}{}{}", str1, str2, str3, str4);
Virtual Function Overhead
// Virtual call: runtime dispatch
struct Base {
virtual void process() { }
};
// Template: compile-time dispatch (faster)
template<typename T>
void process(T& obj) {
obj.process(); // No virtual call
}
// Consider CRTP for static polymorphism
template<typename Derived>
struct Base {
void interface() {
static_cast<Derived*>(this)->implementation();
}
};
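For completeness, a minimal sketch of how a class plugs into the CRTP base above (Widget is an illustrative name):

struct Widget : Base<Widget> {
    void implementation() {
        // Concrete behavior, resolved at compile time
    }
};

Widget w;
w.interface(); // Calls Widget::implementation() with no virtual dispatch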
Memory Pools
For many small allocations:
#include <memory_resource>
std::pmr::monotonic_buffer_resource pool;
std::pmr::vector<int> vec(&pool); // Uses pool allocator
// Allocations are bump-allocated from the pool: no per-allocation heap calls, and everything is released at once
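A slightly fuller sketch: backing the monotonic resource with a stack buffer keeps small allocations off the heap entirely (the 4 KiB buffer size is an arbitrary choice):

#include <cstddef>
#include <memory_resource>
#include <vector>

int main() {
    std::byte buffer[4096]; // Arbitrary stack buffer
    std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer));
    std::pmr::vector<int> vec(&pool); // Each push_back allocates from the buffer
    for (int i = 0; i < 100; ++i) {
        vec.push_back(i);
    }
    // If the buffer runs out, the resource falls back to its upstream (heap) resource;
    // all memory is released at once when pool is destroyed
}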
Benchmarking
Use Google Benchmark:
#include <benchmark/benchmark.h>
static void BM_VectorPush(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> vec;
        for (int i = 0; i < state.range(0); ++i) {
            vec.push_back(i);
        }
        benchmark::DoNotOptimize(vec.data()); // Keep the work from being optimized away
    }
}
BENCHMARK(BM_VectorPush)->Range(8, 8<<10);
BENCHMARK_MAIN();
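Google Benchmark is an external library; a typical build line with g++ (bench.cpp is a placeholder filename, and package paths may differ on your system):

g++ -O2 -std=c++17 bench.cpp -lbenchmark -lpthread -o bench
./bench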
Common Pitfalls
- Premature optimization: Profile first!
- Over-optimization: Hurting readability
- Ignoring algorithmic complexity: O(n²) vs O(n log n)
- Not using compiler optimizations: Always use -O2 or -O3
- Unnecessary abstractions: Virtual functions when not needed
Best Practices
- Measure, don’t guess: Use profilers
- Optimize hot paths: Focus on 10% of code that takes 90% of time
- Choose right algorithms: Big-O matters more than micro-optimizations
- Use modern C++ idioms: Move semantics, RVO, constexpr
- Enable compiler optimizations: -O2/-O3, LTO
- Cache-friendly data structures: Contiguous memory
- Avoid copies: References, move, in-place construction
Measuring the Effect of reserve
The following complete program times one million push_back calls with and without a prior reserve:
#include <iostream>
#include <vector>
#include <chrono>

int main() {
    const int n = 1000000;

    // Without reserve
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<int> vec1;
    for (int i = 0; i < n; ++i) {
        vec1.push_back(i);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration1 = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    // With reserve
    start = std::chrono::high_resolution_clock::now();
    std::vector<int> vec2;
    vec2.reserve(n);
    for (int i = 0; i < n; ++i) {
        vec2.push_back(i);
    }
    end = std::chrono::high_resolution_clock::now();
    auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "Without reserve: " << duration1.count() << " μs\n";
    std::cout << "With reserve: " << duration2.count() << " μs\n";
    std::cout << "Speedup: " << (double)duration1.count() / duration2.count() << "x\n";

    return 0;
}
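Compile the demo with optimizations enabled; at -O0 the timings mostly reflect debug-mode overhead (reserve_demo.cpp is a placeholder filename):

g++ -O2 -std=c++17 reserve_demo.cpp -o reserve_demo
./reserve_demo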