Unmasking Noisy Neighbors: How Netflix Leverages eBPF for Enhanced Infrastructure Observability

In the bustling world of cloud computing, Netflix, like many other tech giants, faces a constant challenge: ensuring seamless performance for its diverse user base while managing a multi-tenant environment. This complex ecosystem often encounters the "noisy neighbor" problem, where a single container or system service hogging server resources can inadvertently slow down its neighbors.

This article delves into how Netflix's Compute and Performance Engineering teams tackled this problem by leveraging eBPF, a powerful technology that allows for low-overhead kernel instrumentation. Let's explore how this innovative approach transformed their infrastructure observability and empowered them to proactively manage noisy neighbor issues.

Understanding the Noisy Neighbor Problem

Imagine a bustling city street with multiple cars vying for space. If one car suddenly decides to park in the middle of the road, it creates a bottleneck, causing delays and frustration for everyone else. Similarly, in a multi-tenant environment, a container that consumes a disproportionate amount of CPU resources acts like a "noisy neighbor," slowing down its neighbors.

This problem often manifests as increased latency in the user's experience, making it crucial to identify and address the root cause. However, traditional performance analysis tools like "perf" can introduce significant overhead, potentially worsening the issue they are meant to diagnose. Moreover, these tools are typically deployed only after the problem has already occurred, which makes it difficult to pinpoint the culprit.

eBPF: A Game Changer for Continuous Instrumentation

Netflix's engineers recognized the need for a continuous, low-overhead solution to monitor and mitigate noisy neighbors. They turned to eBPF (extended Berkeley Packet Filter), a technology for running sandboxed, verifier-checked programs inside the Linux kernel without changing kernel source code or loading kernel modules.

eBPF's key advantage lies in its ability to tap into the Linux scheduler, the heart of the operating system that determines which processes get to run on the CPU and for how long. By instrumenting the scheduler, Netflix engineers could gain real-time insights into container behavior and pinpoint the source of performance degradation.

Instrumenting the Run Queue Latency

To detect noisy neighbors, Netflix engineers focused on a crucial metric: run queue latency. This metric measures the time a process spends waiting in the scheduler's queue before getting allocated CPU time. Extended waiting in this queue can indicate potential performance issues, particularly when a container is not fully utilizing its allocated CPU resources.

They implemented eBPF probes using three key hooks:

  • sched_wakeup and sched_wakeup_new: These hooks are triggered whenever a process transitions from a "sleeping" state to a "runnable" state. They allow engineers to identify when a process is ready to run and waiting for CPU time. A timestamp is generated and stored in an eBPF hash map, using the process ID as the key.
  • sched_switch: This hook is triggered when the CPU switches between processes. It provides access to the process currently running and the process about to take over. Using the upcoming task's process ID as the key, engineers can retrieve the timestamp stored in the eBPF map, which marks the time the process entered the queue. Subtracting that stored timestamp from the current time gives the run queue latency.
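
The bookkeeping behind these hooks can be modeled in a few lines. The sketch below is a plain Go simulation, not the kernel-side probe: the real logic runs as eBPF code inside the kernel, and the type and function names here (runqTracker, onWakeup, onSwitch) are hypothetical. A Go map stands in for the eBPF hash map keyed by process ID.

```go
package main

import "fmt"

// runqTracker models the state the probes keep. In the real system this is an
// eBPF hash map in the kernel; a Go map stands in for it here (assumption).
type runqTracker struct {
	enqueuedAt map[uint32]uint64 // PID -> timestamp (ns) when the task became runnable
}

func newRunqTracker() *runqTracker {
	return &runqTracker{enqueuedAt: make(map[uint32]uint64)}
}

// onWakeup mirrors sched_wakeup / sched_wakeup_new: remember when the task
// entered the run queue.
func (t *runqTracker) onWakeup(pid uint32, nowNs uint64) {
	t.enqueuedAt[pid] = nowNs
}

// onSwitch mirrors sched_switch: look up the incoming task's enqueue time and
// return how long it waited. ok is false when no wakeup was recorded.
func (t *runqTracker) onSwitch(nextPid uint32, nowNs uint64) (latencyNs uint64, ok bool) {
	start, found := t.enqueuedAt[nextPid]
	if !found {
		return 0, false
	}
	delete(t.enqueuedAt, nextPid) // one latency sample per wakeup
	if nowNs < start {
		return 0, false // guard against unsigned underflow on out-of-order timestamps
	}
	return nowNs - start, true
}

func main() {
	t := newRunqTracker()
	t.onWakeup(1234, 1_000_000) // task becomes runnable at t = 1.00 ms
	if lat, ok := t.onSwitch(1234, 1_250_000); ok { // task gets the CPU at t = 1.25 ms
		fmt.Printf("pid 1234 waited %d ns in the run queue\n", lat)
	}
}
```

Deleting the entry after the lookup keeps the map from growing without bound and ensures a later switch without a fresh wakeup does not reuse a stale timestamp.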

Navigating Kernel Data Structures

eBPF programs can read actual kernel data structures, such as the process struct (task_struct, also called a task), which carries valuable information about each process. For Netflix's use case, they needed to associate a process with its corresponding container by retrieving the cgroup ID. However, the cgroup information within the process struct is protected by RCU (Read-Copy-Update) locking.

To access this RCU-protected information safely and efficiently, engineers leveraged kfuncs, which are kernel functions callable from eBPF programs. The kfuncs bpf_rcu_read_lock and bpf_rcu_read_unlock open and close an RCU read-side critical section, so the eBPF program can safely read the cgroup ID from the task struct.

Packaging and Transmitting Data with eBPF Ring Buffers

Once the data is ready, it needs to be packaged and transmitted to userspace for analysis. Netflix engineers chose the eBPF ring buffer, a highly efficient and user-friendly mechanism. Ring buffers allow for variable-length data records and enable data reading without the need for extra memory copying or system calls.

However, the sheer volume of data points generated could strain the userspace application's CPU. To address this, they implemented a rate limiter within the eBPF program itself, so that only a sample of the events is forwarded to userspace.
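
The article does not specify how the rate limiter works. One simple scheme, shown here as a Go model and purely as an assumption, is to allow at most one sample per cgroup per fixed interval. The names (rateLimiter, allow) and the 1 ms interval are hypothetical; in the real system this check would run inside the eBPF program before an event is written to the ring buffer.

```go
package main

import "fmt"

// rateLimiter lets through at most one sample per cgroup per interval.
// This is an illustrative scheme, not necessarily the one Netflix uses.
type rateLimiter struct {
	minIntervalNs uint64
	lastEmitNs    map[uint64]uint64 // cgroup ID -> time of the last emitted sample
}

func newRateLimiter(minIntervalNs uint64) *rateLimiter {
	return &rateLimiter{
		minIntervalNs: minIntervalNs,
		lastEmitNs:    make(map[uint64]uint64),
	}
}

// allow reports whether a sample for cgroupID at nowNs should be emitted.
// Timestamps are assumed to be monotonically non-decreasing per cgroup.
func (r *rateLimiter) allow(cgroupID, nowNs uint64) bool {
	last, seen := r.lastEmitNs[cgroupID]
	if seen && nowNs-last < r.minIntervalNs {
		return false // too soon after the previous sample: drop this one
	}
	r.lastEmitNs[cgroupID] = nowNs
	return true
}

func main() {
	rl := newRateLimiter(1_000_000) // hypothetical budget: one sample per ms per cgroup
	emitted, total := 0, 0
	// Simulate an event every 0.1 ms for 10 ms, all from cgroup 42.
	for now := uint64(0); now < 10_000_000; now += 100_000 {
		total++
		if rl.allow(42, now) {
			emitted++
		}
	}
	fmt.Printf("emitted %d of %d events\n", emitted, total)
}
```

Dropping events at the source keeps the cost of sampling in the kernel, where each check is a map lookup and a comparison, rather than in the userspace consumer.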

Userspace Application: Processing and Visualizing Data

Netflix developed a userspace application using Go to process events from the eBPF ring buffer and emit metrics to their monitoring backend, Atlas. Each event contains a run queue latency sample with a cgroup ID. These metrics are further categorized and visualized:

  • Containers: If a cgroup ID is associated with a container, a percentile timer Atlas metric (runq.latency) is generated for that container.
  • System Services: If no container association is found, the event is categorized as a system service.
  • Preemptions: A counter metric (sched.switch.out) is incremented to track preemptions occurring for each container's processes. The prev_cgroup_id of the preempted process allows for tagging the metric with the cause of preemption, whether it's due to another process within the same container, a process in another container, or a system service.
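
The categorization above can be sketched in Go as follows. The event layout (three little-endian 64-bit fields), the tag strings, and the container lookup table are assumptions made for illustration; the article does not give the actual struct or the Atlas client calls, so the sketch only prints what would be emitted. It follows the reading that prev_cgroup_id identifies the preempted task and cgroup_id the task that took the CPU.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// runqEvent is a hypothetical wire layout for one ring buffer record.
type runqEvent struct {
	PrevCgroupID uint64 // cgroup of the task that was switched out
	CgroupID     uint64 // cgroup of the task that waited and now runs
	RunqLatNs    uint64 // time the incoming task spent in the run queue
}

const eventSize = 24 // three uint64 fields

// decodeEvent parses one raw record read from the ring buffer.
func decodeEvent(raw []byte) (runqEvent, error) {
	if len(raw) < eventSize {
		return runqEvent{}, errors.New("short event record")
	}
	return runqEvent{
		PrevCgroupID: binary.LittleEndian.Uint64(raw[0:8]),
		CgroupID:     binary.LittleEndian.Uint64(raw[8:16]),
		RunqLatNs:    binary.LittleEndian.Uint64(raw[16:24]),
	}, nil
}

// owner maps a cgroup ID to a container name, or to "system_service" when
// the ID is not associated with any known container.
func owner(cgroupID uint64, containers map[uint64]string) string {
	if name, ok := containers[cgroupID]; ok {
		return name
	}
	return "system_service"
}

// preemptCause classifies who took the CPU from the preempted task.
func preemptCause(ev runqEvent, containers map[uint64]string) string {
	_, nextIsContainer := containers[ev.CgroupID]
	switch {
	case !nextIsContainer:
		return "system_service"
	case ev.CgroupID == ev.PrevCgroupID:
		return "same_container"
	default:
		return "different_container"
	}
}

func main() {
	containers := map[uint64]string{100: "container1", 200: "container2"} // hypothetical IDs

	raw := make([]byte, eventSize)
	binary.LittleEndian.PutUint64(raw[0:8], 100)      // container1 was switched out
	binary.LittleEndian.PutUint64(raw[8:16], 200)     // container2 took the CPU
	binary.LittleEndian.PutUint64(raw[16:24], 83_000) // after waiting 83 µs

	ev, err := decodeEvent(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("runq.latency owner=%s value=%dns\n", owner(ev.CgroupID, containers), ev.RunqLatNs)
	if _, ok := containers[ev.PrevCgroupID]; ok { // count preemptions of container tasks only
		fmt.Printf("sched.switch.out owner=%s cause=%s\n",
			owner(ev.PrevCgroupID, containers), preemptCause(ev, containers))
	}
}
```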

Unveiling the Noisy Neighbor: A Case Study

Netflix's metrics clearly demonstrated the power of their eBPF-based approach in identifying noisy neighbor issues. In one scenario, launching a new container (container2) that fully utilized all CPUs on the host led to a significant spike in the 99th percentile run queue latency (runq.latency) for an existing container (container1).

The sched.switch.out metric confirmed that this spike was caused by increased preemptions from system processes, indicating that system services were competing with containers for CPU time. This pointed to the newly launched container (container2) as the noisy neighbor: its heavy CPU consumption pushed system processes into competing with container1 for CPU time.

Optimizing eBPF Code for Minimal Overhead

Netflix's commitment to low overhead instrumentation led them to develop an open-source tool called bpftop for measuring the overhead of eBPF code. Their profiling revealed that the instrumentation added less than 600 nanoseconds to each sched_* hook, demonstrating minimal performance impact.
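
To put the 600 nanosecond figure in perspective, the per-hook cost can be turned into a share of CPU time once an event rate is assumed. The calculation below is a back-of-the-envelope sketch: the rate of 10,000 scheduler events per second per CPU is a hypothetical input, not a number from Netflix, and 600 ns is treated as an upper bound.

```go
package main

import "fmt"

// overheadPPM returns the share of one CPU's time spent in instrumentation,
// in parts per million, for a given per-event cost and event rate on that CPU.
// costNs * eventsPerSec is nanoseconds of overhead per second of wall time;
// dividing by 1e9 gives a fraction, and multiplying by 1e6 gives ppm, which
// simplifies to a division by 1000.
func overheadPPM(costNs, eventsPerSec uint64) uint64 {
	return costNs * eventsPerSec / 1000
}

func main() {
	const costNs = 600          // upper bound per sched_* hook, from the article
	const eventsPerSec = 10_000 // hypothetical scheduler event rate per CPU
	ppm := overheadPPM(costNs, eventsPerSec)
	fmt.Printf("%d ppm, or %.2f%% of one CPU\n", ppm, float64(ppm)/10_000)
}
```

Under these assumptions the instrumentation consumes 6 ms of every second on a CPU. The share scales linearly with the event rate, so hosts with heavier context switching would see proportionally more.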

Through extensive testing and experimentation, they identified several optimizations to minimize the overhead of their eBPF code:

  • BPF Map Optimization: They found that BPF_MAP_TYPE_HASH was the most performant for storing enqueued timestamps, while BPF_MAP_TYPE_TASK_STORAGE resulted in significant performance degradation.
  • BPF Helper Function Reduction: They optimized code by accessing task struct members directly, eliminating the need for BPF_CORE_READ helpers, which contributed to performance overhead.
  • Conditional Logic for Kernel Tasks: By implementing early exit conditions and conditional logic, they prevented unnecessary operations for kernel tasks, which were found to be largely irrelevant for their monitoring objectives.

Looking Ahead: The Future of eBPF and Infrastructure Observability

Netflix's experience underscores the transformative power of eBPF for infrastructure observability. Their work has not only enabled them to effectively manage noisy neighbor issues but also provided valuable insights into the Linux scheduler's behavior.

As eBPF adoption continues to grow, we can expect further advancements in infrastructure monitoring and management. Projects like sched_ext, which allows scheduling policies to be implemented as eBPF programs and tuned to workload needs, promise even more tailored and efficient resource management.

Conclusion

Netflix's journey in leveraging eBPF for noisy neighbor detection demonstrates the potential of this technology to enhance infrastructure observability and empower developers to proactively manage complex cloud environments. By continuously instrumenting the Linux scheduler, they have gained valuable insights into container behavior, refined CPU isolation strategies, and ultimately improved the user experience. This success story not only highlights the power of eBPF but also paves the way for a future where eBPF becomes an integral part of infrastructure management, offering unprecedented levels of control and visibility.
