DevOps | June 22, 2026 | 4 min read

Docker networking latency: Debugging with eBPF and tcpretrans

Docker networking latency can kill your performance. Learn how to use eBPF and tcpretrans to find silent packet loss on your high-traffic VPS.

Tags: Docker, eBPF, Linux, Networking, Performance, Debugging, DevOps, CI/CD

Last month, one of my production microservices started acting sluggish. Monitoring tools showed intermittent 500ms spikes in request time, but the CPU and memory usage looked perfectly healthy. It wasn't an application logic bug or a database lock; it was silent packet loss happening somewhere in the virtual network stack.

When you're running high-traffic services on a VPS, standard tools like ping or mtr often lie to you because they don't capture the ephemeral drops happening inside the virtual bridge. If you've ever dealt with this, you know the frustration of "it works on my machine" vs. "it's dropping 2% of requests in production."

Why Docker networking gets complicated

By default, Docker uses a veth pair to connect your container to the host bridge. This works fine for low-traffic apps, but at scale, the overhead of the Linux bridge and iptables rules starts to add up. When traffic spikes, buffers fill up, and the kernel starts dropping packets before they even reach your application's socket.
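One quick way to see whether the bridge or a veth interface is shedding packets is to read the per-interface drop counters the kernel exposes under /sys/class/net. The sketch below is only an illustration and was not part of the original debugging session; print_drops is a hypothetical helper name. The counters are cumulative, so compare two readings taken around a traffic burst.

```shell
# Print rx/tx drop counters for every interface under a sysfs-style directory.
# print_drops is a hypothetical helper; the default path is the standard sysfs location.
print_drops() {
  local base="${1:-/sys/class/net}"
  local dev name rx tx
  for dev in "$base"/*; do
    # Skip entries without statistics (or an unmatched glob)
    [ -r "$dev/statistics/rx_dropped" ] || continue
    name=$(basename "$dev")
    rx=$(cat "$dev/statistics/rx_dropped")
    tx=$(cat "$dev/statistics/tx_dropped")
    printf '%-16s rx_dropped=%s tx_dropped=%s\n' "$name" "$rx" "$tx"
  done
}

print_drops
```

A counter that climbs on docker0 or on one veth interface during a spike points at the virtual network path rather than the application.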

I first tried to debug this by dumping traffic with tcpdump on the host interface. It was a mess. I had thousands of packets per second, and finding the specific retransmissions was like looking for a needle in a haystack. I needed something that understood the kernel’s state, not just a raw stream of data.

Using eBPF for deep observability

eBPF (Extended Berkeley Packet Filter) changed the game for me. Instead of capturing all traffic, eBPF lets you hook into the kernel functions and tracepoints that fire when a packet is dropped or retransmitted. The program runs in the kernel and only executes when its hook fires, so tracing a rare event like a retransmission adds very little overhead compared with copying every packet to user space.
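To make "hooking into the kernel" a little more concrete, the sketch below checks whether two tracepoints commonly used for this kind of tracing are visible on the host: tcp:tcp_retransmit_skb and skb:kfree_skb. It assumes tracefs is mounted at /sys/kernel/tracing and that you run it as root, since tracefs is usually not readable by regular users. check_hooks is a hypothetical helper name.

```shell
# Report whether selected kernel tracepoints are visible under tracefs.
# check_hooks is a hypothetical helper; run as root, since tracefs is
# typically not readable by unprivileged users.
check_hooks() {
  local root="${1:-/sys/kernel/tracing}"
  local tp path
  for tp in tcp:tcp_retransmit_skb skb:kfree_skb; do
    # Tracepoint "group:name" lives at events/group/name
    path="$root/events/$(echo "$tp" | tr ':' '/')"
    if [ -d "$path" ]; then
      echo "available: $tp"
    else
      echo "not visible: $tp"
    fi
  done
}

check_hooks
```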

To start, I installed the BCC tools on my Ubuntu 22.04 host, where the package is named bpfcc-tools. These tools are the gold standard for Linux performance debugging.


Bash
sudo apt install bpfcc-tools linux-headers-$(uname -r)

Once installed, the tcpretrans tool became my best friend. It traces TCP retransmissions, which are usually the clearest symptom of packet loss.
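Before tracing individual retransmissions, the system-wide counters in /proc/net/snmp give a rough baseline. The sketch below computes the share of sent TCP segments that were retransmissions; retrans_ratio is a hypothetical helper name, and the counters are cumulative since boot, so the figure is a long-run average rather than a live rate.

```shell
# Compute RetransSegs / OutSegs from the Tcp: lines of /proc/net/snmp.
# retrans_ratio is a hypothetical helper; the first Tcp: line holds the
# column names, the second holds the values.
retrans_ratio() {
  local snmp="${1:-/proc/net/snmp}"
  awk '
    $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
    $1 == "Tcp:" {
      out = $(col["OutSegs"]); re = $(col["RetransSegs"])
      if (out > 0)
        printf "OutSegs=%s RetransSegs=%s ratio=%.2f%%\n", out, re, 100 * re / out
    }
  ' "$snmp"
}

if [ -r /proc/net/snmp ]; then
  retrans_ratio
fi
```

Taking two readings a minute apart and comparing the difference gives a better picture of what is happening during a spike.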

Running tcpretrans to find the drops

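When I ran tcpretrans, I immediately saw the issue. With no options it prints one line per retransmission, including the local and remote endpoints and the TCP state. The -c flag switches to a per-flow count summary instead, and the -l flag also includes tail loss probe attempts. On Ubuntu, the bpfcc-tools package installs the tool as tcpretrans-bpfcc.


Bash
sudo tcpretrans-bpfcc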

The output was eye-opening:


TEXT
TIME     PID    IP LADDR:LPORT          RADDR:RPORT            STATE
14:02:11 1240   4  172.17.0.5:8080      10.0.0.15:45212        ESTABLISHED
14:02:12 1240   4  172.17.0.5:8080      10.0.0.15:45212        ESTABLISHED

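Every time my container at 172.17.0.5:8080 (PID 1240 here) hit a high-traffic burst, I saw these retransmissions. Keep in mind that the PID column shows whichever process was on-CPU when the retransmit fired, so the address and port are the more reliable way to identify the container. It wasn't the container crashing; it was the kernel struggling to pass packets from the host bridge to the container's network namespace.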
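When the trace runs for a while, a per-endpoint tally is easier to read than the raw lines. The sketch below counts retransmissions by local address and port from saved tcpretrans output; count_by_local is a hypothetical helper name, and it assumes the local endpoint is the fourth whitespace-separated field, as in the sample above.

```shell
# Count retransmission events per local address:port.
# count_by_local is a hypothetical helper; it reads files given as
# arguments, or stdin when none are given.
count_by_local() {
  awk '
    # Only lines that start with an HH:MM:SS timestamp are events
    $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ { n[$4]++ }
    END { for (k in n) printf "%d %s\n", n[k], k }
  ' "$@" | sort -rn
}

# Example using the two events shown above
count_by_local <<'EOF'
TIME     PID    IP LADDR:LPORT          RADDR:RPORT            STATE
14:02:11 1240   4  172.17.0.5:8080      10.0.0.15:45212        ESTABLISHED
14:02:12 1240   4  172.17.0.5:8080      10.0.0.15:45212        ESTABLISHED
EOF
# prints: 2 172.17.0.5:8080
```

With several containers on the same bridge, the endpoint at the top of the list is the first place to look.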

Moving beyond the symptoms

After identifying the retransmissions, I realized I needed a better approach to manage the networking overhead. I revisited the setup I described in "Blue-Green Deployment for VPS: Managing Traffic with Traefik", as the proxy layer was adding extra hops.

While eBPF showed me that packets were dropping, it didn't tell me why. The cause turned out to be my conntrack table hitting its limit. Netfilter tracks every connection that passes through Docker's NAT rules, and once the table reaches nf_conntrack_max, the kernel drops packets that would create a new entry and logs "nf_conntrack: table full, dropping packet".
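A simple way to confirm this diagnosis is to compare the number of tracked connections with the configured maximum. The sketch below reads both values from /proc/sys/net/netfilter; conntrack_usage is a hypothetical helper name, and the files only exist when the nf_conntrack module is loaded.

```shell
# Show how full the conntrack table is, as a percentage of its maximum.
# conntrack_usage is a hypothetical helper; the files are only present
# when the nf_conntrack module is loaded.
conntrack_usage() {
  local dir="${1:-/proc/sys/net/netfilter}"
  local count max
  count=$(cat "$dir/nf_conntrack_count" 2>/dev/null) || return 1
  max=$(cat "$dir/nf_conntrack_max" 2>/dev/null) || return 1
  echo "count=$count max=$max used=$(( 100 * count / max ))%"
}

conntrack_usage || echo "nf_conntrack counters not found (module not loaded?)"
```

A reading that sits near 100% during traffic bursts suggests the table is the bottleneck.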

I increased the limit by updating the sysctl settings:


Bash
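# Check the current limit
sysctl net.netfilter.nf_conntrack_max

# Raise the limit to 262144 (takes effect immediately, but is lost on reboot)
sudo sysctl -w net.netfilter.nf_conntrack_max=262144

# To persist it, put the same key=value line in a file under /etc/sysctl.d/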

This simple change cut my retransmission rate by about 70%. It’s a classic example of why you should check the host kernel settings before blaming your containerized application. If you’re also dealing with noisy neighbors, the setup described in "Linux Performance: Cgroups v2 and Systemd Slices for VPS" can help isolate resources, as resource contention often bleeds over into network performance.

What I'm still watching

eBPF is powerful, but it’s not a magic wand. I still have to be careful about which hooks I attach to a high-traffic production system. While tcpretrans is lightweight, running multiple complex eBPF programs can still impact performance if they aren't written carefully.

Next time, I plan to try dropwatch to get more granular data on which kernel function is dropping the packets. The goal is to move from reactive debugging to proactive observability, perhaps with tooling like that covered in "Kubernetes Network Policies Debugging with Cilium Hubble" if I ever decide to move these workloads to a managed cluster.

For now, tcpretrans has saved me hours of head-scratching. If your Docker networking stack feels sluggish, don't just restart your containers. Use eBPF to see what the kernel is actually doing.

Frequently Asked Questions

Q: Does running eBPF tools slow down my server? A: Generally, not noticeably. eBPF programs are verified, JIT-compiled, and run in the kernel only when their hook fires, so tracing a rare event like a retransmission is far less invasive than capturing every packet with tcpdump or Wireshark and copying it to user space.

Q: Is tcpretrans enough to solve all network issues? A: It’s great for identifying that packet loss is happening, but it won't always tell you why. It’s a diagnostic tool, not a solution. You'll still need to investigate buffer sizes, connection limits, and application-level bottlenecks.

Q: Can I use this on non-Docker systems? A: Absolutely. Since eBPF hooks into the Linux kernel directly, these tools work regardless of whether you're using Docker, Podman, or bare-metal processes.
