How to make sure Jaeger works well for your tracing needs

Where this fits in K8s strategy

Jaeger displays “tracing data” for distributed services. It highlights errors and the risk of downtime or slow loading.

Why it’s important

It helps with the tough task of tracking down issues across tens of services that may each have many sub-services.


Let’s explore the basics before getting into the tactics for boosting Jaeger.

Origin story

Jaeger was born in 2015 within the walls of Uber. Yes, that Uber. Yuri Shkuro created it to help engineers work out where issues were popping up.

This was important because Uber had a complex network of services. Many of these depended on other services as well as their own sub-services.

To the left: a glimpse of the network of services that drives the Uber app. A large number of these services get triggered every time you request an Uber. (Source: YouTube, Jaeger Intro – Yuri Shkuro)

The chances of the whole request falling apart were high: Uber risked losing a ride fare if any of the component services failed or slowed down.

“In deep distributed systems, finding what is broken and where is often more difficult than why.”

— Yuri Shkuro, Founder & Maintainer, CNCF Jaeger

Jaeger helps us find out which services are experiencing issues and where. That’s useful to know. It can help engineers fix small issues before they snowball into serious ones.


Do you even need Jaeger?

You might be wondering whether you even need Jaeger. Your use case might not be as complex as Uber’s, with its sprawling web of services and millions of requests per day.

Tracing is not an absolute must-have for simpler service setups. But it is handy for finding bottlenecks if you run more than a handful of services.

Also, imagine this situation. Your application suddenly gets a traffic spike and requests are not completing. How will you find the culprit fast enough to fix the issue?

One more stop and then we’ll start to cover how to optimise Jaeger for tracing.


How Jaeger works

See a simplified view of how Jaeger works below.

We’ll cover the highlighted terms in greater depth in the tactics section. A short instrumentation sketch after the three steps shows where this data comes from in practice.

Step 1

Jaeger Agent collects “span data” that instrumented services emit over UDP

Step 2

The Agent forwards the data (service name, start time, duration) on to the Collector

Step 3

The Collector writes the data to a storage backend, which is then queried for analytics and the visual dashboard (Jaeger UI)
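
To see where that span data comes from in practice, here is a minimal sketch, assuming the jaeger_client Python SDK (the OpenTracing-based client; newer setups tend to use the OpenTelemetry SDKs instead). The service name, span name and tag are placeholders, not anything Jaeger requires.

```python
# Minimal sketch of Step 1, assuming the jaeger_client Python SDK and a
# hypothetical service called "checkout-service". The SDK buffers span data
# and reports it to the Jaeger Agent over UDP (port 6831 by default).
from jaeger_client import Config

config = Config(
    config={
        "sampler": {"type": "const", "param": 1},   # record every trace (fine for a demo)
        "local_agent": {"reporting_host": "localhost", "reporting_port": 6831},
        "logging": True,
    },
    service_name="checkout-service",
    validate=True,
)
tracer = config.initialize_tracer()

# Each unit of work becomes a "span": service name, start time and duration
# are captured automatically; tags add extra context.
with tracer.start_span("process-order") as span:
    span.set_tag("order.id", "12345")
    # ... call downstream services here ...

tracer.close()  # flush any buffered spans before the process exits
```

From here, the span data follows Steps 1 to 3 above with no extra work in the service code.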


Now, let’s make Jaeger work well for your setup

Tailor Jaeger to your specific tracing needs with these tactics:

TACTIC #1 Two ways to install Jaeger Agent

Jaeger Agent can run in two distinct ways: as a DaemonSet or as a sidecar.

Let’s compare them, then look at a short sketch of how a service reaches the Agent in each mode:

Set up Jaeger Agent as a DaemonSet

Mechanism: Jaeger Agent runs as one pod per node and collects data from all the other pods on that node

Useful for: single tenant or non-production clusters

Benefits: lower memory overhead, simpler setup

Risk: security risk if deployed on multi-tenant cluster

Set up Jaeger Agent as a sidecar

Mechanism: Jaeger Agent runs as a container alongside the service container within every pod

Useful for: multi-tenant clusters, public cloud clusters

Benefits: granular control, higher security potential

Risk: more DevOps supervision required
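
To make the difference concrete, here is a minimal sketch of how a service might reach the Agent in each mode, again assuming the jaeger_client Python SDK. The HOST_IP variable is an assumption: you would inject it yourself via the Kubernetes downward API (fieldRef: status.hostIP); Kubernetes does not set it automatically.

```python
# Minimal sketch: pointing the reporter at the Agent in each deployment mode.
# HOST_IP is an assumed env var injected via the Kubernetes downward API.
import os

from jaeger_client import Config

# DaemonSet mode: one Agent pod per node, so spans go to the node's IP.
# Sidecar mode: the Agent container shares the pod, so localhost is enough.
agent_host = os.environ.get("HOST_IP", "localhost")

config = Config(
    config={
        "local_agent": {"reporting_host": agent_host, "reporting_port": 6831},
    },
    service_name="checkout-service",   # hypothetical service name
    validate=True,
)
tracer = config.initialize_tracer()
```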

TACTIC #2 Pick the right sampling method

Jaeger does not keep every trace: it samples a portion of the span data that services transmit over UDP.

There are 2 sampling methods – each has its own benefits and downsides. Let’s explore:

Head-based sampling

Also known as: upfront sampling

Mechanism: sampling decision is made up front, before the request completes

Useful for: high-throughput use cases, looking at aggregated data

Benefits: cheaper sampling method – lower network and storage overhead

Risk: potential to miss outlier requests due to less than 100% sampling

Work required: easy setup, supported by Jaeger SDKs

Config notes: sampling based on a flip of a coin (probabilistic) or up to a fixed rate per second (rate limiting); see the sketch below
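
As a concrete illustration, here is how those options map onto sampler settings in the jaeger_client Python SDK. The rates and the service name are placeholder values, not recommendations.

```python
# Hedged sketch of head-based sampler options in the jaeger_client Python SDK.
# Pick ONE of the configs below; the rates are illustrative only.
from jaeger_client import Config

flip_of_coin = {"type": "probabilistic", "param": 0.01}  # keep roughly 1% of traces
fixed_rate = {"type": "ratelimiting", "param": 100}      # keep up to ~100 traces per second
record_all = {"type": "const", "param": 1}               # keep everything (dev/test only)

tracer = Config(
    config={"sampler": flip_of_coin},
    service_name="checkout-service",   # hypothetical service name
    validate=True,
).initialize_tracer()
```

Jaeger also supports remote sampling, where strategies are served to the SDK centrally from the Agent and Collector; that is beyond this sketch.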

Tail-based sampling

Also known as: response sampling

Mechanism: sampling decision is made after the request has been completed

Useful for: catching anomalies in latency, failed requests

Benefits: smarter selection of traces – keeps the slow or failed requests that matter most

Risk: temporary storage for all traces – more infra overhead, single node only

Work required: extra work – connect to a tool that supports tail-based sampling like Lightstep

Config notes: sampling based on latency criteria and tags (sketched below)
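
To make the mechanism clearer, here is a toy sketch of the tail-based idea in plain Python. It is not how Jaeger, Lightstep or any real collector implements it; the latency threshold, span fields and export() helper are all invented for illustration.

```python
# Toy sketch of tail-based sampling: buffer every span of a trace, then decide
# AFTER the trace completes whether to keep it, based on latency and error tags.
from collections import defaultdict

LATENCY_THRESHOLD_MS = 500          # assumed "keep if slower than this" criterion
buffered = defaultdict(list)        # trace_id -> spans held in temporary storage


def record_span(span: dict) -> None:
    """Every span is held in memory until its trace finishes (the infra overhead)."""
    buffered[span["trace_id"]].append(span)


def on_trace_complete(trace_id: str) -> None:
    """The sampling decision happens here, after the request has completed."""
    spans = buffered.pop(trace_id, [])
    total_ms = sum(s["duration_ms"] for s in spans)
    has_error = any(s.get("error") for s in spans)
    if has_error or total_ms > LATENCY_THRESHOLD_MS:
        export(spans)               # keep the anomaly
    # otherwise the whole trace is silently dropped


def export(spans: list) -> None:
    """Hypothetical placeholder for shipping a kept trace to storage."""
    print(f"keeping trace {spans[0]['trace_id']} ({len(spans)} spans)")
```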

TACTIC #3 Prevent the Collector from getting clogged

Jaeger’s Collector holds data temporarily before writing it to a database. This database is then queried by the visual UI.

But a problem can arise: the Collector can get clogged if the database can’t absorb writes fast enough.

Problem

  • Collector’s temp storage model becomes problematic when traffic spikes
  • Some data gets dropped so the collector can stay afloat amid the flood of incoming request data
  • Your tracing may look patchy in areas because of the gaps in sampling data
  • Risk of missing failed or problematic requests if they were in the sampling that gets dropped

Solution

  • Consider an asynchronous span ingestion technique to solve this problem (a toy sketch of the idea follows this list)
  • This means adding a few components between your collector and database:
    • Apache Kafka – real-time data streaming at scale
    • Apache Flink – processes Kafka data asynchronously
    • 2 jaeger components – jaeger-ingester and jaeger-indexer – push Flink output to storage
  • The collector does not get overloaded and tempted to dump data once these are in the mix
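
Here is a toy sketch of the idea behind that pipeline, written with the kafka-python library. In a real deployment Jaeger's own jaeger-ingester fills this role; the topic name, broker address, JSON encoding and write_to_storage() helper are assumptions made to keep the example short.

```python
# Toy sketch only: Jaeger's own jaeger-ingester performs this role in production.
# Assumes a Kafka topic the Collector publishes spans to and JSON-encoded spans;
# both are simplifications made for this example.
import json

from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "jaeger-spans",                      # assumed topic name
    bootstrap_servers=["kafka:9092"],    # assumed broker address
    group_id="span-ingester",
    value_deserializer=lambda raw: json.loads(raw),
)


def write_to_storage(span: dict) -> None:
    """Hypothetical placeholder for a (batched) write to Elasticsearch or Cassandra."""
    ...


# The consumer drains spans at its own pace. Kafka absorbs traffic spikes,
# so the Collector never has to drop data while the database catches up.
for message in consumer:
    write_to_storage(message.value)
```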

How to implement

Remember when you first heard Kubernetes terms like nodes, pods, sidecars, multitenant? Much confusion.

Same story with Apache Kafka and Flink. Lots of new jargon to learn that is beyond our high-level scope here.

But these links – accessed in order – might help you get started with your implementation:
