Over the past few years, large language models (LLMs) have moved from research labs into the heart of enterprise AI strategies. Organizations are increasingly relying on these models to automate tasks like customer support, document processing, code generation, and knowledge retrieval. Their growing role in business-critical applications has made it essential to deploy them not just efficiently, but reliably and with full transparency into their runtime behavior.
LLM workloads present unique operational challenges, and observability for them isn't optional: it's the foundation for managing reliability, cost, safety, and continuous improvement.
Red Hat OpenShift AI provides a strong foundation to run these workloads at scale, offering a flexible MLOps platform that integrates well with modern AI pipelines. Dynatrace brings a powerful suite of observability capabilities: end-to-end tracing, smart anomaly detection, and deep visualization, all of which are critical when dealing with the unpredictable nature of LLM inference workloads.
Leveraging vLLM for efficient inference
One LLM runtime gaining traction is vLLM, a high-performance inference engine designed to serve transformer-based models with optimized memory usage and throughput. It's lightweight, scalable, and increasingly the runtime of choice for real-time generative AI use cases.
In this article, we’ll walk through how to integrate Dynatrace with vLLM deployments running on OpenShift AI using OpenTelemetry collectors to extract relevant metrics and create dashboards that bring model behavior to life. Whether you’re an MLOps engineer, a platform architect, or simply responsible for productionizing AI, this guide aims to give you a clear, hands-on blueprint for observability done right.
View the GitHub project: RHEcosystemAppEng/dynatrace
Why observability for LLMs matters
Deploying large language models (LLMs) into production isn't just about getting responses—it's about ensuring consistency, performance, and trustworthiness. From tracking GPU utilization to understanding token-level throughput and managing inference latency, there's a wide surface area of metrics that need to be captured, contextualized, and acted upon. Without a clear view into these workloads, it's nearly impossible to operate them with confidence or optimize them for cost and performance.
What makes LLMs harder to monitor?
LLMs are stateful and resource-heavy. Each request can involve thousands of tokens, significant GPU compute, and memory usage that shifts with prompt complexity. Traditional CPU and memory metrics don’t capture this behavior.
Their outputs are probabilistic—the same prompt can yield different responses. This makes debugging difficult, especially when hallucinations or quality issues occur despite healthy system metrics.
LLMs are also sensitive to prompt phrasing and can combine external knowledge sources. As user inputs or retrieved data change, response quality can silently degrade.
Finally, LLM applications rely on complex pipelines involving prompt templates, retrieval systems, tool use, and chaining. Failures can happen anywhere, and observability is key to tracing what went wrong.
What you should be observing
Some of the most important signals to capture when monitoring LLMs include:
- Token-level metrics: How many tokens are being processed per request? Are you hitting a context length limit or exhausting the quota unexpectedly?
- Inference latency: What’s the average latency per prompt? Are there tail latencies that affect user experience?
- GPU utilization: Is your hardware actually being used efficiently, or are you paying for idle cycles?
- Throughput and queueing: Are inference requests stacking up, or are they flowing through as expected?
- Failure modes: Are users getting partial outputs? Are timeouts happening silently?
- Output quality and safety: How do you detect hallucinations, nonsensical outputs, or drift? How are you monitoring for toxicity or prompt injection attempts that could compromise safety and trust?
Where observability solves real LLM problems
Observability helps optimize performance, control costs, and ensure responsible AI. It reveals how model settings and infrastructure impact throughput, tracks token usage to manage budget, and provides the visibility needed to monitor fairness, safety, and compliance.
A chatbot returning incomplete responses may be hitting max token limits, a problem that’s invisible without token-level metrics. If GPU utilization stays low after scaling across nodes, it could indicate suboptimal batch sizes or a mismatch between model size and hardware. Observability helps pinpoint these issues quickly, turning trial-and-error tuning into data-driven optimization.
Sustainability is also emerging as a key metric. With high GPU energy use, tracking carbon impact per inference is quickly becoming essential, especially in ESG-conscious or regulated environments.
Overview of the architecture
Figure 1 shows vLLM deployed via KServe with an OpenTelemetry Collector sidecar. The collector receives telemetry data from vLLM, processes it, and sends it to Dynatrace. The collector can also be configured to fork the data to local observability stores deployed in the same cluster, but that setup is beyond the scope of this post; here we focus on sending the telemetry data to the Dynatrace platform.

Environment setup
This section outlines the prerequisites and steps required to configure observability for a model served on OpenShift using Dynatrace and OpenTelemetry.
Prerequisites
- A model deployed and served using KServe.
- The Red Hat build of OpenTelemetry Operator installed (how to install); a minimal installation sketch follows this list.
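For convenience, here is a minimal, hedged sketch of installing the operator through Operator Lifecycle Manager. The package, channel, and namespace values below are assumptions that can vary between OpenShift versions, so confirm them in OperatorHub or the linked installation documentation before applying:

# Hedged OLM install sketch for the Red Hat build of OpenTelemetry.
# Package/channel/namespace names are assumptions; verify them against
# OperatorHub or the official installation docs.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-opentelemetry-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-opentelemetry-operator
  namespace: openshift-opentelemetry-operator
spec: {}                               # empty spec = watch all namespaces
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: opentelemetry-product
  namespace: openshift-opentelemetry-operator
spec:
  channel: stable
  name: opentelemetry-product          # assumed package name
  source: redhat-operators
  sourceNamespace: openshift-marketplace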
Step-by-step installation and configuration
Obtain the Dynatrace endpoint and access token:
- Log in to your Dynatrace dashboard.
- Go to Settings and search for Access Tokens.
- Ensure Personal Access Token is enabled.
Important note: Disable the Dynatrace API Token in the new format if required by your setup.
- From the menu (three dots on the right-hand side), select API and choose Create a New Token.
- In the token configuration:
  - Click Go to Access Token in the Dynatrace menu.
  - Select Generate New Token (Figure 2).
  - Enable the ingest scopes for metrics, logs, and traces, plus any other relevant scopes.
  - Generate the token and store it securely, as you will not be able to view it again.

OpenTelemetry Collector configuration
The following is a sample OpenTelemetryCollector custom resource (CR) configured to collect and forward metrics from a KServe-deployed model to Dynatrace. The collector is deployed as a sidecar and scrapes vLLM's Prometheus metrics endpoint. Dynatrace requires metrics in delta temporality, so the configuration uses the cumulativetodelta processor to convert the scraped Prometheus metrics from cumulative to delta temporality.
This configuration enables real-time observability of model performance directly in the Dynatrace dashboard.
High-level architecture
- Each model is deployed using KServe's InferenceService.
- A sidecar OpenTelemetryCollector is injected per model deployment.
- The sidecar scrapes Prometheus metrics from the vLLM process inside the pod (e.g., localhost:8000).
- It processes the data to match Dynatrace's requirements and forwards the metrics using OTLP over HTTP.
- This architecture enables real-time monitoring of token generation, latency, throughput, and more—within Dynatrace dashboards.
OpenTelemetry Collector as sidecar
Each model has its own collector instance, injected via the sidecar.opentelemetry.io/inject annotation. The collector is configured in sidecar mode, which ensures it runs in the same pod as the model server.
Key collector features include:
- Receivers:
  - prometheus: Scrapes the model's Prometheus metrics endpoint (localhost:<port>/metrics).
  - otlp: Listens for telemetry on ports 4317 (gRPC) and 4318 (HTTP).
- Processors:
  - memory_limiter: Prevents out-of-memory errors.
  - cumulativetodelta: Required by Dynatrace to convert Prometheus metrics from cumulative to delta temporality.
  - batch: Batches exported data to reduce network overhead.
- Exporters:
  - otlphttp/dynatrace: Sends metrics securely to Dynatrace using environment variables injected from a Kubernetes Secret.
  - debug: (Optional) Logs received telemetry data for troubleshooting.
The collector's environment is securely configured via the following Kubernetes Secret. Create it in the same namespace as the model deployment so the sidecar can reference it:
apiVersion: v1
kind: Secret
metadata:
  name: dynatrace-otc-secret
type: Opaque
stringData:
  endpoint: https://<your-env>.live.dynatrace.com/api/v2/otlp
  apiToken: dt0c01.YOUR_TOKEN_HERE
Deploying the sidecar with each model
The InferenceService resource includes an annotation to inject the sidecar per model. The annotation has to be added to the pod annotations. Here's a Helm-templated example:
annotations:
  sidecar.opentelemetry.io/inject: <model-name>-otelsidecar
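For orientation, here is a minimal, hypothetical InferenceService sketch showing one place the annotation can land. Exactly where pod annotations are set depends on your KServe version and Helm chart (for example, predictor-level annotations versus InferenceService metadata annotations that get propagated to the pods), so treat the placement and placeholder values below as assumptions and adapt them to your deployment:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: <model-name>
spec:
  predictor:
    # Assumption: this KServe version supports predictor-level pod
    # annotations; otherwise set the annotation wherever your Helm
    # chart exposes pod annotations for the predictor pods.
    annotations:
      sidecar.opentelemetry.io/inject: <model-name>-otelsidecar
    model:
      modelFormat:
        name: vLLM                      # as declared by your vLLM ServingRuntime
      runtime: <vllm-serving-runtime>   # hypothetical runtime name
      storageUri: <model-storage-uri>   # e.g., a PVC or S3 location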
This triggers sidecar injection with the following OpenTelemetryCollector definition:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otelsidecar
spec:
  mode: sidecar
  env:
    - name: DYNATRACE_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: dynatrace-otc-secret
          key: endpoint
    - name: DYNATRACE_API_TOKEN
      valueFrom:
        secretKeyRef:
          name: dynatrace-otc-secret
          key: apiToken
  config:
    exporters:
      debug:
        verbosity: detailed
      otlphttp/dynatrace:
        endpoint: ${DYNATRACE_ENDPOINT}
        headers:
          Authorization: Api-Token ${DYNATRACE_API_TOKEN}
    processors:
      batch:
        send_batch_size: 100
        timeout: 1s
      cumulativetodelta: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 95
        spike_limit_percentage: 25
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: '0.0.0.0:4317'
          http:
            endpoint: '0.0.0.0:4318'
      prometheus:
        config:
          scrape_configs:
            - job_name: kserve
              scrape_interval: 5s
              static_configs:
                - targets:
                    - 'localhost:8000'
    service:
      pipelines:
        metrics:
          exporters:
            - otlphttp/dynatrace
            - debug
          processors:
            - memory_limiter
            - cumulativetodelta
            - batch
          receivers:
            - prometheus
With this setup, each vLLM model deployed via KServe will automatically send detailed performance metrics to Dynatrace, helping engineering and DevOps teams monitor model latency, token throughput, and more—all within their existing observability stack.
The full working configuration can be found in the GitHub repository.
Instrumenting vLLM workloads
The vLLM server provides out-of-the-box metrics that give insight into the model's health and performance. Distributed tracing instrumentation is also available, but it has to be installed and enabled explicitly. Metrics are exposed in the Prometheus format on the /metrics endpoint, and trace data is emitted using the OpenTelemetry protocol (OTLP). Both can be collected with the OpenTelemetry Collector; the exact collector configuration was covered in the previous section. For additional configuration information, refer to the Red Hat build of OpenTelemetry documentation.
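If you want traces in addition to metrics, vLLM's tracing has to be switched on at the server level. The following is a minimal, hypothetical excerpt showing one way to point vLLM at the sidecar collector, assuming your vLLM build supports the --otlp-traces-endpoint argument and the serving image includes the OpenTelemetry packages it requires; verify both against your vLLM release and its documentation before relying on this:

# Hypothetical excerpt from an InferenceService predictor definition.
# Assumes the vLLM server supports --otlp-traces-endpoint and the image
# ships the OpenTelemetry SDK/exporter packages it needs.
spec:
  predictor:
    model:
      args:
        - --otlp-traces-endpoint=grpc://localhost:4317   # sidecar collector's OTLP gRPC port
      env:
        - name: OTEL_SERVICE_NAME
          value: <model-name>                            # service name shown on traces (assumption)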
Creating dashboards
To demonstrate how OpenTelemetry and Dynatrace can be used together for deep visibility into LLM inference workloads, let’s walk through a sample dashboard built for vLLM running on OpenShift AI.
The following dashboard captures a holistic view of the system’s performance, from infrastructure-level GPU metrics to application-level token throughput, and ties it all together through actionable visualizations.
Top-level metrics
These metrics offer an at-a-glance understanding of the system's overall performance and cost implications.
Operational overview
At the top, the dashboard highlights key metrics that operations teams care about:
- Estimated inference cost (USD): A running estimate of total compute cost over time, giving you a financial lens into your AI workloads.
- # of running requests and waiting requests: Helps track current demand and identify if requests are queuing up due to under-provisioned resources or inference slowness.
- Request swap rate: Indicates how often the system is rotating or evicting queued requests—useful for spotting saturation patterns.
- Average inference time: A high-level latency indicator that shows how long typical prompts take from submission to completion.
Service health and performance
This section dives deeper into core LLM performance indicators:
- Throughput (tokens/second): Measures how many tokens are being processed per second. It’s a key performance indicator for the vLLM engine and can help detect bottlenecks.
- Time to first token: Shows the delay between sending a prompt and receiving the first response token. High values here could indicate backend inefficiencies or cold starts.
- GPU usage: Tracks how effectively your GPU cache is being used. Flat or underutilized trends can help trigger cost optimization decisions.
- Prompt token count over time: Visualizes the size of user prompts being served. Sudden spikes can point to atypical usage or prompt injection attacks.
- E2E request latency: This end-to-end view tracks the full life cycle latency of requests, from API ingress to token generation completion.
Token dynamics and traffic analysis
The second half of the dashboard focuses on token-level trends and traffic patterns, as illustrated in Figures 3 and 4 (a hedged mapping from the dashboard tiles to the underlying vLLM metric names follows this list):
- Completion tokens over time: Mirrors the prompt token graph, giving you visibility into how long the model's completions typically are.
- Total inference requests: Offers volume analytics over time, helping detect bursty traffic patterns, peak usage windows, or periods of inactivity.
- Trace view: This unique hex view aggregates trace data across providers (e.g., OpenAI, VertexAI, Amazon, Ollama) to help identify workload distribution across model sources, anomalies, or platform-specific degradation.
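To make these panels easier to reproduce, here is a hedged mapping from the dashboard tiles discussed above to the vLLM Prometheus metric names that typically back them. Metric names and labels can vary across vLLM versions, and the cost estimate is a derived value rather than a single exported metric, so treat this as a starting point and check it against your own /metrics output:

# Hedged tile-to-metric mapping; verify names against your vLLM version.
running_requests:      vllm:num_requests_running
waiting_requests:      vllm:num_requests_waiting
request_swap_rate:     vllm:num_requests_swapped          # may be absent in newer vLLM releases
time_to_first_token:   vllm:time_to_first_token_seconds   # histogram
e2e_request_latency:   vllm:e2e_request_latency_seconds   # histogram
gpu_cache_usage:       vllm:gpu_cache_usage_perc
prompt_tokens:         vllm:prompt_tokens_total
completion_tokens:     vllm:generation_tokens_total
total_requests:        vllm:request_success_total         # plus failure counters, if exposed
# Throughput (tokens/s) is typically computed as the rate of the token counters above.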


Conclusion
As generative AI moves from experimental to essential, enterprises can no longer afford to treat observability as an afterthought. LLMs like those served through vLLM introduce unique operational challenges, from unpredictable token usage to variable GPU workloads and inference latency, that require purpose-built visibility solutions.
By combining OpenShift AI’s scalable MLOps platform with the extensibility of OpenTelemetry and the intelligent analytics capabilities of Dynatrace, teams can gain comprehensive insight into their model performance, resource utilization, and user experience. The approach outlined in this article doesn’t just help you monitor your models; it empowers you to operate them reliably, optimize them continuously, and ensure they meet business SLAs at scale.
This is just the beginning. As AI adoption grows and workloads become more distributed, real-time, and cost-sensitive, observability will become a competitive advantage. Investing in the right architecture now ensures you’re ready for what comes next.
Ready to explore these new features? Visit redhat.com/observability and the documentation pages to learn more and get started with the latest observability tools in OpenShift. Red Hat Developer's observability topic page also contains information to help you learn about and implement observability capabilities.
We value your feedback! Share your thoughts and suggestions using the Red Hat OpenShift feedback form.
Last updated: May 28, 2025