Implement LLM observability with Dynatrace on OpenShift AI

May 21, 2025
Pavol Loffay Sally O'Malley Twinkll Sisodia
Related topics:
Artificial intelligence, Kubernetes, Observability
Related products:
Red Hat AI, Red Hat OpenShift AI, Red Hat OpenShift


    Over the past few years, large language models (LLMs) have moved from research labs into the heart of enterprise AI strategies. Organizations are increasingly relying on these models to automate tasks like customer support, document processing, code generation, and knowledge retrieval. Their growing role in business-critical applications has made it essential to deploy them not just efficiently, but reliably and with full transparency into their runtime behavior. 

    Observability for LLMs isn't optional; it's the foundation for managing reliability, cost, safety, and continuous improvement. And LLM workloads present unique challenges that traditional monitoring wasn't built to handle.

    Red Hat OpenShift AI provides a strong foundation to run these workloads at scale, offering a flexible MLOps platform that integrates well with modern AI pipelines. Dynatrace brings a powerful suite of observability capabilities: end-to-end tracing, smart anomaly detection, and deep visualization, all of which are critical when dealing with the unpredictable nature of LLM inference workloads.

    Leveraging vLLM for efficient inference

    One LLM runtime gaining particular traction is vLLM, an efficient, high-performance inference engine designed to serve transformer-based models with optimized memory usage and throughput. It's lightweight, scalable, and increasingly the runtime of choice for real-time generative AI use cases.

    In this article, we’ll walk through how to integrate Dynatrace with vLLM deployments running on OpenShift AI using OpenTelemetry collectors to extract relevant metrics and create dashboards that bring model behavior to life. Whether you’re an MLOps engineer, a platform architect, or simply responsible for productionizing AI, this guide aims to give you a clear, hands-on blueprint for observability done right.

    View the GitHub project: RHEcosystemAppEng/dynatrace

    Why observability for LLMs matters

    Deploying large language models (LLMs) into production isn't just about getting responses—it's about ensuring consistency, performance, and trustworthiness. From tracking GPU utilization to understanding token-level throughput and managing inference latency, there's a wide surface area of metrics that need to be captured, contextualized, and acted upon. Without a clear view into these workloads, it's nearly impossible to operate them with confidence or optimize them for cost and performance.

    What makes LLMs harder to monitor?

    LLMs are stateful and resource-heavy. Each request can involve thousands of tokens, significant GPU compute, and memory usage that shifts with prompt complexity. Traditional CPU and memory metrics don’t capture this behavior.

    Their outputs are probabilistic—the same prompt can yield different responses. This makes debugging difficult, especially when hallucinations or quality issues occur despite healthy system metrics.

    LLMs are also sensitive to prompt phrasing and can combine external knowledge sources. As user inputs or retrieved data change, response quality can silently degrade.

    Finally, LLM applications rely on complex pipelines involving prompt templates, retrieval systems, tool use, and chaining. Failures can happen anywhere, and observability is key to tracing what went wrong.

    What you should be observing

    Some of the most important signals to capture when monitoring LLMs include:

    • Token-level metrics: How many tokens are being processed per request? Are you hitting a context length limit or exhausting the quota unexpectedly?
    • Inference latency: What’s the average latency per prompt? Are there tail latencies that affect user experience?
    • GPU utilization: Is your hardware actually being used efficiently, or are you paying for idle cycles?
    • Throughput and queueing: Are inference requests stacking up, or are they flowing through as expected?
    • Failure modes: Are users getting partial outputs? Are timeouts happening silently?
    • Output quality and safety: How do you detect hallucinations, nonsensical outputs, or drift? How are you monitoring for toxicity or prompt injection attempts that could compromise safety and trust?

    Where observability solves real LLM problems

    Observability helps optimize performance, control costs, and ensure responsible AI. It reveals how model settings and infrastructure impact throughput, tracks token usage to manage budget, and provides the visibility needed to monitor fairness, safety, and compliance.

    A chatbot returning incomplete responses may be hitting max token limits, a problem that’s invisible without token-level metrics. If GPU utilization stays low after scaling across nodes, it could indicate suboptimal batch sizes or a mismatch between model size and hardware. Observability helps pinpoint these issues quickly, turning trial-and-error tuning into data-driven optimization.

    Sustainability is also emerging as a key metric. With high GPU energy use, tracking carbon impact per inference is quickly becoming essential, especially in ESG-conscious or regulated environments.

    Overview of the architecture

    Figure 1 shows vLLM deployed via KServe with an OpenTelemetry collector sidecar. The collector receives telemetry data from vLLM, processes it, and sends it to Dynatrace. The collector can additionally be configured to fork the data and send it to local observability stores deployed in the same cluster; however, that is not covered in this post. Here, we focus on sending the telemetry data to the Dynatrace platform.

    Figure 1: vLLM deployed via KServe with the OpenTelemetry collector sidecar.

    Environment setup

    This section outlines the prerequisites and steps required to configure observability for a model served on OpenShift using Dynatrace and OpenTelemetry.

    Prerequisites

    • A model deployed and served using KServe.
    • The Red Hat build of OpenTelemetry Operator installed (how to install).

    Step-by-step installation and configuration

    1. Obtain Dynatrace endpoint and access token:

      1. Log in to your Dynatrace dashboard.
      2. Go to Settings and search for Access Tokens.
      3. Ensure Personal Access Token is enabled.

      Important note:

      Disable the Dynatrace API Token in the new format if required by your setup.

    2. From the menu (three dots on the right-hand side), select API and choose Create a New Token.
    3. In the token configuration:
      1. Click on Go to Access Token in the Dynatrace menu.
      2. Select Generate New Token (Figure 2).
      3. Enable scopes for metrics, logs, traces, and all relevant ingest scopes.
    4. Generate and securely store the token, as you will not be able to access it again.
    Figure 2: Access tokens page in Dynatrace.

    OpenTelemetry Collector configuration

    The following is a sample OpenTelemetryCollector custom resource (CR) configured to collect and forward metrics from a KServe-deployed model to Dynatrace. The collector is deployed as a sidecar and scrapes vLLM's Prometheus metrics endpoint. Dynatrace requires metrics in delta temporality, so the configuration uses the cumulativetodelta processor to transform the scraped Prometheus metrics from cumulative to delta temporality.

    This configuration enables real-time observability of model performance directly in the Dynatrace dashboard.

    High-level architecture

    • Each model is deployed using KServe's InferenceService.
    • A sidecar OpenTelemetryCollector is injected per model deployment.
    • The sidecar scrapes Prometheus metrics from the vLLM process inside the pod (e.g., localhost:8000).
    • It processes the data to match Dynatrace's requirements and forwards the metrics using OTLP over HTTP.
    • This architecture enables real-time monitoring of token generation, latency, throughput, and more—within Dynatrace dashboards.

    OpenTelemetry Collector as sidecar

    Each model has its own collector instance, injected via the sidecar.opentelemetry.io/inject annotation. The collector is configured in sidecar mode, which ensures it runs in the same pod as the model server.

    Key collector features include:

    • Receivers:
      • prometheus: Scrapes from the model’s Prometheus metrics endpoint (localhost:<port>/metrics).
      • otlp: Listens for telemetry on 4317 (gRPC) and 4318 (HTTP).
    • Processors:
      • memory_limiter: Prevents out-of-memory errors.
      • cumulativetodelta: Required by Dynatrace to convert Prometheus metrics from cumulative to delta temporality.
      • batch: Batches exported data to reduce network overhead.
    • Exporters:
      • otlphttp/dynatrace: Sends metrics securely to Dynatrace using environment variables injected from a Kubernetes Secret.
      • debug: (Optional) Logs received telemetry data for troubleshooting.

    The collector’s environment is securely configured via this Kubernetes secret:

    apiVersion: v1
    kind: Secret
    metadata:
      name: dynatrace-otc-secret
    type: Opaque
    stringData:
      endpoint: https://<your-env>.live.dynatrace.com/api/v2/otlp
      apiToken: dt0c01.YOUR_TOKEN_HERE

    Deploying the sidecar with each model

    The InferenceService resource includes an annotation that injects the sidecar for each model. The annotation must be added to the pod annotations. Here's a Helm-templated example, followed by a fuller sketch of where it sits on the InferenceService:

    annotations:
      sidecar.opentelemetry.io/inject: <model-name>-otelsidecar
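
    To make the placement concrete, here is a hedged sketch of a full InferenceService carrying that annotation. Treat the field paths as illustrative: KServe versions differ in how component-level annotations are propagated to the pod template, and the model name, format, and runtime below are placeholders rather than values from this project.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: my-model                                # hypothetical model name
    spec:
      predictor:
        # Assumption: annotations set here are propagated to the predictor pods;
        # verify against your KServe version and adjust the placement if needed.
        annotations:
          sidecar.opentelemetry.io/inject: my-model-otelsidecar
        model:
          modelFormat:
            name: vLLM                              # assumes a vLLM-capable ServingRuntime
          runtime: vllm-runtime                     # hypothetical runtime name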

    This triggers sidecar injection with the following OpenTelemetryCollector definition:

    apiVersion: opentelemetry.io/v1beta1
    kind: OpenTelemetryCollector
    metadata:
      name: otelsidecar
    spec:
      mode: sidecar
      env:
        - name: DYNATRACE_ENDPOINT
          valueFrom:
            secretKeyRef:
              name: dynatrace-otc-secret
              key: endpoint
        - name: DYNATRACE_API_TOKEN
          valueFrom:
            secretKeyRef:
              name: dynatrace-otc-secret
              key: apiToken
      config:
        exporters:
          debug:
            verbosity: detailed
          otlphttp/dynatrace:
            endpoint: ${DYNATRACE_ENDPOINT}
            headers:
              Authorization: Api-Token ${DYNATRACE_API_TOKEN}
        processors:
          batch:
            send_batch_size: 100
            timeout: 1s
          cumulativetodelta: {}
          memory_limiter:
            check_interval: 5s
            limit_percentage: 95
            spike_limit_percentage: 25
        receivers:
          otlp:
            protocols:
              grpc:
                endpoint: '0.0.0.0:4317'
              http:
                endpoint: '0.0.0.0:4318'
          prometheus:
            config:
              scrape_configs:
                - job_name: kserve
                  scrape_interval: 5s
                  static_configs:
                    - targets:
                        - 'localhost:8000'
        service:
          pipelines:
            metrics:
              exporters:
                - otlphttp/dynatrace
                - debug
              processors:
                - memory_limiter
                - cumulativetodelta
                - batch
              receivers:
                - prometheus

    With this setup, each vLLM model deployed via KServe will automatically send detailed performance metrics to Dynatrace, helping engineering and DevOps teams monitor model latency, token throughput, and more—all within their existing observability stack.

    The full working configuration can be found in the GitHub repository.

    Instrumenting vLLM workloads

    The vLLM server provides out-of-the-box metrics that give a solid picture of the model's health and performance. Distributed tracing instrumentation is also available, but it has to be explicitly installed. Metrics are exposed in Prometheus format on the /metrics endpoint, while trace data is emitted using the OpenTelemetry protocol (OTLP). Both can be collected with the OpenTelemetry collector; the precise collector configuration was covered in the previous section. For additional configuration information, refer to the Red Hat build of OpenTelemetry documentation.
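
    As a rough illustration of the tracing side, the fragment below shows how the model server's arguments might point OTLP trace export at the sidecar collector's gRPC port. This is a sketch under assumptions: it presumes your vLLM build includes the optional OpenTelemetry tracing dependencies and supports the --otlp-traces-endpoint flag, and the container name, model path, and port are placeholders to adapt to your ServingRuntime.

    containers:
      - name: kserve-container                      # vLLM server container; the name may differ in your runtime
        args:
          - --model=/mnt/models                     # placeholder model location
          - --port=8000                             # matches the Prometheus scrape target used earlier
          # Assumption: tracing extras are installed and this flag exists in your vLLM version.
          - --otlp-traces-endpoint=http://localhost:4317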

    Creating dashboards

    To demonstrate how OpenTelemetry and Dynatrace can be used together for deep visibility into LLM inference workloads, let’s walk through a sample dashboard built for vLLM running on OpenShift AI.

    The following dashboard captures a holistic view of the system’s performance, from infrastructure-level GPU metrics to application-level token throughput, and ties it all together through actionable visualizations.

    Top-level metrics

    These metrics offer an at-a-glance understanding of the system's overall performance and cost implications.

    Operational overview

    At the top, the dashboard highlights key metrics that operations teams care about (a rough mapping to vLLM metric names follows the list):

    • Estimated inference cost (USD): A running estimate of total compute cost over time, giving you a financial lens into your AI workloads.
    • # of running requests and waiting requests: Helps track current demand and identify if requests are queuing up due to under-provisioned resources or inference slowness.
    • Request swap rate: Indicates how often the system is rotating or evicting queued requests—useful for spotting saturation patterns.
    • Average inference time: A high-level latency indicator that shows how long typical prompts take from submission to completion.
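
    If you are rebuilding these panels yourself, they map roughly onto vLLM's built-in Prometheus metrics. The mapping below is a hedged sketch: metric names can vary between vLLM versions (check your /metrics output), and the cost figure is derived rather than exported directly by vLLM.

    # Approximate panel-to-metric mapping for the operational overview (verify against your vLLM version)
    running_requests:         vllm:num_requests_running
    waiting_requests:         vllm:num_requests_waiting
    request_swap_rate:        vllm:num_requests_swapped
    average_inference_time:   vllm:e2e_request_latency_seconds    # histogram; average derived from sum/count
    estimated_inference_cost: derived                             # computed from runtime and an assumed per-GPU-hour rate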

    Service health and performance

    This section dives deeper into core LLM performance indicators (again with a hedged metric mapping after the list):

    • Throughput (tokens/second): Measures how many tokens are being processed per second. It’s a key performance indicator for the vLLM engine and can help detect bottlenecks.
    • Time to first token: Shows the delay between sending a prompt and receiving the first response token. High values here could indicate backend inefficiencies or cold starts.
    • GPU usage: Tracks how effectively your GPU cache is being used. Flat or underutilized trends can help trigger cost optimization decisions.
    • Prompt token count over time: Visualizes the size of user prompts being served. Sudden spikes can point to atypical usage or prompt injection attacks.
    • E2E request latency: This end-to-end view tracks the full life cycle latency of requests, from API ingress to token generation completion.
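
    The same caveats apply here; these panels line up with the following likely metric counterparts:

    # Approximate panel-to-metric mapping for service health (verify against your vLLM version)
    throughput_tokens_per_second: vllm:generation_tokens_total    # plotted as a rate over time
    time_to_first_token:          vllm:time_to_first_token_seconds
    gpu_cache_usage:              vllm:gpu_cache_usage_perc
    prompt_tokens_over_time:      vllm:prompt_tokens_total
    e2e_request_latency:          vllm:e2e_request_latency_seconds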

    Token dynamics and traffic analysis

    The second half of the dashboard focuses on token-level trends and traffic patterns, as illustrated in Figures 3 and 4:

    • Completion tokens over time: Mirrors the prompt token graph, giving you visibility into how long the model's completions typically are.
    • Total inference requests: Offers volume analytics over time, helping detect bursty traffic patterns, peak usage windows, or periods of inactivity.
    • Trace view: This unique hex view aggregates trace data across providers (e.g., OpenAI, VertexAI, Amazon, Ollama) to show how workloads are distributed across model sources and to surface anomalies or platform-specific degradation.
    Figure 3: OpenShift AI vLLM model performance dashboard showing estimated inference cost, running requests, GPU usage, and latency metrics over time.
    Figure 4: An extended view of the performance metrics dashboard for OpenShift AI vLLM models, including request rates, completion token trends, total inference requests, and distributed trace visualization.

    Conclusion

    As generative AI moves from experimental to essential, enterprises can no longer afford to treat observability as an afterthought. LLMs like those served through vLLM introduce unique operational challenges, from unpredictable token usage to variable GPU workloads and inference latency, that require purpose-built visibility solutions.

    By combining OpenShift AI’s scalable MLOps platform with the extensibility of OpenTelemetry and the intelligent analytics capabilities of Dynatrace, teams can gain comprehensive insight into their model performance, resource utilization, and user experience. The approach outlined in this article doesn’t just help you monitor your models; it empowers you to operate them reliably, optimize them continuously, and ensure they meet business SLAs at scale.

    This is just the beginning. As AI adoption grows and workloads become more distributed, real-time, and cost-sensitive, observability will become a competitive advantage. Investing in the right architecture now ensures you’re ready for what comes next.

    Ready to explore? Visit the redhat.com/observability page and the documentation to learn more and get started with the latest observability tools in OpenShift. Red Hat Developer's observability topic page also contains information to help you learn about and implement observability capabilities.

    We value your feedback! Share your thoughts and suggestions using the Red Hat OpenShift feedback form.

    Last updated: May 28, 2025
