Lab: Monitoring & Troubleshooting

Objective

Practice monitoring VergeOS infrastructure health, configuring alerts, analyzing system logs, and diagnosing common issues. By the end of this lab, you will be comfortable navigating the VergeOS monitoring tools and following a structured troubleshooting workflow.

Prerequisites

Completed Module 1: Architecture Fundamentals
Completed Module 4: Networking
Completed Module 5: Storage
Completed Module 9 reading (Dashboard, Alerts, Diagnostics, Escalation)
A running VergeOS cluster with admin access

Difficulty

Intermediate — Requires familiarity with the VergeOS UI and basic system administration concepts

Estimated Time

1.5 hours

Steps

Part 1: Dashboard Exploration

Familiarize yourself with the VergeOS monitoring dashboard.

Log into the VergeOS UI with administrator credentials
Navigate to the main dashboard and identify:
- Node status indicators (online, offline, maintenance)
- CPU, memory, and storage utilization graphs
- Network connectivity status
- Active alerts and notifications
Drill down into an individual node’s detail page
Review the cluster health overview and identify key metrics
Explore the storage pool status and verify all drives are healthy
Document the current resource utilization baseline for your cluster
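One way to keep the baseline from the last step is a small timestamped record that later readings can be compared against. The sketch below is a hypothetical helper, not a VergeOS API: the class `NodeBaseline`, its field names, and the node names are assumptions made for illustration, and the numbers are placeholders for values you read off the dashboard and enter by hand.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class NodeBaseline:
    """Utilization readings for one node, copied by hand from the dashboard."""
    node: str                 # hypothetical node name
    cpu_pct: float
    mem_pct: float
    storage_used_pct: float


def build_baseline(nodes, taken_at=None):
    """Bundle per-node readings into one timestamped baseline record."""
    if not nodes:
        raise ValueError("at least one node reading is required")
    taken_at = taken_at or datetime.now(timezone.utc)
    return {
        "taken_at": taken_at.isoformat(),
        "nodes": [asdict(n) for n in nodes],
        "avg_cpu_pct": round(sum(n.cpu_pct for n in nodes) / len(nodes), 1),
        "max_storage_used_pct": max(n.storage_used_pct for n in nodes),
    }


if __name__ == "__main__":
    # Placeholder readings; replace with the values from your own cluster.
    readings = [
        NodeBaseline("node1", cpu_pct=22.0, mem_pct=41.5, storage_used_pct=37.0),
        NodeBaseline("node2", cpu_pct=30.0, mem_pct=48.0, storage_used_pct=39.5),
    ]
    print(json.dumps(build_baseline(readings), indent=2))
```

Saving the JSON output alongside your lab notes gives you a fixed reference point when you judge later whether a reading is abnormal.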

Part 2: Alert Configuration

Set up alerts and notification rules.

Navigate to the alerts configuration section
Review the default alert rules and their thresholds
Create custom alert rules for:
- High CPU utilization (above 85% sustained for 5 minutes)
- Low storage capacity (less than 20% free space)
- Node connectivity loss
Configure a notification channel (email or syslog)
Test the alert notification by triggering a threshold (if possible in your lab environment)
Configure log forwarding to an external syslog server (or a local log collector)
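To make the threshold semantics in the steps above concrete, here is a minimal Python sketch of what "sustained for 5 minutes" and "less than 20% free" mean as logic. It is not how VergeOS evaluates alert rules internally; the function names and the sample format are assumptions. The last function uses Python's standard `logging.handlers.SysLogHandler` to send one UDP test message, which can help confirm that an external syslog collector is reachable. The host you pass in is your own collector's address.

```python
import logging
from logging.handlers import SysLogHandler


def cpu_alert(samples, threshold=85.0, sustain_seconds=300):
    """Return True if CPU stayed above `threshold` for at least `sustain_seconds`.

    `samples` is a time-ordered list of (unix_seconds, cpu_pct) pairs.
    """
    breach_start = None
    for ts, pct in samples:
        if pct > threshold:
            if breach_start is None:
                breach_start = ts      # first sample of a new breach
            if ts - breach_start >= sustain_seconds:
                return True
        else:
            breach_start = None        # any dip resets the window
    return False


def storage_alert(free_bytes, total_bytes, min_free_fraction=0.20):
    """Return True when free space falls below `min_free_fraction` of the pool."""
    return free_bytes / total_bytes < min_free_fraction


def send_test_syslog(host, port=514, message="lab test alert"):
    """Send one UDP syslog message so you can check that the collector receives it."""
    handler = SysLogHandler(address=(host, port))
    logger = logging.getLogger("alert-lab")   # hypothetical logger name
    logger.addHandler(handler)
    try:
        logger.warning(message)
    finally:
        logger.removeHandler(handler)
        handler.close()
```

Note that a single sample at or below the threshold resets the window, so a brief dip prevents the CPU rule from firing. Keep this in mind when you try to trigger the alert deliberately.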

Part 3: Log Analysis & Diagnostics

Practice analyzing system logs and using diagnostic tools.

Navigate to the system logs section
Filter logs by severity level (error, warning, info)
Search for specific events related to:
- VM operations (start, stop, migrate)
- Storage events (drive errors, rebalancing)
- Network events (link state changes)
Identify common error patterns and their likely causes
Use the built-in diagnostic tools to check:
- Storage subsystem health
- Network connectivity between nodes
- Service status across the cluster
Practice generating a diagnostic bundle for support escalation
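If you export log entries to a text file, the same filter and search steps can be practiced offline. The Python sketch below assumes a line layout of `<timestamp> <LEVEL> <source>: <message>`. That layout, the source names, and the sample lines are invented for this exercise and are not actual VergeOS log output, so adjust the regular expression to whatever your export contains.

```python
import re
from collections import Counter

# Assumed line layout: "<timestamp> <LEVEL> <source>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>ERROR|WARNING|INFO)\s+(?P<source>[\w-]+):\s+(?P<msg>.*)$"
)


def parse_lines(lines):
    """Yield one dict per line that matches the assumed layout; skip the rest."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            yield match.groupdict()


def filter_events(events, levels=None, keyword=None):
    """Keep events whose level is in `levels` and whose message contains `keyword`."""
    wanted = {lvl.upper() for lvl in levels} if levels else None
    needle = keyword.lower() if keyword else None
    return [
        e for e in events
        if (wanted is None or e["level"] in wanted)
        and (needle is None or needle in e["msg"].lower())
    ]


def count_by_source(events):
    """Count events per source to surface recurring error patterns."""
    return Counter(e["source"] for e in events)


# Made-up sample lines for practice only.
SAMPLE = [
    "2024-05-01T12:00:00Z INFO vm: started vm web01",
    "2024-05-01T12:01:10Z ERROR storage: drive read error on node1",
    "2024-05-01T12:01:45Z WARNING network: link down on node2 port 1",
    "2024-05-01T12:02:30Z ERROR storage: drive read error on node1",
]

if __name__ == "__main__":
    events = list(parse_lines(SAMPLE))
    errors = filter_events(events, levels=["error"])
    print(count_by_source(errors))
```

Counting errors per source is a quick way to spot a repeating pattern, such as the same drive error recurring on one node, before you dig into individual entries.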

Part 4: Troubleshooting Scenarios

Diagnose simulated issues using the tools you’ve learned.

Scenario A: Slow VM Performance — A user reports a VM is running slowly. Use the dashboard and logs to:
- Check the VM’s resource allocation and utilization
- Determine whether the host node is overcommitted
- Check storage I/O latency
- Recommend a resolution
Scenario B: Network Connectivity Issue — A tenant reports they cannot reach external networks. Investigate:
- Tenant network configuration
- Virtual network layer connectivity
- Physical network status on the host nodes
- The root cause and a recommended resolution
Scenario C: Storage Alert — The system generates a storage capacity warning. Determine:
- Which storage pool is affected
- What is consuming the most space
- Which actions to recommend (cleanup, expansion, or migration)
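The reasoning behind the three scenarios can be sketched as a few small Python helpers. These are illustrations only: the function names and input shapes are hypothetical, and the figures you feed them come from what you observe in the UI. What ratio counts as "overcommitted" depends on the workload, so the sketch reports the ratio rather than judging it.

```python
def vcpu_overcommit_ratio(vm_vcpus, physical_cores):
    """Scenario A: ratio of vCPUs allocated on a node to its physical cores."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vm_vcpus) / physical_cores


def first_failing_layer(checks):
    """Scenario B: walk (layer, passed) pairs in order; return the first failure.

    Returns None when every layer passed.
    """
    for layer, passed in checks:
        if not passed:
            return layer
    return None


def top_consumers(usage_gb, n=3):
    """Scenario C: return the `n` largest space consumers as (name, GB) pairs."""
    return sorted(usage_gb.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    # Placeholder values; substitute your own observations.
    print(vcpu_overcommit_ratio([4, 8, 4], physical_cores=8))
    print(first_failing_layer([
        ("tenant network configuration", True),
        ("virtual network layer", False),
        ("physical network", True),
    ]))
    print(top_consumers({"vm-a": 120, "snapshots": 400, "vm-b": 75}, n=2))
```

Checking the layers in a fixed order, from tenant configuration down to the physical network, keeps the investigation structured and avoids jumping to a hardware cause before the simpler configuration checks are done.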

Verification

Your monitoring and troubleshooting lab is complete when you have done all of the following:

Successfully navigated the VergeOS dashboard and identified key health metrics
Created custom alert rules with appropriate thresholds
Configured at least one notification channel (email or syslog)
Filtered and searched system logs to find specific events
Used diagnostic tools to check storage, network, and service health
Worked through at least two troubleshooting scenarios and identified root causes
Generated a diagnostic bundle suitable for support escalation