VergeOS takes a single-pane-of-glass approach to infrastructure monitoring. Every component — compute, storage, networking, and hardware health — is visible from the built-in UI without any external monitoring tools. The system provides real-time metrics, historical trends, and event logs at every level: individual nodes, clusters, vSAN tiers, networks, VMs, and tenants.
This page covers the primary monitoring surfaces you will use daily to assess system health and troubleshoot issues.
The Nodes dashboard is your primary interface for monitoring individual physical (or virtual) servers in the environment. Navigate to it via Infrastructure → Nodes, then select a specific node.
At the top of the node dashboard, you will find key status fields:
| Field | Description |
|---|---|
| Status | Current operational state — Running or Offline |
| Maintenance Mode | Whether the node is in maintenance state (workloads migrated away) |
| Last Powered On | Timestamp of the most recent boot |
| IPMI Status | Status of the Intelligent Platform Management Interface |
| IPMI Network Address | BMC/iDRAC/iLO management IP for remote access |
| System Version | VergeOS version breakdown — OS, vSAN, Appserver, and Kernel versions |
The dashboard also displays the physical hardware profile:
The CPU usage graph is one of the most frequently consulted metrics. It provides real-time and historical trend visualization with multiple breakdown categories:
| Metric | What It Shows |
|---|---|
| Total CPU | Aggregate CPU utilization across all cores |
| Core Peak | The single highest-utilized core (helps identify single-threaded bottlenecks) |
| User | Time spent in user-space processes |
| System | Time spent in kernel-space operations |
| IO Wait | Time the CPU is idle waiting for I/O operations to complete |
| VM Usage | CPU consumed by virtual machines running on this node |
| IRQ | Time spent handling hardware and software interrupts |
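
The metrics in this table are most useful when read together. The sketch below shows one way to interpret a single reading. The `CpuSample` shape and the threshold values are illustrative assumptions, not VergeOS defaults or part of any VergeOS API; the values would be copied from the graph by hand or from whatever export you have available.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    """One reading of the node CPU graph, in percent (hypothetical shape)."""
    total: float      # Total CPU
    core_peak: float  # Core Peak
    io_wait: float    # IO Wait


def interpret_cpu(sample: CpuSample,
                  busy_pct: float = 90.0,
                  iowait_pct: float = 20.0) -> list[str]:
    """Return plain-language hints for one sample.

    busy_pct and iowait_pct are illustrative thresholds (assumptions).
    """
    hints = []
    # One core pegged while the node as a whole is mostly idle.
    if sample.core_peak >= busy_pct and sample.total < busy_pct / 2:
        hints.append("single-threaded bottleneck likely")
    # All cores busy in aggregate.
    if sample.total >= busy_pct:
        hints.append("node CPU saturated")
    # CPU is idle but waiting on storage or network I/O.
    if sample.io_wait >= iowait_pct:
        hints.append("CPU stalled on I/O")
    return hints


print(interpret_cpu(CpuSample(total=20.0, core_peak=98.0, io_wait=1.0)))
```

A high Core Peak next to a low Total CPU points at a single busy thread, which adding cores will not fix; high IO Wait points at storage or network rather than compute.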
Below the CPU graph, the dashboard presents quick-reference metric cards:
The lower section of the node dashboard provides detailed views of every physical hardware component.
All physical drives attached to the node are listed with comprehensive health data:
| Column | Purpose |
|---|---|
| Status | Online / Offline indicator |
| Name | Device identifier (e.g., nvme0n1, sda) |
| Model | Manufacturer and model number |
| Tier | vSAN storage tier assignment (set at install time) |
| vSAN Drive ID | Unique identifier within the vSAN |
| Firmware | Current drive firmware version |
| Bus | Hardware bus connection type (NVMe, SATA, SAS) |
| Usage | Capacity utilization with visual progress bar |
| Repairing | Whether the drive is currently being rebuilt |
| Read/Write Errors | Error counters for proactive health monitoring |
You can click any drive to access its S.M.A.R.T. diagnostics — reallocated sectors, temperature, power-on hours, wear leveling, and other predictive failure indicators.
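
Because the Read/Write Errors counters matter most when they are rising, it can help to compare two readings taken some time apart. The following is a minimal sketch under the assumption that you record the counters yourself (for example, by noting them during a daily check); the dictionary layout and function name are hypothetical and not a VergeOS interface.

```python
def drives_with_new_errors(previous: dict[str, tuple[int, int]],
                           current: dict[str, tuple[int, int]]) -> list[str]:
    """Return drive names whose read or write error counter increased.

    Each snapshot maps a drive name (e.g. "nvme0n1") to a
    (read_errors, write_errors) pair. A drive missing from the previous
    snapshot is compared against (0, 0).
    """
    flagged = []
    for name, (read_now, write_now) in sorted(current.items()):
        read_before, write_before = previous.get(name, (0, 0))
        if read_now > read_before or write_now > write_before:
            flagged.append(name)
    return flagged


yesterday = {"nvme0n1": (0, 0), "sda": (2, 0)}
today = {"nvme0n1": (0, 0), "sda": (5, 0), "sdb": (0, 1)}
print(drives_with_new_errors(yesterday, today))
```

A drive with a stable nonzero counter is not flagged here; only growth between the two snapshots is reported, which matches the guidance to act on increasing errors.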
Every NIC in the node is listed by interface name (e.g., enp2s0f0) along with its operational and fabric status.
Further hardware sections on the node dashboard include:
| Section | What It Shows |
|---|---|
| Memory Modules | Installed RAM — module count, capacity, type, and specifications |
| LLDP Neighbors | Link Layer Discovery Protocol data — connected switch, port mappings, network topology |
| PCI Devices | All PCI/PCIe devices with bus assignments and passthrough availability |
| SR-IOV NIC Devices | Virtual function count and assignment status for SR-IOV capable NICs |
| NVIDIA vGPU Devices | GPU model, vGPU profiles available, and allocation status |
| USB Devices | Connected USB devices with passthrough capability |
The core fabric is the backbone network connecting all VergeOS nodes. Monitoring fabric health is critical because fabric degradation impacts vSAN replication, VM live migration, and inter-node communication.
On each node’s NIC table, look for the Fabric Status column:
Every node dashboard includes an Event Logs section that displays system events scoped to that node. Events are classified by severity level:
| Level | Description | Examples |
|---|---|---|
| Error | Critical issues requiring immediate attention | Drive failure, node offline, vSAN degraded |
| Warning | Conditions that may lead to problems if unaddressed | Temperature threshold exceeded, drive errors increasing |
| Info | Normal operational events | Power state changes, maintenance mode transitions, VM migrations |
For example, the message "Core has reached warning temperature '96 / 95'" indicates that a CPU core exceeded the configured threshold (default 95°C).
Each log entry includes the timestamp, source (e.g., node1), and a detailed message. Click View More to access the full log history.
The left-side menu on the node dashboard provides essential management operations:
Always enable maintenance mode before performing hardware changes or system updates. When you place a node in maintenance mode, VergeOS automatically live-migrates all running workloads to other nodes in the cluster, ensuring zero downtime for VMs and services.
A remote console option provides direct console access to the node via IPMI/iDRAC/iLO. Use it when the VergeOS UI is unreachable or when you need BIOS-level access.
While node dashboards show individual server health, cluster views provide aggregate resource utilization across all nodes in a cluster.
Navigate to System → Clusters to see:
Cluster views are essential for capacity planning — they help you identify when a cluster is approaching resource limits and when it is time to scale out with additional nodes.
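
One common capacity-planning question is whether the cluster could still hold its workloads if one node were taken out, for example for maintenance. The sketch below is a deliberately simple aggregate check on RAM only. It is an assumption-laden illustration: it ignores CPU, per-VM placement, and any reservations, and it does not describe how VergeOS itself decides where workloads can run. The function name and inputs are hypothetical.

```python
def can_lose_one_node(ram_used_gb: list[float],
                      ram_total_gb: list[float]) -> bool:
    """True if RAM in use cluster-wide would fit without the largest node.

    Both lists hold one entry per node, in the same order.
    """
    if len(ram_used_gb) != len(ram_total_gb):
        raise ValueError("need one used/total pair per node")
    if len(ram_total_gb) < 2:
        raise ValueError("a cluster needs at least two nodes for this check")
    # Worst case: the node with the most RAM is the one removed.
    remaining_capacity = sum(ram_total_gb) - max(ram_total_gb)
    return sum(ram_used_gb) <= remaining_capacity


print(can_lose_one_node([100, 100, 100], [256, 256, 256]))
print(can_lose_one_node([200, 200], [256, 256]))
```

When a check like this starts failing, the cluster is at the point where adding a node is worth planning.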
The vSAN status is accessible from the main dashboard by clicking the vSAN Tiers count box, or via System → vSAN.
Key indicators include whether each tier is working (working=true) or degraded.
Network health is monitored from Networks in the main navigation. For each network (external, internal, DMZ, tenant networks), you can view:
Each node dashboard includes a Running Machines section showing all active workloads:
| Column | Description |
|---|---|
| Status | Running state indicator |
| Type | Virtual Machine, vNet Container, or system service |
| Name | Machine identifier |
| CPU Cores | Number of assigned cores |
| CPU Usage | Current processor utilization percentage |
| RAM | Allocated memory with utilization percentage |
| Last Started | Timestamp of when the workload was started |
Common machine types include VMs, vNet containers (network services), and system services (NAS, DMZ, External network, etc.).
Daily Health Checks
Review node temperatures, drive error counters, and fabric status across all nodes. Address any Not Confirmed fabric NICs or increasing drive errors immediately.
Use Maintenance Mode
Always enable maintenance mode before hardware changes, firmware updates, or system updates. This ensures workloads are live-migrated away before you touch the node.
Monitor vSAN Capacity
Keep tier utilization below 85% to maintain performance headroom. Configure subscription alerts (covered in the next section) to notify you before reaching throttling thresholds.
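
The 85% guideline is simple arithmetic, sketched below. The tier numbers, capacities, and function names are hypothetical inputs you would read from the vSAN tier view; only the 85% figure comes from the guidance above.

```python
def tiers_over_limit(tier_usage_pct: dict[int, float],
                     limit_pct: float = 85.0) -> list[int]:
    """Return tier numbers whose utilization is at or above the limit."""
    return sorted(t for t, pct in tier_usage_pct.items() if pct >= limit_pct)


def headroom_gb(used_gb: float, capacity_gb: float,
                limit_pct: float = 85.0) -> float:
    """Capacity left before the tier reaches the limit (negative if past it)."""
    return capacity_gb * limit_pct / 100.0 - used_gb


print(tiers_over_limit({0: 40.0, 1: 88.2, 3: 85.0}))
print(headroom_gb(used_gb=700.0, capacity_gb=1000.0))
```

Tracking headroom in absolute terms alongside the percentage makes it easier to judge how soon a growing tier will cross the line.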
Review Logs Regularly
Periodically check event logs for temperature warnings, drive errors, and unexpected state changes. Catching issues early prevents cascading failures.
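
When reviewing copied or exported log text, temperature warnings can be picked out mechanically. The sketch below parses messages shaped like the example quoted earlier on this page, "Core has reached warning temperature '96 / 95'". The pattern is generalized from that single example, and reading the first number as the measured value and the second as the threshold is an assumption; other messages may be worded differently.

```python
import re
from typing import Optional

# Pattern inferred from one sample message (assumption).
_TEMP_RE = re.compile(r"warning temperature '(\d+)\s*/\s*(\d+)'")


def parse_temperature_warning(message: str) -> Optional[tuple[int, int]]:
    """Return (reading, threshold) in degrees C, or None if no match."""
    match = _TEMP_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "Core has reached warning temperature '96 / 95'"
parsed = parse_temperature_warning(sample)
if parsed is not None:
    reading, threshold = parsed
    print(f"{reading - threshold} degree(s) over the {threshold} C threshold")
```

Counting how often such messages appear per node over a week is one way to tell a transient spike from a cooling problem that needs attention.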