# DATTO Backup Performance Analysis Report

**Prepared for:** Management Review  
**Date:** December 17, 2025  
**Prepared by:** IT Infrastructure Team  
**Subject:** Investigation of Slow VMware VM Backup Performance to DATTO Appliance

---

## Executive Summary

This investigation set out to determine the root cause of extremely slow backup speeds (2-5 MB/s) when backing up VMware virtual machines to the DATTO backup appliance. Although the switch ports on the backup path run at 10 Gbps (1,250 MB/s theoretical), backups run at less than 1% of that capacity. Even the 1 Gbps inter-VLAN router link on this path (125 MB/s theoretical) is used at no more than 4%.

**Key Finding:** The network infrastructure (HP switch, cabling, VLANs) has been ruled out as the cause. The bottleneck has been identified as the DATTO backup agent software running inside the Windows virtual machines, specifically the MercuryFTP protocol used for data transfer.

**Recommendation:** Engage DATTO support with the evidence documented in this report to resolve the software-level performance issue.

---

## Problem Statement

| Metric | Expected | Actual | Gap |
|--------|----------|--------|-----|
| Network Capacity | 10 Gbps (1,250 MB/s) | - | - |
| Practical Throughput | 100-500 MB/s | 2-5 MB/s | **99% under capacity** |
| 8 TB File Server Backup | 4-8 hours | 24-48+ hours | 6-12x longer |

The slow backup speeds are causing:
- Extended backup windows overlapping with business hours
- Incomplete backup jobs
- Increased risk of data loss due to stale recovery points

---

## Infrastructure Overview

### Network Topology

```
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│                           HP 5406R zl2 Switch                           │
│                         (10Gbps Infrastructure)                         │
│                                                                         │
│      ┌──────────────┐                           ┌──────────────┐        │
│      │   CMIFS02    │                           │  DATTOBU02   │        │
│      │ File Server  │                           │    Backup    │        │
│      │    8.7 TB    │                           │  Appliance   │        │
│      └──────┬───────┘                           └──────┬───────┘        │
│             │                                          │                │
│          VLAN 212                                   VLAN 250            │
│       (FileServer)                              (IT-Management)         │
│             │                                          │                │
│      ┌──────┴───────┐                           ┌──────┴───────┐        │
│      │   Port F2    │                           │   Port E5    │        │
│      │   10 Gbps    │                           │   10 Gbps    │        │
│      │  Status: UP  │                           │  Status: UP  │        │
│      └──────┬───────┘                           └──────┬───────┘        │
│             │                                          │                │
│             │         ┌──────────────┐                 │                │
│             │         │   Port A20   │                 │                │
│             └────────►│    1 Gbps    │◄────────────────┘                │
│                       │    Router    │                                  │
│                       │ (Inter-VLAN) │                                  │
│                       └──────────────┘                                  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Device Identification

| Device | MAC Address | Switch Port | Speed | VLAN | Status |
|--------|-------------|-------------|-------|------|--------|
| CMIFS02 (File Server) | 00:50:56:8F:35:77 | F2 | 10 Gbps | 212 | Up |
| DATTOBU02 (Backup) | 6C:92:CF:17:BD:20 | E5 | 10 Gbps | 250 | Up |
| Router/Firewall | Multiple | A20 | 1 Gbps | Multi | Up |

---

## Investigation Results

### 1. HP Switch Analysis - PASSED

System health, port counters, and configuration on the HP 5406R zl2 switch were reviewed, and the switch was **cleared of any issues**.

#### System Health
| Metric | Value | Assessment |
|--------|-------|------------|
| Uptime | 242 days | Stable |
| CPU Utilization | 0% | Excellent |
| Memory Free | 72% | Excellent |
| Packet Buffers Missed | **0** | No packet drops |

#### Port Status
| Port | Device | Link Speed | Errors | Drops |
|------|--------|------------|--------|-------|
| E5 | DATTO Appliance | 10 Gbps | None | None |
| F2 | VMware ESXi Host | 10 Gbps | None | None |
| A20 | Router | 1 Gbps | None | None |

#### Configuration Review
| Setting | Configuration | Impact on Backups |
|---------|--------------|-------------------|
| QoS / Rate Limiting | None configured | No throttling |
| Port Security | No restrictions | No blocking |
| Spanning Tree | Disabled | No blocked ports |
| Broadcast Limits | None (0) | No limits |
| Flow Control | Off (normal) | No impact |

**Conclusion:** The switch is operating normally, with zero missed packet buffers and no throttling mechanisms configured.

---

### 2. VMware Performance Analysis - PASSED

Real-time performance was monitored through the vCenter Performance API while a backup was running.

#### During Active Backup (CMIFS01)
| Metric | Value | Assessment |
|--------|-------|------------|
| Disk Read Speed | 53-76 MB/s | Good - VM reading data quickly |
| Disk Latency | 2 ms | Excellent - no storage bottleneck |
| CPU Usage | <10% | Good - not CPU bound |
| **Network TX** | **0.4-0.5 MB/s** | **BOTTLENECK IDENTIFIED** |

#### Historical Analysis (30 Days - CMIFS02)
| Metric | Average | Maximum | Assessment |
|--------|---------|---------|------------|
| CPU Usage | 5.7% | 10.4% | No issues |
| Disk Latency | 1.5 ms | 15 ms | Excellent |
| Memory Usage | Normal | Normal | No issues |

**Critical Finding:** The VM reads from disk at up to **76 MB/s** but transmits only **0.5 MB/s** to the network. This is roughly a **150:1 ratio**, indicating that the bottleneck is inside the VM, not in the network.
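
The read-to-transmit ratio can be reproduced from raw counter samples. The sketch below is illustrative only: it assumes the samples are exported in KBps (the unit vCenter typically reports for disk read and network transmit rates) and uses decimal units (1 MB = 1,000 KB). The function names are hypothetical, and the two sample values are simply this report's peak figures, not an actual export.

```python
def kbps_to_mb_per_s(kbps: float) -> float:
    """Convert a KBps counter value to MB/s (decimal units, an assumption)."""
    return kbps / 1000.0


def read_to_tx_ratio(disk_read_kbps: float, net_tx_kbps: float) -> float:
    """Return how many times faster the VM reads from disk than it transmits."""
    if net_tx_kbps <= 0:
        raise ValueError("network TX rate must be positive")
    return disk_read_kbps / net_tx_kbps


# Peak figures from this report, expressed in KBps.
disk_read_kbps = 76_000.0  # 76 MB/s disk read
net_tx_kbps = 500.0        # 0.5 MB/s network TX

ratio = read_to_tx_ratio(disk_read_kbps, net_tx_kbps)
print(
    f"{kbps_to_mb_per_s(disk_read_kbps):.0f} MB/s read vs "
    f"{kbps_to_mb_per_s(net_tx_kbps):.1f} MB/s TX -> {ratio:.0f}:1"
)
```

With these inputs the ratio is 152:1, which the report rounds to 150:1.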

---

### 3. DATTO Appliance Analysis - ISSUES FOUND

Review of DATTO appliance logs revealed multiple problems:

| Issue | Description | Severity |
|-------|-------------|----------|
| Zpool Capacity Exceeded | Storage pool at or near capacity | High |
| High CPU Load | "Load average exceeds 2x number of cores" | High |
| HIR Failures | "Failed to copy bootmgfw.efi" on Windows Server 2025 | Medium |
| Backups Paused | Some agents showing "paused indefinitely" | High |

#### DATTO Backup Method
The DATTO appliance backs up these VMs through the **in-guest Windows agent** over the **MercuryFTP protocol** (a TLS-encrypted proprietary transfer). It does NOT use the VMware-native backup APIs (VADP).

Example from DATTO agent log:
```
Transport: MercuryFTP (TLS)
Backup Speed: 0.57 MB/s
```

---

## Root Cause Analysis

### Eliminated Causes

| Potential Cause | Evidence | Status |
|-----------------|----------|--------|
| HP Switch | 0% CPU, 0 dropped packets, 10 Gbps links up | **Eliminated** |
| Network Cabling | All 10 Gbps ports negotiating 10GigFD | **Eliminated** |
| VLAN Configuration | Correct tagging, routing functional | **Eliminated** |
| VMware Storage | 2 ms latency, 76 MB/s read speed | **Eliminated** |
| VMware CPU | <10% utilization during backup | **Eliminated** |
| ESXi Host | 10 Gbps uplinks, no errors | **Eliminated** |
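
The network eliminations above rest on link state and switch counters. An active raw-TCP throughput test is one way to corroborate them independently of MercuryFTP. The following is a minimal sketch, not a procedure used in this investigation: the function names are hypothetical, and for self-containment it runs both sender and receiver on one machine over loopback. To exercise the real path, the receiver would need to run on a host in VLAN 250 and the sender inside the VM.

```python
import socket
import threading
import time


def _sink(server: socket.socket, result: dict) -> None:
    """Accept one connection and count every byte received until EOF."""
    conn, _ = server.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(1 << 20)
            if not chunk:
                break
            total += len(chunk)
    result["bytes"] = total


def measure_tcp_throughput(total_bytes: int = 64 * 1024 * 1024,
                           host: str = "127.0.0.1") -> float:
    """Send total_bytes over a plain TCP socket and return the rate in MB/s."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    result: dict = {}
    receiver = threading.Thread(target=_sink, args=(server, result))
    receiver.start()

    payload = b"\0" * (1 << 20)  # 1 MiB of zeros per send
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        while sent < total_bytes:
            client.sendall(payload)
            sent += len(payload)
    receiver.join()  # wait until the receiver has drained everything
    elapsed = time.perf_counter() - start
    server.close()

    if result.get("bytes") != sent:
        raise RuntimeError("receiver did not get all bytes")
    return sent / elapsed / 1_000_000


if __name__ == "__main__":
    print(f"Raw TCP throughput: {measure_tcp_throughput():.0f} MB/s")
```

If a plain TCP stream between the same two endpoints sustains far more than 2-5 MB/s, that supports the conclusion that the limit sits in the backup software rather than in the path.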

### Confirmed Root Cause

**DATTO Windows Agent / MercuryFTP Protocol Performance**

Evidence:
1. VM disk reads at 76 MB/s, network transmits at 0.5 MB/s (roughly a 150:1 ratio)
2. Bottleneck occurs between disk read and network transmission inside the VM
3. DATTO appliance shows resource constraints (zpool at or near capacity, high CPU load)
4. Windows Server 2025 compatibility issues with the DATTO HIR process

---

## Bandwidth Utilization Analysis

```
Available Bandwidth vs. Actual Usage

10 Gbps ─┬─────────────────────────────────────────────────── 1,250 MB/s
         │
         │
 1 Gbps ─┼─────────────────────────────────────────────────── 125 MB/s
         │  (Router inter-VLAN link - theoretical max for this path)
         │
         │
         │
100 MB/s ┼───────────────────────────────────────────────────
         │
         │
 10 MB/s ┼───────────────────────────────────────────────────
         │
  5 MB/s ┼─ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ Peak observed
         │
  2 MB/s ┼─ ████████████████████████████████ Average observed
         │
  0 MB/s ┴───────────────────────────────────────────────────

Actual backup speed: 2-5 MB/s
  = 0.16-0.4% of 10 Gbps link capacity
  = 1.6-4% of the 1 Gbps routed path
```
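
The utilization percentages follow from converting link speed to MB/s (divide bits by 8) and dividing the observed rate by the result. A short sketch of that arithmetic, assuming decimal units (1 Gbps = 1,000 Mbps) and ignoring protocol overhead; the function names are illustrative:

```python
def gbps_to_mb_per_s(gbps: float) -> float:
    """Theoretical link capacity in MB/s (decimal units, no protocol overhead)."""
    return gbps * 1000 / 8


def utilization_pct(observed_mb_per_s: float, link_gbps: float) -> float:
    """Observed throughput as a percentage of theoretical link capacity."""
    return observed_mb_per_s / gbps_to_mb_per_s(link_gbps) * 100


# Observed backup speed range from this report: 2-5 MB/s.
for link_gbps in (10, 1):
    low = utilization_pct(2, link_gbps)
    high = utilization_pct(5, link_gbps)
    print(f"{link_gbps:>2} Gbps link "
          f"({gbps_to_mb_per_s(link_gbps):,.0f} MB/s): "
          f"{low:.2f}%-{high:.2f}% utilized")
```

Measured against the 10 Gbps ports this gives 0.16%-0.40%; measured against the 1 Gbps inter-VLAN link it gives 1.60%-4.00%.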

---

## Business Impact

### Current State
| Metric | Value |
|--------|-------|
| CMIFS02 Data Volume | ~8.7 TB |
| Current Backup Speed | 2-5 MB/s |
| Full Backup Duration | 20-50 days (theoretical) |
| Incremental Backup Duration | Variable, often exceeds backup window |
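
The 20-50 day figure for a full backup can be checked directly from the data volume and the observed speed range. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and a constant transfer rate with no compression or deduplication; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def backup_days(volume_tb: float, speed_mb_per_s: float) -> float:
    """Days needed to move volume_tb at a constant speed_mb_per_s."""
    if speed_mb_per_s <= 0:
        raise ValueError("speed must be positive")
    seconds = volume_tb * 1_000_000 / speed_mb_per_s
    return seconds / SECONDS_PER_DAY


volume_tb = 8.7  # CMIFS02 data volume from the table above
for speed in (5, 2):  # observed peak and average speeds in MB/s
    print(f"{volume_tb} TB at {speed} MB/s: "
          f"{backup_days(volume_tb, speed):.1f} days")
```

At 5 MB/s the transfer takes about 20 days, and at 2 MB/s about 50 days, matching the range in the table.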

### Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Incomplete backups | High | High | Resolve DATTO performance |
| Data loss in disaster | Medium | Critical | Resolve DATTO performance |
| Backup window overlap with production | High | Medium | Resolve DATTO performance |

---

## Recommendations

### Immediate Actions

1. **Open DATTO Support Ticket**
   - Provide this report as evidence
   - Request investigation of MercuryFTP protocol performance
   - Request review of appliance capacity (zpool full)
   - Inquire about Windows Server 2025 compatibility

2. **DATTO Appliance Maintenance**
   - Address "Zpool capacity exceeded" warning
   - Review and clear old recovery points if possible
   - Investigate "backups paused indefinitely" status

### Questions for DATTO Support

1. Why is MercuryFTP only achieving 0.5 MB/s when the switch links support 1,000+ MB/s?
2. Can the backup method be changed to use VMware VADP (agentless) instead of the in-guest agent?
3. Is Windows Server 2025 fully supported? (HIR failures observed)
4. What is the recommended resolution for "Zpool capacity exceeded"?
5. Are there tuning parameters for MercuryFTP transfer speeds?

### Alternative Solutions (If DATTO Cannot Resolve)

| Solution | Pros | Cons |
|----------|------|------|
| Veeam Backup & Replication | Native VMware VADP support, proven fast | Licensing cost, migration effort |
| Nakivo Backup | VMware-native, competitive pricing | Migration effort |
| VMware-level DATTO backup | Uses VADP instead of in-guest agent | May require DATTO configuration change |

---

## Appendix A: Switch Configuration Summary

```
Switch Model: HP 5406R zl2 (J9850A)
Firmware: KB.16.11.0020 (July 2024)
Management Modules: Dual (Active/Standby)

Key Ports:
- E5 (DATTOBU02): 10GbE-T, VLAN 250 untagged
- F2 (ESXi Host): 10GbE-T, VLAN 212 tagged
- A20 (Router): 1GbE, Multi-VLAN tagged

No QoS, rate limiting, or traffic shaping configured.
```

---

## Appendix B: Evidence Summary

| Evidence Type | Source | Finding |
|---------------|--------|---------|
| Switch CPU/Memory | `show system` | 0% CPU, 72% memory free |
| Packet Drops | `show system` | 0 buffers missed |
| Port Status | `show interfaces brief` | All 10 Gbps links up |
| VM Disk Performance | vCenter API | 76 MB/s read, 2 ms latency |
| VM Network Performance | vCenter API | 0.5 MB/s TX during backup |
| DATTO Logs | Appliance UI | Zpool full, high CPU, HIR failures |
| Backup Speed | DATTO Agent | 2-5 MB/s via MercuryFTP |

---

## Appendix C: Additional Switch Findings (Unrelated to Backup)

During the investigation, the following items were noted for separate remediation:

1. **Brute Force Login Attempts (12/14/2025)**
   - Source: 10.254.50.24
   - Usernames: "admin", "Cisco"
   - Recommendation: Identify and investigate this device

2. **Port A24 Link Flapping (12/16/2025)**
   - Third-party SFP+ DAC cable showing intermittent connectivity
   - Recommendation: Replace cable

3. **DATTOBU01 (Port F5) Offline**
   - Second DATTO appliance not connected
   - Recommendation: Verify whether this is intentional or the appliance needs to be reconnected

---

**Report End**

*This report was prepared using data collected from the HP switch CLI, the VMware vCenter Performance API, and DATTO appliance logs. All network infrastructure components have been verified as functioning correctly. The performance issue has been isolated to the DATTO backup software layer.*