
The Ticket: One Missed Reboot Took Down a Production VM for a Week — Here's What We Learned


Step 1 — Check the VM from the Azure Portal

First stop — the Azure Portal.

Azure Portal → Virtual Machines → [VM Name] → Overview

The VM was showing as Running. Power state was fine — the issue was clearly inside the OS.
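For anyone who prefers the CLI to the portal, the same power-state check can be done with the Azure CLI. A quick sketch, with placeholder resource group and VM names (requires an authenticated `az` session):

```shell
# Show only the PowerState status from the VM's instance view.
# "myResourceGroup" and "myVM" are placeholders.
az vm get-instance-view \
  --resource-group myResourceGroup \
  --name myVM \
  --query "instanceView.statuses[?starts_with(code, 'PowerState/')].displayStatus" \
  --output tsv
```

A "VM running" result here only confirms the platform-level power state; it says nothing about whether the OS inside is healthy.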

Step 2 — Try the Serial Console

Since SSH was completely dead, we opened the Azure Serial Console to look directly inside the VM.

Azure Portal → Virtual Machines → [VM Name] → Serial Console

Result: No output. Complete silence.

The console was returning nothing — which told us this was deeper than a simple OS hang. The VM was not even reaching a stage where it could output anything to the screen.
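When the Serial Console is completely silent, it can still be worth pulling whatever boot diagnostics captured, since the screenshot and log are collected by the platform rather than by the guest OS. A sketch with placeholder names, assuming boot diagnostics is enabled on the VM:

```shell
# Retrieve the boot log captured by Azure boot diagnostics.
# "myResourceGroup" and "myVM" are placeholders.
az vm boot-diagnostics get-boot-log \
  --resource-group myResourceGroup \
  --name myVM
```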

Step 3 — Reboot the VM

We performed a restart from the Azure Portal, hoping a clean reboot would bring it back.

Azure Portal → Virtual Machines → [VM Name] → STOP → START

Result: No change.

The VM came back to a running state but remained completely inaccessible. Serial Console still silent. SSH still dead.
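For reference, the portal's STOP → START is a full deallocate-then-start cycle, which typically lands the VM on different host hardware. The CLI equivalent, with placeholder names:

```shell
# Deallocate releases the underlying host; start reallocates the VM.
az vm deallocate --resource-group myResourceGroup --name myVM
az vm start --resource-group myResourceGroup --name myVM
```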

Step 4 — Try Restoring from Azure Backup

Since the VM was not responding to anything, we decided to try restoring it from backup before escalating further.

Azure Portal → Backup Center → [VM Name] → Restore

We had multiple restore points available. We selected the most recent one and initiated the restore.

Result: Restore failed.

We tried the next restore point. Failed again. And the next one. Failed again.

Every single restore point was returning errors. Nothing was completing successfully. We were getting nowhere fast.

Step 5 — Raise a Case with Microsoft

With the VM unresponsive and backups failing, we had no choice but to escalate to Microsoft Azure Support.

After investigating on their end, Microsoft came back with a critical finding:

"The EFI system partition for this VM has been altered."

This was the breakthrough we needed. The EFI system partition does not change on its own — something must have modified it. Time to dig into the history.

Step 6 — Investigate the Change History

We went back and checked every ticket ever raised against this VM.

We found it.

One week ago, two changes had been made:

Change 1 — Azure Admin Team: OS disk was expanded from 64 GB to 128 GB from the Azure Portal.

Change 2 — Linux Team: After the disk expansion, the Linux team went inside the VM and expanded the /var partition from 4 GB to 10 GB using Linux disk management tools.
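The exact commands are not recorded in the ticket, but on a typical LVM-backed Azure Linux image the expansion would have looked roughly like this (the device, volume group, and logical volume names are assumptions):

```shell
# Rewrite the on-disk partition table so partition 2 uses the new space
# (growpart comes from the cloud-utils package).
sudo growpart /dev/sda 2

# Let LVM see the larger physical volume, then grow /var's logical
# volume to 10 GiB and resize its filesystem in one step (-r).
sudo pvresize /dev/sda2
sudo lvextend -r -L 10G /dev/rootvg/varlv
```

Nothing in this sequence is wrong by itself; the problem came from what was skipped afterwards.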

The critical mistake: The server was never rebooted after these changes.

What Actually Happened?

When the Linux team expanded /var on a live system, the partition table was modified on disk. But because the server was never rebooted, the OS continued running with the old partition layout in memory.

The EFI system partition — which tells the system exactly how to boot — was affected by this mismatch. When the VM eventually restarted due to a platform-level event, it tried to boot using a layout that no longer matched what was physically on disk.

Because we only discovered the issue one week after it happened, every single backup in that period had captured the VM in its broken state. Restoring from any of these backups simply restored the broken VM again.
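A mismatch like this can be caught before an unplanned restart by comparing the kernel's in-memory view of the disk with the partition table actually on disk. A sketch, assuming the OS disk is /dev/sda:

```shell
# What the running kernel currently believes:
cat /proc/partitions

# What the partition table on disk actually says, read fresh from the device:
sudo fdisk -l /dev/sda

# If the two disagree, the kernel is running on a stale layout.
# partprobe asks the kernel to re-read the table; a reboot does the same.
sudo partprobe /dev/sda
```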

Step 7 — Find a Clean Restore Point

We had to go much further back — before the disk expansion ever happened.

We carefully checked the dates on all available restore points against the original change ticket timestamp. After searching through the full backup history, we finally found a restore point taken before the OS disk was expanded from 64 GB to 128 GB.
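The comparison itself is mechanical once you have the timestamps. The dates below are hypothetical, but the logic is exactly what we did by hand; in practice the restore-point list would come from `az backup recoverypoint list`. Since ISO-8601 timestamps sort lexicographically, a plain string comparison is enough:

```shell
# Hypothetical timestamps for illustration only.
CHANGE_TS="2024-05-01T09:00:00"   # when the disk-expansion ticket was executed
RESTORE_POINTS="2024-05-06T02:00:00 2024-05-03T02:00:00 2024-04-28T02:00:00"

CLEAN_POINTS=()
for rp in $RESTORE_POINTS; do
  # ISO-8601 strings sort lexicographically, so < compares dates correctly.
  if [[ "$rp" < "$CHANGE_TS" ]]; then
    CLEAN_POINTS+=("$rp")
  fi
done
echo "Pre-change restore points: ${CLEAN_POINTS[*]}"
```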

This was our last hope.

Step 8 — Restore from the Old Clean Backup

We initiated the restore from this older backup point.

Azure Portal → Backup Center → [VM Name] → Restore → [Pre-Expansion Restore Point]

Result: Success.

The VM was restored to its pre-expansion state with:

- Original 64 GB OS disk
- Original /var partition at 4 GB
- EFI system partition intact and clean
- VM booting normally

SSH access was restored, and the application team confirmed services were back online.
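For completeness, the CLI path for the same restore is `az backup restore restore-disks`, which materializes the backed-up disks into a staging storage account rather than overwriting the VM in place. A sketch where every resource name is a placeholder:

```shell
# Vault, container/item, restore-point, and storage account names
# are all placeholders here.
az backup restore restore-disks \
  --resource-group myResourceGroup \
  --vault-name myRecoveryVault \
  --container-name myVM \
  --item-name myVM \
  --rp-name myRestorePointName \
  --storage-account mystagingstore
```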

Key Takeaways for Azure or Linux Admins

Always reboot after disk or partition changes — never leave a production server running with an unconfirmed partition layout. Better still, consider attaching an additional data disk and mounting it, instead of expanding the OS disk. Ever. Always take a snapshot before making any changes. In this case we had taken a snapshot, but per policy we do not retain snapshots beyond 6 days, so it had already been deleted.
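A pre-change snapshot of a managed disk is cheap to take and, unlike a vault restore point, its retention is entirely under your control. A sketch with placeholder names:

```shell
# Look up the OS disk ID, then snapshot it before the change window.
# All resource names here are placeholders.
OS_DISK_ID=$(az vm show \
  --resource-group myResourceGroup \
  --name myVM \
  --query "storageProfile.osDisk.managedDisk.id" \
  --output tsv)

az snapshot create \
  --resource-group myResourceGroup \
  --name myVM-pre-expansion-snap \
  --source "$OS_DISK_ID"
```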

The EFI system partition is sacred — any disk operation that touches partition tables can silently affect it.

Backups are only as good as what they capture — if your VM is silently broken, every backup is backing up a broken VM.