Description
Bug description
If a CSV is created on a four-node cluster and the fourth node is later evicted, the CSV that was created while the cluster had four nodes is left with the wrong number of columns. The CSV then becomes a time bomb: workloads eventually seize up once enough data is added or changed on it, and expanding the CSV causes it to fail, resulting in data loss.
Any attempt to bring the CSV online produces either "Access Denied" errors or error code 0x8007054f, "An internal error occurred".
Repro steps
- Join a node to a 3-node Azure Local cluster to make it a 4-node cluster.
- Create a new CSV while the fourth node is joined and operational. Make sure the fourth node owns the CSV for a little while.
- Move ownership of the newly created fourth CSV to one of the other three nodes.
- Evict the fourth node from the cluster using the remove-mocphysicalnode and remove-clusternode -cleanupdisks commands (as if the cluster were being permanently shrunk).
- Take the fourth node offline.
- Expand the newly created fourth CSV by some amount (such as 5 TB) using Windows Admin Center (WAC).
- Observe that the newly created CSV goes into a failed state and any VMs running on it also fail.
(Note that all CSVs were encrypted; this may or may not be reproducible with unencrypted CSVs.)
Expected behavior
- Expanding a CSV should not cause it to go offline. Nor should a newly added CSV have a column count that differs from the other CSVs, which causes workloads to eventually seize up and fail.
- Mismatched column counts between CSVs should be corrected automatically when nodes are added to or removed from the cluster. There should also be code that detects CSVs with mismatched columns and corrects them without user input, treating this no differently from an automatic array repair after a disk is replaced.
- The node eviction may have been done improperly. Where is the documentation for Azure Local?
- Attempts to bring the CSV online with Hyper-V tools should not produce errors, nor should attempts to bring a failed CSV back online fail.
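To illustrate the kind of automatic detection requested above, here is a minimal sketch in Python. It is not how Azure Local or Storage Spaces Direct works internally. It assumes the per-volume column counts have already been exported by some other means (for example, from the NumberOfColumns property that Get-VirtualDisk reports), and the VolumeInfo type, the find_column_mismatches helper, and the sample names and numbers are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeInfo:
    """Hypothetical record: one CSV and its reported column count."""
    name: str
    columns: int


def find_column_mismatches(volumes):
    """Return the volumes whose column count differs from the most common one.

    Assumption: the most common column count across all CSVs is treated as
    the baseline. A real check would need the count that is actually valid
    for the current number of nodes and disks.
    """
    if not volumes:
        return []
    counts = Counter(v.columns for v in volumes)
    baseline, _ = counts.most_common(1)[0]
    return [v for v in volumes if v.columns != baseline]


# Made-up sample data, not taken from the affected cluster.
volumes = [
    VolumeInfo("CSV1", 3),
    VolumeInfo("CSV2", 3),
    VolumeInfo("CSV3", 3),
    VolumeInfo("CSV4", 4),
]

for v in find_column_mismatches(volumes):
    print(f"{v.name}: {v.columns} columns differs from the other CSVs")
```

A check like this only flags the mismatch. Correcting it automatically, as requested, would still have to be done by the storage stack itself.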
Environment (please complete the following information):
Build 12.2512.1002.16
4 node cluster
Production
East US