Recovery on disk and node failures
ECS continuously monitors the health of the nodes, their disks, and objects stored in the cluster. ECS disperses data protection responsibilities across the cluster and automatically reprotects at-risk objects when nodes or disks fail.
Disk health
ECS reports disk health as Good, Suspect, or Bad.
- Good: The partitions of the disk can be read from and written to.
- Suspect: The disk has not yet met the threshold to be considered bad.
- Bad: The disk has met a threshold of declining hardware performance. Once that threshold is met, no data can be read from or written to the disk.
ECS writes only to disks in good health. ECS does not write to disks in suspect or bad health. ECS reads from good disks and suspect disks. When two of an object’s chunks are located on suspect disks, ECS writes the chunks to other nodes.
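The disk read and write rules above can be summarized in a short Python sketch. This is an illustration only, not ECS code: the names `DiskHealth`, `can_write`, `can_read`, and `needs_rewrite` are hypothetical, and the sketch assumes that "two of an object's chunks" means two or more.

```python
from enum import Enum
from typing import Iterable


class DiskHealth(Enum):
    GOOD = "Good"
    SUSPECT = "Suspect"
    BAD = "Bad"


def can_write(health: DiskHealth) -> bool:
    # Writes go only to disks in good health.
    return health is DiskHealth.GOOD


def can_read(health: DiskHealth) -> bool:
    # Reads are served from good and suspect disks, never from bad disks.
    return health in (DiskHealth.GOOD, DiskHealth.SUSPECT)


def needs_rewrite(chunk_disk_health: Iterable[DiskHealth]) -> bool:
    # Assumption: two or more of an object's chunks on suspect disks
    # triggers writing those chunks to other nodes.
    suspect = sum(1 for h in chunk_disk_health if h is DiskHealth.SUSPECT)
    return suspect >= 2
```

For example, `needs_rewrite([DiskHealth.SUSPECT, DiskHealth.SUSPECT, DiskHealth.GOOD])` returns `True`, while a single suspect disk does not trigger a rewrite in this sketch.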
Node health
ECS reports node health as Good, Suspect, or Bad.
- Good: The node is available and responding to I/O requests in a timely manner.
- Suspect: The node has been unavailable for more than 30 minutes.
- Bad: The node has been unavailable for more than an hour.
ECS writes to reachable nodes regardless of the node health state. When two of an object’s chunks are located on suspect nodes, ECS writes two new chunks of the object to other nodes.
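The node health thresholds above can be expressed as a small classification function. This is a minimal sketch, not ECS code: `classify_node` and the two constants are hypothetical names. The text does not say how a node that has been unavailable for 30 minutes or less is reported, so the sketch assumes it is still Good.

```python
from datetime import timedelta

# Thresholds taken from the node health definitions above.
SUSPECT_AFTER = timedelta(minutes=30)
BAD_AFTER = timedelta(hours=1)


def classify_node(unavailable_for: timedelta) -> str:
    # More than an hour of unavailability: Bad.
    if unavailable_for > BAD_AFTER:
        return "Bad"
    # More than 30 minutes of unavailability: Suspect.
    if unavailable_for > SUSPECT_AFTER:
        return "Suspect"
    # Assumption: anything at or below the 30-minute threshold is Good.
    return "Good"
```

Calling `classify_node(timedelta(minutes=45))` returns `"Suspect"`, and `classify_node(timedelta(minutes=90))` returns `"Bad"`.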
Data recovery
When a node or drive in the site fails, the storage engine:
- Identifies the chunks or erasure coded fragments affected by the failure.
- Writes copies of the affected chunks or erasure coded fragments to good nodes and disks that do not currently have copies.
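The two recovery steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the storage engine's implementation: the `placement` map, the `(node, disk)` tuples, and both function names are hypothetical. The sketch also assumes one replacement location per lost copy and picks candidates in sorted order, neither of which the text specifies.

```python
def affected_chunks(placement, failed):
    # Step 1: identify chunks (or erasure coded fragments) that have
    # at least one copy on a failed (node, disk) location.
    # placement maps a chunk id to the set of locations holding it.
    return {cid for cid, locs in placement.items() if locs & failed}


def plan_recovery(placement, good_locations, failed):
    # Step 2: for each affected chunk, choose good locations that do not
    # already hold a copy, one per lost copy (an assumption of this sketch).
    plan = {}
    for cid in sorted(affected_chunks(placement, failed)):
        lost = placement[cid] & failed
        candidates = sorted(
            loc for loc in good_locations
            if loc not in placement[cid] and loc not in failed
        )
        plan[cid] = candidates[:len(lost)]
    return plan


# Hypothetical cluster state: one disk on node n1 has failed.
placement = {
    "c1": {("n1", "d1"), ("n2", "d1")},
    "c2": {("n2", "d2"), ("n3", "d1")},
}
failed = {("n1", "d1")}
good = {("n2", "d1"), ("n3", "d1"), ("n4", "d1")}
recovery_plan = plan_recovery(placement, good, failed)
```

Here only `c1` is affected, and the plan skips `("n2", "d1")` because that location already holds a copy of `c1`. A node failure can be modeled the same way by adding every disk of that node to `failed`.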