Introduction to ECS site outage and recovery

ECS is designed to provide protection when a site outage occurs due to a disaster or other problem that causes a site to go offline or become disconnected from the other sites in a geo-federated deployment.

Site outages can be classified as a temporary site outage (TSO) or a permanent site outage (PSO). A TSO is a failure of the WAN connection between two sites, or a temporary failure of an entire site (for example, a power failure). A site can be brought back online after a TSO. ECS can detect and automatically handle these types of temporary site failures.

A PSO is when an entire site becomes permanently unrecoverable, such as when a disaster occurs. In this case, the System Administrator must permanently fail over the site from the federation to initiate failover processing.

TSO and PSO behavior is described in the following topics:

NOTE: For more information about TSO and PSO behavior, see the ECS High Availability Design white paper.
NOTE: Do not delete buckets while a site is rejoining after a TSO.
NOTE: Best practices for administrators to consider while configuring network bandwidth for ECS replication:
  • For most scenarios, ECS replicates the same amount of data that it ingests. As a best practice, allocate replication bandwidth that is at least equal to the front-end ingest rate.
  • In full replication mode, the required bandwidth also depends on the number of VDCs in the full replication RG. For example, if a user has four VDCs in an RG, the required replication network bandwidth is three times the front-end ingest rate.
  • The network that is used for replication must remain stable under high utilization. Avoid or account for additional load, such as load from a firewall.
  • When a failure scenario such as a PSO or TSO occurs, or a VDC extend operation runs, a replication backlog can build up. Replication then needs additional bandwidth to catch up and clear the backlog. Administrators must account for these situations to avoid network saturation or, in the worst case, network failure. Best practices to consider:
    • Use a third-party QoS method to throttle the network bandwidth used by replication.
    • Discuss options with your Dell service provider to tune the system for your network situation.
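
The bandwidth guidance above can be sketched as a small calculation. This is a rough illustration, not an official Dell sizing formula. The function name and parameters are hypothetical, and the (number of VDCs minus one) multiplier for full replication is an assumption generalized from the four-VDC example. The result is a baseline only and does not include headroom for clearing a backlog after a PSO, TSO, or VDC extend operation.

```python
def min_replication_bandwidth(ingest_rate: float, vdc_count: int,
                              full_replication: bool = False) -> float:
    """Estimate baseline replication bandwidth, in the same unit as ingest_rate.

    Hypothetical helper for illustration only.
    """
    if ingest_rate < 0:
        raise ValueError("ingest_rate must be non-negative")
    if vdc_count < 2:
        raise ValueError("replication needs at least two VDCs")

    if full_replication:
        # Assumption: each VDC sends its ingested data to every other VDC
        # in the RG, so four VDCs give a multiplier of three.
        return ingest_rate * (vdc_count - 1)

    # Default guidance: replication bandwidth at least equal to ingest rate.
    return ingest_rate


if __name__ == "__main__":
    # Four VDCs in a full replication RG with a 2 Gb/s front-end ingest rate.
    print(min_replication_bandwidth(2.0, 4, full_replication=True))
    # Three VDCs without full replication, same ingest rate.
    print(min_replication_bandwidth(2.0, 3))
```

Treat the returned value as a lower bound, and add margin for backlog recovery and any QoS throttling that is applied to replication traffic.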

ECS recovery and data balancing behavior is described in these topics: