Before considering the complexity of data center design, it is necessary to consider the use of a resilient system with no single point of failure (SPOF). By definition, a single point of failure is a component whose failure makes the entire system inoperable. In other words, a single point of failure produces an overall failure. The cause may be a component failure or incorrect human intervention, such as switching without knowing how the system will react.
A 2N redundant system can be regarded as the minimum requirement for an installation free of SPOFs. For simplicity, assume that the 2N system of the data center consists of two identical electrical and mechanical systems, A and B. Fault tree analysis (FTA) will highlight the combinations of events that cause failure. However, it is very difficult to model human error in FTA: the data used to model it will always be subjective, and there are many variables.
If the two systems in this 2N example are physically separate, any operation on one should have no effect on the other. However, it is not uncommon for enhancements to be introduced. The design starts as a simple 2N redundant system and then gains other components, such as disaster recovery links and shared storage containers that connect the two systems.
In large-scale designs, the connection between the two systems becomes an automatic control system (such as SCADA or a BMS) rather than a simple mechanical interlock. The basic principle of the 2N redundant system has been compromised, and the complexity of the system has increased exponentially. So have the skills required of the operations team.
A design review may still show that 2N redundancy has been achieved, but the resulting complexity and operational challenges undermine the basic requirements of a high-availability design.
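The effect of coupling the two systems can be sketched numerically. The example below is a minimal illustration, not a substitute for a fault tree analysis: it assumes that A and B fail independently, that the shared element (a control system or link) sits in series with both, and it uses availability figures that are purely hypothetical.

```python
def parallel_availability(path_availability: float, paths: int = 2) -> float:
    """Availability of independent, identical paths where any one path suffices."""
    return 1.0 - (1.0 - path_availability) ** paths


def series_availability(*parts: float) -> float:
    """Availability of elements that must all work (each one is a SPOF)."""
    result = 1.0
    for part in parts:
        result *= part
    return result


# Hypothetical figures, chosen only for illustration.
PATH_AVAILABILITY = 0.999      # system A or system B on its own
SHARED_AVAILABILITY = 0.9995   # common control system or link joining A and B

# Physically separate 2N: the site is down only if A and B are both down.
pure_2n = parallel_availability(PATH_AVAILABILITY)

# "Enhanced" 2N: the shared element must work as well as at least one path.
coupled_2n = series_availability(SHARED_AVAILABILITY, pure_2n)

print(f"separate 2N: {pure_2n:.6f}")
print(f"coupled 2N:  {coupled_2n:.6f}")
```

Under these assumptions the coupled design can never be more available than the shared element itself, which is how a design can still pass a 2N review while behaving like a system with a single point of failure.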
Studies have shown that the particular sequence of events leading to a failure is usually unpredictable, and its consequences are not known until it happens. In other words, the sequence of events is unknown until it occurs. It therefore cannot be part of a fault tree analysis (FTA).
The Austrian physicist Ludwig Boltzmann developed an entropy equation that has since been applied in statistics, in particular to the problem of missing information. In this formulation, a coin is hidden in one box of a grid, such as a 4 x 2 or 5 x 4 grid. The theory gives the number of questions needed to determine which box on the grid contains the coin. If the boxes are replaced with system components and the coin with an unknown failure event, one can see how system availability is affected by complexity: the more components there are, the more information is missing. Conversely, the fewer unknown failure events there are, the fewer ways the system can fail. Increasing people's detailed knowledge of the system and discovering unknown events therefore reduces the number of failure combinations, and with it the risk.
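The coin-in-a-grid idea can be made concrete with a short sketch. It assumes the coin is equally likely to be in any box and that each question is a yes/no question that halves the remaining candidates; the function names are illustrative and not taken from any library.

```python
import math


def missing_information_bits(rows: int, cols: int) -> float:
    """Missing information, in bits, for a coin hidden in one of rows * cols boxes."""
    return math.log2(rows * cols)


def questions_needed(rows: int, cols: int) -> int:
    """Yes/no questions needed in the worst case to locate the coin.

    Equivalent to ceil(log2(rows * cols)), computed with integer arithmetic.
    """
    boxes = rows * cols
    return (boxes - 1).bit_length()


for rows, cols in [(4, 2), (5, 4)]:
    bits = missing_information_bits(rows, cols)
    questions = questions_needed(rows, cols)
    print(f"{rows} x {cols} grid: {bits:.2f} bits, up to {questions} questions")
```

Read with boxes as components, the sketch shows missing information growing with the number of places an unknown failure event could hide.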
Human factors
Research shows that any system with a human-machine interface will eventually fail because of vulnerabilities. A vulnerability is any weakness that may cause a failure in a data center facility. Data center vulnerabilities may relate to the infrastructure or to the operation of the facility. Infrastructure covers equipment and systems, in particular:
• Mechanical and electrical reliability.
• Facility design, redundancy and topology.
Operation involves human factors, including human error at the individual and management levels. It covers:
• Adaptability of the operations team.
• The team's reaction to vulnerabilities.
The more complex the system, the more vulnerable it is to human factors and the more training and learning the facility needs. Learning applies not only to individuals but also to organizations. Organizational learning is characterized by maturity and processes (shown in the chart as accumulated experience), in areas such as data center structure and resources, maintenance, change management, document management, commissioning, and operability and maintainability.
Individual learning is a function of knowledge, experience and attitude (shown in the chart as depth of experience). Developing an environment for organizational and individual learning helps reduce failure rates and gives operators expertise that effectively reduces energy waste.
Universal learning curve applied to data centers
It is important to understand that zero failures can never be achieved, because the relationship between failure and experience follows an exponential curve. Even data center operators with good knowledge and experience remain prone to complacency and to failures caused by previously unknown sequences of events.
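The shape of such a curve can be sketched as an exponential decay toward a non-zero floor. The functional form below is one common way to write an exponential learning curve, and every parameter value is a hypothetical placeholder rather than a measured data center figure.

```python
import math


def failure_rate(
    experience: float,
    initial_rate: float = 1.0,        # hypothetical rate with no experience
    minimum_rate: float = 0.05,       # hypothetical floor that is never removed
    learning_constant: float = 0.5,   # hypothetical speed of learning
) -> float:
    """Failure rate that decays exponentially with accumulated experience."""
    learnable = initial_rate - minimum_rate
    return minimum_rate + learnable * math.exp(-learning_constant * experience)


# The rate falls quickly at first, then flattens out above the floor.
for experience in (0, 1, 2, 5, 10):
    print(f"experience {experience:>2}: failure rate {failure_rate(experience):.4f}")
```

In this model, additional experience always helps but yields diminishing returns, and the rate only approaches the floor, which matches the point that zero failures cannot be reached.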
Conclusion
Providing a learning environment that improves organizational and individual knowledge reduces the risk to the data center. Although experienced operators can reduce failure rates, overly complex designs can still fail if they are implemented without adequate training.