When a data center network grows very large, more network equipment has to be added and connected in multiple tiers. Today's data centers tend to use a tree topology: a few high-capacity devices sit at the core, and several layers of devices hang beneath them (more than one layer may be needed because a single device does not have enough ports). With dozens or even hundreds of network devices cascaded together, quickly finding the faulty device when something breaks is a problem that plagues many network operations staff.
The network equipment in a data center is redundant. As long as the faulty device is found when the network fails, service can be restored by isolating it, and the root cause can then be investigated without time pressure. Network failures usually surface first as fault reports from the application side, and only then does troubleshooting begin. At that point the application staff typically describe only a symptom, such as an application that cannot be reached. They do not say which specific addresses cannot reach which other addresses, and sometimes the information they give is wrong, which greatly delays locating the problem. Locating the problem takes up most of the time in the troubleshooting process, so what can be done? How can a data center network be troubleshot quickly? This article sets out an answer.
If a network failure is only analyzed once the application side reports symptoms, it is already too late, and the application staff's description can easily lead the investigation astray. Some of the symptoms they report are only what they happen to see themselves. Such a symptom is probably local and does not reflect the state of the whole network. Network staff should therefore rely on their own network monitoring to detect problems, so that they can quickly find the faulty device and then isolate or troubleshoot it.
Early network monitoring mainly watched device logs and port traffic. Often this information was not enough, and problems could not be detected in time. Many network equipment vendors say their device logs are complete, but extreme cases or software bugs can still cause a failure with no log output at all, so the fault has to be located from the traffic itself. The network staff then have to ask the application staff about the symptoms, identify on the spot some IP addresses that are losing packets or cannot get through, and run traffic statistics in the network: flow counters are enabled on every device the faulty traffic passes through in order to find the faulty device. Because the network is a tree with many devices at each layer, the amount of traffic statistics work is considerable. Not every device supports traffic statistics on every traffic characteristic, and the devices that lack support leave gaps in the statistics, which makes the faulty device harder to find. Network operations staff have coped this way for years.
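The hop-by-hop counting described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `HopCounter` record, the `locate_loss` function, and the device names are hypothetical, and in practice the counters would be read from each device's own flow statistics.

```python
from typing import List, NamedTuple, Optional


class HopCounter(NamedTuple):
    """Packets of one suspect flow counted on one device (hypothetical record)."""
    device: str
    packets_in: int
    packets_out: int


def locate_loss(path: List[HopCounter], tolerance: int = 0) -> Optional[str]:
    """Walk the path in forwarding order and return where packets first go missing."""
    previous = None
    for hop in path:
        # Packets left the previous device but never arrived here: suspect the link.
        if previous is not None and previous.packets_out - hop.packets_in > tolerance:
            return f"link {previous.device} -> {hop.device}"
        # Packets arrived but did not leave: suspect the device itself.
        if hop.packets_in - hop.packets_out > tolerance:
            return f"device {hop.device}"
        previous = hop
    return None  # no loss seen on the devices that could be counted


path = [
    HopCounter("core-1", 1000, 1000),
    HopCounter("agg-3", 1000, 640),
    HopCounter("tor-17", 640, 640),
]
print(locate_loss(path))  # device agg-3
```

A device that cannot count the flow simply has no entry in the path, which is exactly the gap that makes this manual method slow on a large tree network.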
The older troubleshooting method clearly works, but it is inefficient, and the long time needed to locate a fault has a large impact on the business. Current network monitoring targets data flows instead: specific flows are monitored in the network, so that the fault location can be found as soon as a flow is interrupted. Several emerging network monitoring methods, also known as network visualization technologies, are described here as the most effective means of rapid troubleshooting.
The first is INT (In-band Network Telemetry), which monitors network state by collecting and reporting it in the data plane. When a data packet enters the first network device, that device samples and mirrors packets from the service flow according to its configured sampling settings, encapsulates an INT header onto the mirrored packet, and fills the information to be collected about the switch into the INT data section. Every network device along the path processes the packet in the same way, until the last network device, the one connected to the server, strips the INT header.
After the packet has passed through each device, the collected INT data is sent over gRPC to a remote monitoring server, where it is parsed and presented. The forwarding delay, device information, congestion state, and other details carried in the INT data can all be shown on the monitoring server. As soon as packets are lost or fail to get through, the monitoring server notices, and it can determine the scope of the problem and the faulty device within a few seconds.
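As a rough illustration of how a monitoring server might use this per-hop data, the Python sketch below compares the reports received for one flow against the path the flow is expected to take. The `HopReport` fields, the `analyze_flow` function, the device names, and the 1 ms delay threshold are all assumptions made for this example. Real INT metadata layouts are vendor-specific, and the gRPC transport is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HopReport:
    """Per-hop telemetry for one sampled packet (hypothetical field set)."""
    device: str
    ingress_ns: int   # timestamp when the packet entered the device
    egress_ns: int    # timestamp when the packet left the device
    queue_depth: int  # egress queue occupancy seen by the packet


def analyze_flow(expected_path: List[str], reports: List[HopReport],
                 max_hop_delay_ns: int = 1_000_000) -> List[str]:
    """Return findings: slow hops, and the first hop that never reported."""
    seen = {r.device: r for r in reports}
    findings = []
    for i, device in enumerate(expected_path):
        report = seen.get(device)
        if report is None:
            # The packet was not seen here: the fault lies at or just before this hop.
            upstream = expected_path[i - 1] if i > 0 else "the source"
            findings.append(
                f"no report from {device}: suspect {device} or the link from {upstream}")
            break
        delay = report.egress_ns - report.ingress_ns
        if delay > max_hop_delay_ns:
            findings.append(
                f"{device}: forwarding delay {delay} ns, queue depth {report.queue_depth}")
    return findings


expected = ["leaf-1", "spine-2", "leaf-7"]
received = [
    HopReport("leaf-1", 0, 20_000, 3),
    HopReport("spine-2", 50_000, 2_050_000, 900),
]
for line in analyze_flow(expected, received):
    print(line)
```

With this sample data the sketch flags `spine-2` as congested and reports that `leaf-7` never saw the packet, which narrows the fault to `leaf-7` or the link feeding it.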
The second is ERSPAN (Encapsulated Remote Switched Port Analyzer), a remote traffic-mirroring technology that can carry mirrored traffic across Layer 3 IP networks. ERSPAN packets are GRE-encapsulated and can be forwarded to any destination reachable by IP routing. ERSPAN copies the packets seen on a source port and sends the copies through GRE (Generic Routing Encapsulation) to a collection server for parsing. Because the physical location of the collection server is not restricted, the key traffic of the whole network can be forwarded to the monitoring server through ERSPAN, and it is then clear at a glance which part of the network has a traffic problem.
The third is sFlow and NetStream, both of which are data sampling techniques. NetStream collects relatively complete data but needs dedicated hardware. With sFlow and NetStream deployed in the network, the monitoring data can be sent over gRPC to the monitoring server, which computes and organizes it and displays the results graphically. As soon as part of the network has a problem, it shows up on the monitoring server.
sFlow and NetStream collect the main characteristics of the packet header rather than the full contents of the packet. This is quite different from INT and ERSPAN, but it is sufficient for troubleshooting most network faults. There is no harm in deploying all three monitoring schemes in the same network, so that when a failure occurs the data collected from several angles can be analyzed together.
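The idea of header-only sampling can be shown with a small Python sketch. It keeps one packet in every `rate` packets, records only a few header fields, and scales the sample counts back up to estimate per-flow traffic. The function names, the header fields chosen, and the deterministic every-Nth-packet selection are assumptions for illustration. Real samplers typically select packets randomly, and this is not the sFlow or NetStream export format.

```python
from collections import Counter
from typing import Dict, List, Tuple

FlowKey = Tuple[str, str, int]  # (source IP, destination IP, destination port)


def sample_headers(packets: List[dict], rate: int) -> List[FlowKey]:
    """Keep every rate-th packet and record only its header fields, not its payload."""
    return [
        (p["src"], p["dst"], p["dport"])
        for i, p in enumerate(packets)
        if i % rate == 0
    ]


def estimate_flow_packets(samples: List[FlowKey], rate: int) -> Dict[FlowKey, int]:
    """Scale the sampled counts by the sampling rate to estimate real packet counts."""
    return {flow: count * rate for flow, count in Counter(samples).items()}


# 90 packets of one flow followed by 10 packets of another, sampled 1-in-10.
packets = (
    [{"src": "10.0.0.1", "dst": "10.0.1.5", "dport": 443, "payload": b"..."}] * 90
    + [{"src": "10.0.0.2", "dst": "10.0.1.9", "dport": 53, "payload": b"..."}] * 10
)
samples = sample_headers(packets, rate=10)
print(estimate_flow_packets(samples, rate=10))
```

The estimate is only statistical, so a flow that sends very few packets may be missed entirely. That is the trade-off against full mirroring with ERSPAN.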
It is also important to send the collected data to the monitoring server over the management network wherever possible. Otherwise, if the data network itself has a problem, the monitoring data may not reach the monitoring server. In most cases a data network failure rarely affects the management network, and all devices remain reachable as usual. If, during a failure, a device cannot be reached over the management network, it can basically be concluded that this device is the fault point.
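The management-network check can be automated with a simple reachability probe. The Python sketch below is a minimal example under assumptions: the device names and addresses are placeholders from a documentation range, TCP port 22 is assumed to be the management service, and `tcp_probe` and `unreachable_devices` are names invented here.

```python
import socket
from typing import Callable, Dict, List


def tcp_probe(address: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the management address succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_devices(mgmt_addresses: Dict[str, str],
                        probe: Callable[[str], bool] = tcp_probe) -> List[str]:
    """List devices whose management address does not answer the probe."""
    return [name for name, address in mgmt_addresses.items() if not probe(address)]


# Hypothetical inventory: device name -> management IP address.
inventory = {
    "core-1": "192.0.2.11",
    "agg-3": "192.0.2.12",
    "tor-17": "192.0.2.13",
}
# During an incident, the devices listed here are the first fault candidates:
# suspects = unreachable_devices(inventory)
```

The probe is passed in as a parameter so that a different check, such as an ICMP ping or an SNMP query, can be substituted without changing the rest of the logic.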
With the monitoring methods above, it is not difficult to find a fault as soon as it occurs, and the process can be fully automated: when a fault is detected, the monitoring server automatically issues an isolation command to isolate the faulty device and restore service. The fault can thus be located before the application side reports a failure, the faulty device can be isolated in time, and service can be restored. This greatly shortens fault analysis and keeps the impact on the business small, to the point where the business side may not notice the fault at all. The real-world effectiveness of monitoring technologies such as INT and ERSPAN is still unknown. They have been discussed a great deal recently and still need to be tested in practice. sFlow and NetStream are relatively mature technologies, but they are not yet widely used for network troubleshooting, so they need to be promoted.
With these monitoring technologies, network failures can be resolved quickly, which greatly improves operations and maintenance efficiency and matters a great deal to the running of a data center.