Some thoughts on high availability

If you repost this article, please credit the original source. Thank you!

In the Internet era, services are hit by high concurrency and other shocks, and they still have to stay highly available. If a service is not highly available, it means:

  • The system cannot serve users 24/7, so the user experience is terrible. Users may not come back, and you cannot retain them.
  • Downtime damages the company's image. For companies like BAT (Baidu, Alibaba, Tencent), technical strength is part of the brand.
  • Most importantly, downtime is a direct loss of money, counted by the second! Remember the Ctrip outage on May 28, 2015: based on Ctrip's first-quarter earnings data, the downtime cost an average of about $1,064,800 per hour.

High availability is a very complex topic. My own ability is limited and I cannot cover everything, so these are just some of my thoughts on it and my understanding of it.

So how do you make your system highly available?

We cannot guarantee that a server never goes down or that a service never crashes. So how do we keep these inevitable failures from becoming a problem? In other words, machines may go down and services may fail, so how can the system keep serving anyway?

First of all, if there are many machines and many instances of a service, then losing some of them is not a problem, and the situation is easily handled. Let's analyze it step by step. If a machine stores specific values of its own, it cannot be scaled out easily, and when that machine goes down another machine cannot simply take over. Machines with identical configuration are easy to replace. Application services are similar: a service should not keep state on any particular machine, and should not hold data in its own internal storage that makes one instance different from another. Otherwise it cannot be scaled easily. Only when every instance is the same can we replace instances and scale out freely. This is what is called a stateless service.
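
To make the idea concrete, here is a minimal sketch of a stateless service. Everything in it is an assumption for illustration: `SharedStore` is a plain dictionary standing in for an external store (a database or a cache cluster), and `CartService` is a hypothetical business service. The point is only that no instance keeps user data of its own, so any instance can take over from any other.

```python
class SharedStore:
    """Stand-in for an external store such as a database or cache (assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class CartService:
    """Hypothetical stateless service: all user data lives in the shared store."""

    def __init__(self, store):
        self._store = store  # the only thing an instance holds is a reference

    def add_item(self, user_id, item):
        cart = list(self._store.get(user_id, []))
        cart.append(item)
        self._store.put(user_id, cart)
        return cart


store = SharedStore()
instance_a = CartService(store)
instance_b = CartService(store)  # identical to instance_a, freely replaceable

instance_a.add_item("u1", "book")
# Suppose instance_a goes down here; instance_b serves the same user.
cart = instance_b.add_item("u1", "pen")
```

If the cart were kept in a field of `instance_a` instead, the first item would be lost the moment that instance went down.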

Once the service is stateless, how does the system find out dynamically that an instance has gone down? Otherwise requests will keep going to the dead machine. And how do they get routed to a new machine? This is where service registration and discovery come in.
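
As a rough sketch of registration and discovery, assume each instance registers itself and then renews that registration periodically, and the registry drops any instance that has not renewed within a time limit. `ServiceRegistry`, its method names, the addresses, and the 3-second limit are all hypothetical. Real systems usually rely on a dedicated registry component rather than an in-process class like this one.

```python
import time


class ServiceRegistry:
    """Toy in-memory registry: instances that stop renewing drop out of lookups."""

    def __init__(self, ttl_seconds=3.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen = {}  # (service name, address) -> time of last renewal

    def register(self, service, address):
        self._last_seen[(service, address)] = self._clock()

    # Renewing a registration is the same operation as registering.
    renew = register

    def lookup(self, service):
        """Return the addresses that renewed within the last ttl_seconds."""
        now = self._clock()
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )


# A fake clock keeps the example deterministic.
fake_now = [0.0]
registry = ServiceRegistry(ttl_seconds=3.0, clock=lambda: fake_now[0])
registry.register("order", "10.0.0.1:8080")
registry.register("order", "10.0.0.2:8080")

fake_now[0] = 2.0
registry.renew("order", "10.0.0.1:8080")  # only the first instance renews

fake_now[0] = 4.0
alive = registry.lookup("order")  # the second instance has silently expired
```

Callers ask the registry for live addresses instead of hard-coding them, so a new machine becomes reachable as soon as it registers.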

With the above in place, the ordinary cases are basically covered. But the Internet is a complicated machine. So far we have only talked about broken machines and broken services. What if the cause is a brief network interruption?

So there should be heartbeat detection between services, checking regularly whether the other side can be reached. Whether the machine is broken, the service is down, or the network is blocked, the result is the same: it is unreachable. That case can be handled by service registration and discovery. But what about a momentary network glitch? For example, service A sends a request to service B and B receives it. At that moment the network drops. B finishes its processing, but A never gets a response, times out, and sends the request again. Is it a problem if B runs the same logic a second time? For example, if a payment of 200 yuan has already gone through, will another 200 yuan be charged? This is where idempotent design comes in: an operation is idempotent if executing it many times gives the same result as executing it once. With an idempotent design this situation is nothing to fear, and retrying when there is no response causes no problems.
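
One common way to get idempotency is to attach a unique request ID to each request and have the receiver remember which IDs it has already handled. The sketch below assumes that approach. `PaymentService`, the `request_id` parameter, and the in-memory dictionaries are hypothetical; a real service would keep the processed IDs in durable storage.

```python
class PaymentService:
    """Hypothetical service B: a retried request with the same ID is applied once."""

    def __init__(self):
        self.balances = {}
        self._processed = {}  # request_id -> result of the first execution

    def pay(self, request_id, account, amount):
        # A retry of a request we already handled: return the stored result
        # instead of charging again.
        if request_id in self._processed:
            return self._processed[request_id]

        self.balances[account] = self.balances.get(account, 0) - amount
        result = {"request_id": request_id, "status": "paid", "amount": amount}
        self._processed[request_id] = result
        return result


service = PaymentService()
service.balances["alice"] = 500

first = service.pay("req-1", "alice", 200)
# Service A saw a timeout and retries with the SAME request ID.
retry = service.pay("req-1", "alice", 200)
```

After both calls the balance is 300, not 100, and the retry gets the same answer the first call would have returned.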

Once these are in place, broken machines, crashed services, and network glitches are basically no longer a big problem. But the Internet means high concurrency. Under high concurrency, how do we increase the capacity of the system?

It is just like moving things: one person is slow, and more people working together get it done faster. Since the architecture above lets us add machines and services, the obvious idea is to add more of them, which is certainly faster than a single machine. Say there are 5 machines and a lot of requests arrive: what strategy should distribute them across the machines? It can be done with hardware devices or at the software level, but either way service registration and discovery is required, otherwise the balancer cannot learn about changes dynamically. This layer is also a place for access control, such as blacklists, whitelists, and access frequency limits. Adding machines may look low-tech, but it is often quite effective. Still, you cannot only add machines, and in some cases adding machines does not solve the problem.
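
Round-robin is one of the simplest distribution strategies, and it is used here only as an example; the text above does not prescribe any particular one. In this sketch `RoundRobinBalancer`, the machine names `m1` to `m5`, and the `block` method (a very small blacklist) are all made up for illustration.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in turn, skipping any that are on the blacklist."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._blacklist = set()
        self._counter = itertools.count()

    def block(self, server):
        """Stop sending traffic to a server, e.g. one reported as down."""
        self._blacklist.add(server)

    def pick(self):
        candidates = [s for s in self._servers if s not in self._blacklist]
        if not candidates:
            raise RuntimeError("no available server")
        return candidates[next(self._counter) % len(candidates)]


balancer = RoundRobinBalancer(["m1", "m2", "m3", "m4", "m5"])
picks = [balancer.pick() for _ in range(6)]  # cycles through all 5, then wraps

balancer.block("m2")
after_block = [balancer.pick() for _ in range(4)]  # m2 no longer appears
```

In practice the server list would come from service discovery rather than a fixed list, which is why the two pieces depend on each other.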

Even with plenty of machines, if a method inside the service blocks, adding more instances does not help. So we must pay attention to service timeouts. Because the service is idempotent, it does not matter if an execution times out and is retried, and the timeout keeps the calls behind it from being stuck for a long time (because a downstream service is down, a thread is deadlocked, a downstream service is busy, and so on).
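
A timeout can be sketched with Python's standard `concurrent.futures` module. The helper name `call_with_timeout`, the fallback value, and the two fake downstream functions are assumptions for illustration. Note that the abandoned call keeps running in its worker thread after the caller gives up, which is one more reason the downstream logic needs to be idempotent.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(func, timeout_seconds, fallback):
    """Run func in a worker thread; give up and return fallback if it is too slow."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The caller stops waiting; the worker thread itself is not killed.
        return fallback
    finally:
        executor.shutdown(wait=False)


def slow_downstream():
    time.sleep(0.5)  # simulates a busy or stuck downstream service
    return "real answer"


def fast_downstream():
    return "real answer"


slow_result = call_with_timeout(slow_downstream, 0.05, "fallback")
fast_result = call_with_timeout(fast_downstream, 1.0, "fallback")
```

The caller gets an answer within the time limit either way, so one stuck dependency does not hold up everything queued behind it.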

On synchronous versus asynchronous: some designs have to be asynchronous, and business scenarios that require strictly sequential execution have to be synchronous. Where synchrony is not essential, asynchronous processing gets more done in the same total time than synchronous processing. (A single request passes through more steps, such as middleware, so looked at individually it is not necessarily faster than a synchronous call. From a macro point of view, however, the number of concurrent requests the system can handle goes up a lot.) Briefly, within a single service, asynchrony means multithreading. Multithreading raises CPU utilization and system performance in many places, but its cost is much higher. Between different services, asynchrony means message-oriented middleware. Message middleware is hard to get right: first it has to be asynchronous, and second it has to guarantee that messages are neither duplicated nor lost. The second requirement is really difficult, especially with large volumes of data. Network I/O in particular needs an asynchronous model, though Netty already wraps this up well.
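
For the multithreading case inside one service, a small sketch with a thread pool shows the difference. The three step names and the 0.1 second delay are invented, and `fetch` just sleeps to imitate waiting on I/O. This only illustrates independent steps running concurrently; it says nothing about message middleware.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(name, delay=0.1):
    """Pretend to call something slow, such as another service or a database."""
    time.sleep(delay)
    return f"{name}-done"


steps = ["inventory", "pricing", "coupon"]  # independent of each other

# Synchronous: each step waits for the previous one to finish.
start = time.perf_counter()
sequential = [fetch(step) for step in steps]
sequential_elapsed = time.perf_counter() - start

# Asynchronous with threads: the three waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(steps)) as pool:
    parallel = list(pool.map(fetch, steps))
parallel_elapsed = time.perf_counter() - start
```

Both versions return the same results in the same order, but the threaded version takes roughly the time of one step instead of the sum of all three. This only works because the steps do not depend on each other; steps that must run in order stay synchronous.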

Every machine or service has an upper limit. If a flood of traffic arrives that is beyond its capacity, how should this be handled?

The same problem shows up everywhere in daily life, for example when everyone travels home or goes out on a holiday. Take a security checkpoint: the guard holds people at the gate, lets one batch through to be checked while those behind wait, and then lets the next batch in. If someone has higher priority, or their train is about to depart, they are let through first. In software architecture this is called rate limiting and service degradation. There are two control strategies: (1) reject part of the requests, and (2) shut down some services. Shutting down some services used to be the usual advice, but it is no longer recommended (after all, it also reflects on the company's technical strength). The current focus is on rejecting part of the requests. Where should this be added? At every point that needs to be controlled, which means at every layer.
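
Rejecting part of the requests can be sketched with a token bucket, which is one well-known rate limiting algorithm chosen here as an example; the text above does not name a specific one. `TokenBucket`, its parameters, and the numbers are illustrative assumptions.

```python
import time


class TokenBucket:
    """Allows short bursts up to `capacity`, then `rate_per_second` on average."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        """Return True if the request may pass, False if it should be rejected."""
        now = self._clock()
        # Refill according to the time elapsed, never beyond capacity.
        refill = (now - self._last) * self._rate
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# A fake clock keeps the example deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate_per_second=1, capacity=2, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(3)]  # third request exceeds the burst size
fake_now[0] = 1.0                           # one second later, one token is back
later = bucket.allow()
```

A rejected request can be answered right away with a "busy, try again" response, which is cheaper than letting it queue up and drag the whole service down.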

There is a saying in the industry that high concurrency and high availability have three magic weapons: rate limiting, degradation, and caching. Most people have probably worked with caches. Internet workloads are typically read-heavy and write-light, which makes them very well suited to caching.
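
Here is a minimal sketch of why caching fits read-heavy workloads, assuming a read-through cache with an expiry time and explicit invalidation on writes. `TtlCache`, the `database` dictionary, and the key name are hypothetical stand-ins. A production setup would more likely use a shared cache server than an in-process dictionary.

```python
import time


class TtlCache:
    """Reads go to the loader only on a miss or after the entry has expired."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # cache hit
        value = self._loader(key)  # cache miss: go to the slow source
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so the next read sees fresh data."""
        self._entries.pop(key, None)


database = {"item:1": "keyboard"}  # stand-in for the real data store
load_count = [0]


def load_from_database(key):
    load_count[0] += 1
    return database[key]


cache = TtlCache(load_from_database, ttl_seconds=60)
reads = [cache.get("item:1") for _ in range(3)]
loads_after_reads = load_count[0]  # many reads, a single database load

database["item:1"] = "mouse"       # a (rare) write
cache.invalidate("item:1")
fresh = cache.get("item:1")
```

The more reads there are per write, the more of them the cache absorbs, which is exactly the read-heavy pattern described above.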

Requests are not spread evenly across a service, whether it has been scaled out or not: some parts are called very heavily and others relatively rarely. So we keep dividing and splitting the system, so that the busy parts can be scaled up again on their own.

This is the idea of microservices. The first step is the vertical split, which is easy to understand. After the vertical split, a single business area may still contain a lot, so it has to be split horizontally as well. (All of this splitting is based on your own business, and the deeper your understanding of the business, the better.)

With all of the above, crashed services, broken machines, and blocked or flaky networks are all handled, concurrency is improved, and we have done our best to make the service highly available. But these changes bring a lot of new problems of their own, which also need to be solved:

  • With a single service, transaction control was easy. After the move to microservices, it becomes much harder and especially important. Sometimes strong consistency is out of reach, but eventual consistency is achievable.
  • Call chain monitoring is especially important, and so is alerting.
  • Distributed logging is also especially important.
  • Advanced tools such as jstack and BTrace are especially important in real environments.

That is about it for today. I hope it helps. If you are also learning and thinking about these topics, please follow, support, and give this post a like. Thank you!

