Million-Level WebSocket Long-Connection Practice at Shimo Docs



2022-01-15

Web server push technology evolved from short polling and long polling to the WebSocket specification introduced with the HTML5 standard, which has gradually become the mainstream solution in the industry. It makes message push, notifications, and the like extremely simple. So how do you run a WebSocket gateway at the million-connection level? This article is compiled from senior engineer Du Hao's technical talk on the reconstruction of the Shimo Docs WebSocket gateway.

1 Introduction

Some features of Shimo Docs, such as document sharing, comments, and slide presentations, require real-time data synchronization, so WebSocket was chosen for their development.

As the Shimo Docs business grew, the daily peak number of connections reached the million level. The growing number of user connections under the existing architecture caused a sharp increase in memory and CPU usage, so we decided to refactor the gateway.

2 Gateway 1.0

Gateway 1.0 was developed in Node.js on top of a modified Socket.IO, and it met the business requirements well at the user scale of the time.

2.1 Architecture

Gateway 1.0 architecture diagram (figure omitted).

Gateway 1.0 client connection process:

The user connects through Nginx to the gateway, and the business service perceives the new connection;

After perceiving the connection, the business service queries the relevant user data and publishes the message to Redis;

The gateway service receives the message via Redis Sub;

The gateway queries the user session data within the gateway cluster and pushes the message to the client.

2.2 Pain Points

Although the 1.0 gateway worked well in production, it could not support the coming business expansion, and several problems needed to be resolved:

Resource consumption: Nginx was used only to mount the TLS certificate and forward traffic, so most requests were pure pass-through, wasting a certain amount of resources; meanwhile, the earlier Node gateway performed poorly and consumed a great deal of CPU and memory.

Maintenance and observability: the gateway could not be connected to Shimo's monitoring system, so maintaining it with the existing monitoring and alerting was difficult;

Business coupling: business logic and gateway functionality were integrated in the same service, so the business-side performance loss could not be addressed in a targeted way. Resolving the performance issues and extending modules later required decoupling the services.

3 Gateway 2.0

Gateway 2.0 had several problems to solve. Shimo Docs contains many components: documents, spreadsheets, slides, forms, and so on. In version 1.0, component services could call the gateway through Redis, Kafka, or HTTP interfaces, so the sources of calls were unverifiable and hard to manage.

In addition, from the perspective of performance optimization, the original service needed to be decoupled, splitting the 1.0 gateway into a gateway part and a business part. The gateway part, WS-Gateway, integrates user authentication, TLS certificate verification, WebSocket connection management, and so on; the business part, WS-API, communicates with component services directly over gRPC.

Specific modules can then be scaled independently; the service rewrite plus the removal of Nginx significantly reduced overall hardware consumption; and the services were integrated into Shimo's monitoring system.

3.1 Overall Architecture

Gateway 2.0 architecture diagram (figure omitted).

Gateway 2.0 client connection process:

The client connects to the WS-Gateway service and establishes a WebSocket connection through the handshake process;

After the connection is established, WS-Gateway stores the session, caches the connection-mapping information in Redis, and pushes a client-online message to WS-API through Kafka;

WS-API receives client-online messages and client uplink messages through Kafka;

WS-API preprocesses and assembles the message, including fetching the necessary data from Redis and running the message-filtering logic, then publishes the message to Kafka;

WS-Gateway gets the messages to be returned by subscribing to Kafka and pushes them to the clients one by one.

3.2 Handshake Process

When the network is good, the client goes through steps 1 to 6 and enters WebSocket communication directly; if the network environment is poor, WebSocket communication degrades to HTTP mode, where the client pushes messages to the server via POST and reads returned data from the server via GET long polling. The client first initiates the handshake process to establish the connection:

The client sends a GET request to attempt to establish a connection;

The server returns the relevant connection data; the sid is the unique socket id generated for this connection and serves as the credential for subsequent interactions. The client then repeats the request carrying the sid parameter;

The server returns 40, indicating the request succeeded;

The client sends a POST request to confirm the later downgrade path;

The server returns ok, and the first phase of the handshake is complete;

The client then attempts to initiate a WebSocket connection, first performing a 2probe/3probe request-response to confirm the communication channel is unobstructed, after which normal WebSocket communication can proceed.
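As a minimal sketch of the upgrade step above: the 2probe/3probe exchange follows the Engine.IO convention (packet type 2 is ping, 3 is pong, 5 is upgrade). The function below is illustrative, not Shimo's actual code; it only models the server's replies during the transport switch.

```go
package main

import "fmt"

// HandleUpgradeProbe models the server side of the probe exchange: before
// switching from HTTP long polling to WebSocket, the client sends "2probe"
// on the new WebSocket, and the server must answer "3probe" to confirm the
// channel works. When the client later sends the upgrade packet ("5"), the
// server can drop the polling transport. This is a sketch, not a full
// Engine.IO implementation.
func HandleUpgradeProbe(msg string) (reply string, upgraded bool) {
	switch msg {
	case "2probe":
		// Channel test: answer the probe ping with a probe pong.
		return "3probe", false
	case "5":
		// Upgrade packet: the client commits to WebSocket.
		return "", true
	default:
		return "", false
	}
}

func main() {
	reply, _ := HandleUpgradeProbe("2probe")
	fmt.Println(reply)
}
```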

3.3 TLS Memory Consumption Optimization

Clients connect to the server over the WSS protocol. In version 1.0 the TLS certificate was mounted on Nginx, and the HTTPS handshake was completed by Nginx. To reduce Nginx machine costs, in version 2.0 we mounted the certificate on the service itself. Analyzing the service memory, as shown below, the memory consumed during the TLS handshake accounts for about 30% of total memory consumption.

This part of the memory consumption cannot be avoided, so we had two options:

Use layer-7 load balancing and mount the TLS certificate on the layer-7 load balancer, handing the TLS handshake over to a better-performing tool;

Optimize the performance of Go's TLS handshake. From exchanges with industry peer Cao Chunhui (Cao Da), we learned he had recently submitted a PR to the official Go repository, along with the related performance test data.
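To illustrate mounting the certificate on the service itself (rather than on Nginx), here is a hedged Go sketch. In production the key pair would be loaded from real certificate files with tls.LoadX509KeyPair; the self-signed certificate and the host name below are stand-ins that only make the example self-contained.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// SelfSignedTLSConfig builds a tls.Config with the certificate mounted
// directly on the service, as Gateway 2.0 does after removing Nginx.
// The self-signed certificate exists only to make the sketch runnable.
func SelfSignedTLSConfig(host string) (*tls.Config, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: host},
		DNSNames:     []string{host},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, err
	}
	// The service terminates the TLS handshake itself; this handshake work
	// is the ~30% memory cost measured in the text.
	return &tls.Config{Certificates: []tls.Certificate{cert}, MinVersion: tls.VersionTLS12}, nil
}

func main() {
	cfg, err := SelfSignedTLSConfig("ws.example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(cfg.Certificates))
}
```

The resulting config would be handed to an http.Server (TLSConfig field) or a raw tls.Listener, so no Nginx hop is needed in front of the gateway.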

3.4 Socket ID Design

A unique code must be generated for each connection; a repeated code would cause sessions to be crossed and messages to be misdelivered. The Snowflake algorithm was selected to generate the unique codes.

In the physical-machine scenario, the physical machine hosting each replica is fixed, so the Socket IDs generated by the service on each replica are guaranteed to be unique.

In the K8s scenario this scheme is not feasible, so registration and numbering are used instead: when any WS-Gateway replica starts, it writes a record to the database and obtains a replica number, which is used as the worker ID for Snowflake-based Socket ID generation. A restarted service inherits its previous replica number, while a newly added replica is issued a new number from the auto-incrementing ID. Meanwhile, each WS-Gateway replica writes heartbeat information to the database, serving as the health-check basis for the gateway service itself.
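A minimal Snowflake-style generator along these lines might look as follows. The replica number obtained from the registration table is used as the worker ID, so IDs stay unique across WS-Gateway replicas. The bit layout (41-bit timestamp, 10-bit replica, 12-bit sequence) is the classic Snowflake split; Shimo's exact layout is not given in the article.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// SocketIDGen issues Snowflake-style socket IDs. The replica field comes
// from the database registration described in the text (0..1023 here).
type SocketIDGen struct {
	mu      sync.Mutex
	replica int64 // replica number from the registration table
	lastMs  int64 // last millisecond an ID was issued in
	seq     int64 // per-millisecond sequence counter
}

func NewSocketIDGen(replica int64) *SocketIDGen {
	return &SocketIDGen{replica: replica}
}

// Next returns a 63-bit ID: timestamp(ms) << 22 | replica << 12 | sequence.
func (g *SocketIDGen) Next() int64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	now := time.Now().UnixMilli()
	if now == g.lastMs {
		g.seq = (g.seq + 1) & 0xFFF // 12-bit sequence within one millisecond
		if g.seq == 0 {
			// Sequence exhausted: spin until the next millisecond.
			for now <= g.lastMs {
				now = time.Now().UnixMilli()
			}
		}
	} else {
		g.seq = 0
	}
	g.lastMs = now
	return now<<22 | g.replica<<12 | g.seq
}

func main() {
	g := NewSocketIDGen(3)
	a, b := g.Next(), g.Next()
	fmt.Println(a != b) // consecutive IDs are always distinct
}
```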

3.5 Cluster Session Management Plan: Event Broadcast

After the client completes the handshake process, the session data is stored in the memory of the current gateway node, and part of it is serialized and stored in Redis; the storage structure is as follows (figure omitted).

When a message push is triggered by a client or by a component service, the WS-API service queries the structure stored in Redis to obtain the Socket IDs of the target clients for the message body, and the WS-Gateway service then consumes the message. If a Socket ID is not on the current node, the node-to-session mapping must be queried to find the WS-Gateway node actually holding that Socket ID. There are usually two options:

After deciding to use event broadcast for messaging between gateway nodes, we further chose which specific message middleware to use, with three candidates:

So we benchmarked Redis and the other MQ middleware with 1,000,000 enqueue and dequeue operations, and found that Redis performs very well when the payload is under 10K. Combined with the actual situation, namely that the broadcast payload is about 1K, the business scenario is simple and fixed, and historical business logic must remain compatible, Redis was finally selected for the message broadcast.

WS-API and WS-Gateway can also be interconnected pairwise, using bidirectional gRPC streams to save intranet traffic.
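The event-broadcast decision above can be sketched as follows. In production the fan-out is carried by Redis Pub/Sub across WS-Gateway nodes; the in-process Broadcaster below is a stand-in for Redis, so the "only the node owning the session pushes to its client" logic can be shown self-contained. All type and field names are invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// PushEvent is the broadcast payload: which socket the message targets,
// plus the body. In production this would be published on a Redis Pub/Sub
// channel that every WS-Gateway node subscribes to.
type PushEvent struct {
	SocketID int64
	Body     []byte
}

// GatewayNode holds the sessions connected to one WS-Gateway replica.
type GatewayNode struct {
	name      string
	sessions  map[int64]bool // socket IDs connected to this node
	Delivered []PushEvent    // messages this node actually pushed
}

// Broadcaster fans events out to every node, as Redis Pub/Sub would.
type Broadcaster struct {
	mu    sync.Mutex
	nodes []*GatewayNode
}

func (b *Broadcaster) Register(n *GatewayNode) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.nodes = append(b.nodes, n)
}

// Publish delivers the event to all nodes; only the node that owns the
// target session pushes it on to the client.
func (b *Broadcaster) Publish(ev PushEvent) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, n := range b.nodes {
		if n.sessions[ev.SocketID] {
			n.Delivered = append(n.Delivered, ev)
		}
	}
}

func main() {
	a := &GatewayNode{name: "gw-a", sessions: map[int64]bool{1: true}}
	b := &GatewayNode{name: "gw-b", sessions: map[int64]bool{2: true}}
	br := &Broadcaster{}
	br.Register(a)
	br.Register(b)
	br.Publish(PushEvent{SocketID: 2, Body: []byte("hello")})
	fmt.Println(len(a.Delivered), len(b.Delivered))
}
```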

3.6 Heartbeat Mechanism

With the session stored in node memory and in Redis, the client must keep refreshing the session timestamp through heartbeats. The client sends heartbeats at the interval issued by the server; the timestamp is first updated in memory, and then synchronized to Redis on a separate, longer cycle, so that a large number of clients heartbeating at the same time do not put pressure on Redis.

After the client establishes the WebSocket connection, the server issues the heartbeat-reporting parameters;

The client reports heartbeats according to those parameters, and the server updates the session timestamp on receiving each heartbeat;

Any other uplink data from the client also triggers an update of the corresponding session timestamp;

The server periodically cleans up timed-out sessions and performs the active shutdown process;

The relationships among WebSocket connections, users, and files are cleaned up based on the timestamp data updated in Redis. Session-data memory and Redis cache cleanup logic (figure omitted).

On top of the existing two-level cache refresh mechanism, the server-side pressure generated by heartbeats can be reduced further. The client reports heartbeats at a fixed interval: with a heartbeat interval of x seconds and N online connections, the heartbeat QPS is QPS = N / x.

From the perspective of server performance optimization, a dynamic interval is used: under normal conditions, after every x successful heartbeat reports the heartbeat interval is increased by a, up to an upper limit of y, so the minimum dynamic QPS is QPS_min = N / y.

In the limiting case, the QPS generated by heartbeats is thus reduced to N / y. After a single heartbeat timeout, the server immediately resets the interval to 1s to re-probe the connection. This strategy reduces the heartbeat-induced performance load on the server while ensuring connection quality.
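The dynamic-interval strategy above can be captured in one function, using the variables from the text (growth step a, cap y, growth after every x successful heartbeats); the concrete numbers in the example are assumptions for illustration, not Shimo's production values.

```go
package main

import "fmt"

// NextHeartbeatInterval returns the next heartbeat interval in seconds.
// After every x consecutive successful heartbeats the interval grows by a,
// capped at y; any timeout resets it to a 1s probe interval, as described
// in the text.
func NextHeartbeatInterval(current, a, y, okStreak, x int, timedOut bool) int {
	if timedOut {
		return 1 // re-probe quickly after a missed heartbeat
	}
	if okStreak > 0 && okStreak%x == 0 {
		next := current + a
		if next > y {
			return y
		}
		return next
	}
	return current
}

func main() {
	iv := 10 // assumed initial server-issued interval, in seconds
	for streak := 1; streak <= 9; streak++ {
		iv = NextHeartbeatInterval(iv, 5, 30, streak, 3, false)
	}
	fmt.Println(iv) // grows 10 -> 15 -> 20 -> 25 after streaks 3, 6, 9
	fmt.Println(NextHeartbeatInterval(iv, 5, 30, 4, 3, true))
}
```

With N online connections, letting the interval grow toward the cap y drives the heartbeat QPS down toward the N / y minimum given above.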

3.7 Custom Headers

The purpose of using Kafka custom headers is to avoid the performance loss of decoding the message body at the gateway layer. After the client's WebSocket connection succeeds, a series of business operations follow; we put the operation command and the necessary parameters into the Kafka headers, for example for a broadcast: read the file number and push to all users within that file.

The trace id and timestamp are also written into the Kafka headers, which makes it possible to trace the full consumption link of a message and the time spent at each stage.
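A sketch of the idea: Kafka clients such as sarama or kafka-go attach key/value headers to each record; here a local Header struct mirrors that shape so the example stays self-contained, and the header names (cmd, file_guid, trace_id, ts_ms) are invented for illustration. The point is that the gateway routes and traces on headers alone and never deserializes the payload.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// Header mirrors the key/value record headers that Kafka clients expose.
type Header struct{ Key, Value string }

// Record pairs the headers with an opaque payload that the gateway never
// decodes for routing purposes.
type Record struct {
	Headers []Header
	Payload []byte
}

// NewBroadcastRecord packs the operation command, file GUID, trace id and
// timestamp into headers, as described in the text.
func NewBroadcastRecord(fileGUID, traceID string, payload []byte) Record {
	return Record{
		Headers: []Header{
			{"cmd", "broadcast"},
			{"file_guid", fileGUID},
			{"trace_id", traceID},
			{"ts_ms", strconv.FormatInt(time.Now().UnixMilli(), 10)},
		},
		Payload: payload,
	}
}

// HeaderValue is what a consumer uses to route on cmd without touching
// the payload.
func HeaderValue(r Record, key string) (string, bool) {
	for _, h := range r.Headers {
		if h.Key == key {
			return h.Value, true
		}
	}
	return "", false
}

func main() {
	r := NewBroadcastRecord("f-123", "t-456", []byte("opaque body"))
	cmd, _ := HeaderValue(r, "cmd")
	fmt.Println(cmd)
}
```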

3.8 Message Receiving and Sending

The first version of the client-server message interaction was similar to the above (demo code omitted). Each WebSocket connection occupied 3 goroutines; every goroutine needs its own stack memory, so single-machine capacity was very limited, constrained mainly by the large memory footprint. Moreover, most goroutines sat idle most of the time, so we considered whether the interaction could be completed with only 2 goroutines.

To cut one goroutine: reading data from the buffer in polling mode could introduce read delays or lock contention, so it was changed to an active call, rather than keeping a dedicated goroutine continuously listening, which reduces memory consumption.

Event-driven, lightweight, high-performance network libraries such as gev and gnet were also evaluated, but testing revealed possible message delays in scenarios with a large number of connections, so they are not used in the production environment.

3.9 Core Object Cache

With the data receiving and sending logic settled, the core object of the gateway part is the Connection object, around which functions such as connection numbering are developed. sync.Pool is used to cache Connection objects and reduce GC pressure: when a connection is created, a Connection object is obtained from the object resource pool; when its life cycle ends, the object is reset and put back into the pool. In actual coding, it is recommended to encapsulate the acquisition, data initialization, object reset, and so on.
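The pooling pattern just described might look as follows. The Connection struct here is deliberately trimmed down (the real one would hold the socket, buffers, user data, etc.); field and function names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// Connection is a trimmed-down stand-in for the gateway's core object.
type Connection struct {
	SocketID int64
	UserID   string
	sendBuf  []byte
}

// Reset clears state before the object returns to the pool, so a recycled
// Connection can never leak the previous user's data.
func (c *Connection) Reset() {
	c.SocketID = 0
	c.UserID = ""
	c.sendBuf = c.sendBuf[:0] // keep capacity, drop contents
}

var connPool = sync.Pool{
	New: func() any { return &Connection{sendBuf: make([]byte, 0, 4096)} },
}

// AcquireConnection gets a pooled object on connect and initializes it.
func AcquireConnection(socketID int64, userID string) *Connection {
	c := connPool.Get().(*Connection)
	c.SocketID, c.UserID = socketID, userID
	return c
}

// ReleaseConnection resets the object on disconnect and returns it to the
// pool, so the GC sees far fewer short-lived allocations.
func ReleaseConnection(c *Connection) {
	c.Reset()
	connPool.Put(c)
}

func main() {
	c := AcquireConnection(42, "user-a")
	fmt.Println(c.SocketID, c.UserID)
	ReleaseConnection(c)
}
```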

3.10 Optimization of Data Transmission Process

During message flow, the transmission efficiency of the message body must be considered. MessagePack is used to serialize messages and compress their size, and the MTU value is tuned to avoid fragmentation: with A defined as the probe packet size, the target service IP is probed with packets of size A.

In the test, the actual transmitted packet was 1428 bytes, where the extra 28 bytes consist of 8 (ICMP echo request/reply header) plus 20 (IP header), leaving 1400 bytes of payload.

If A is set too large, the probe response times out; and when the actual packet size in the environment exceeds this value, fragmentation occurs.

After finding the appropriate MTU value by probing, the message body is serialized with MessagePack, further compressing the packet size and reducing CPU consumption.
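The packet-size bookkeeping above reduces to simple arithmetic, shown here with the 20-byte IPv4 header and 8-byte ICMP echo header from the text:

```go
package main

import "fmt"

// MaxProbePayload gives the largest ping payload A that still fits in one
// IP packet for a given path MTU: the on-wire packet is A plus 8 bytes of
// ICMP echo header plus 20 bytes of IPv4 header. Keeping the serialized
// MessagePack body under this bound avoids IP fragmentation.
func MaxProbePayload(pathMTU int) int {
	const ipHeader, icmpHeader = 20, 8
	return pathMTU - ipHeader - icmpHeader
}

// OnWireSize is the inverse: probe payload A back to the transmitted size,
// matching the 1428 = 1400 + 28 figure in the text.
func OnWireSize(payload int) int { return payload + 20 + 8 }

func main() {
	fmt.Println(OnWireSize(1400)) // 1400-byte probe payload -> 1428 on the wire
}
```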

3.11 Infrastructure Support

Services are developed on the Ego framework, which provides business log printing, asynchronous log output, dynamic log-level adjustment, and other features that ease online troubleshooting and improve logging efficiency, along with microservice monitoring of CPU, P99 latency, memory, goroutine count, and so on.

Client-side Redis monitoring (figure omitted).

Client-side Kafka monitoring (figure omitted).

Custom monitoring dashboard (figure omitted).

4 Performance Pressure Test

4.1 Pressure Test Preparation

A virtual machine with 4 cores and 8 GB of memory was selected as the server, with a target load of 480,000 connections;

Eight virtual machines with 4 cores and 8 GB of memory were selected as clients, each opening 60,000 ports.

4.2 Scenario 1

Users going online: 500,000 online users.

A single WS-Gateway established 16,000 connections per second, and each user occupied 47 KB of memory.

4.3 Scenario 2

Test duration 15 minutes, 500,000 online users, a push to all users every 5 s, with user receipts. Push content (omitted).

After 5 minutes the service restarted abnormally because memory usage exceeded the limit. Analysis of the cause of the memory overrun:

The newly added broadcast code used 9.32% of the memory;

The part receiving user receipt messages consumed 10.38% of the memory.

The test rules were adjusted: test duration 15 minutes, 480,000 online users, a push to all users every 5 s, with user receipts. Push content (omitted).

Peak connection establishment: 10,000/s; peak receive rate: 96,000 messages/s; peak send rate: 96,000 messages/s.

4.4 Scenario 3

Test duration 15 minutes, 500,000 online users, a push to all users every 5 s, with no user receipts. Push content (omitted).

Peak connection establishment: 11,000/s; peak send rate: 100,000 messages/s; no memory overruns or other abnormal conditions occurred.

Memory consumption remained relatively high; analyzing the flame graph, most of it was spent on the broadcast operation performed every 5 s.

4.5 Scenario 4

Test duration 15 minutes, 500,000 online users, a push to all users every 5 s, with user receipts, and 40,000 users going online or offline per second. Push content (omitted).

Peak connection establishment: 18,570/s; peak receive rate: 329,949 messages/s; peak send rate: 393,542 messages/s; no abnormal conditions.

4.6 Summary

Under hardware conditions of 16 cores and 32 GB of memory, with 500,000 connections on a single machine, across the four scenarios above covering users going online and offline, message receipts, and so on, memory and CPU consumption met expectations, and the service remained stable over longer runs. This satisfies the resource-saving requirements at the current scale, and feature development can continue on this basis.

5 Summary

With the ever-growing number of users, reforming the gateway service was imperative. This reconstruction mainly involved:

Decoupling the gateway service from the business services, removing the dependence on Nginx, and making the overall architecture clearer.

Analyzing the overall flow from a user establishing a connection to the underlying service pushing a message, and optimizing each of these steps. The following aspects give the 2.0 gateway lower resource consumption, lower per-user memory cost, and a more complete monitoring and alerting system, making the gateway service itself more reliable:

A downgradable handshake process;

Socket ID generation;

Optimization of the client heartbeat-handling process;

Custom headers that avoid message decoding and strengthen link tracing and monitoring;

Optimization of the message receiving and sending logic and of the code structure for transmission;

Object resource pooling with caching to reduce GC frequency;

Serialization and compression of the message body;

Integration with the service-observability infrastructure to ensure service stability.

While ensuring gateway performance, the calls from underlying component services to the gateway service were converged: the previous HTTP, Redis, Kafka, and other channels were unified into gRPC calls, ensuring that call sources are verifiable and controllable and laying a better foundation for subsequent business access.

6 Q&A

Some discussion of content related to this article is included below:

6.1 The Value of the Socket ID

Question: Based on my understanding of the value of the Socket ID, Kafka consumers need to find the corresponding TCP connection according to the Socket ID. Since you already have a custom gateway, what is the point of introducing Kafka? Message ordering? Why not do load balancing at the gateway layer and let the nodes communicate directly with the clients? In addition, I guess the producer side needs to hash by Socket ID and send to the corresponding partition; once the initial partition count is set too low, the clients and servers have to restart or upgrade. I don't see what introducing Kafka gains; instead, it greatly increases the complexity and maintenance cost of the architecture, and the scalability is not that good either. If these were short-lived HTTP connections, it would be understandable.

Answer: The SLB (load balancer) is simply not drawn in the figure. We do not make the Socket ID correspond to a particular partition; Kafka's role is to let messages be pushed without the gateway having to care about ordering internally. Without Kafka, if a component or the gateway performed a rolling update while a user reconnects, messages could be lost. For messages that do require ordering, the gateway can identify the cmd information in the header, find the corresponding backend, and distribute the message.

6.2 Redis for Message Broadcasting

Question: The broadcast payload is about 1K, the business scenario is simple and fixed, and historical business logic must remain compatible, so Redis was finally chosen for the broadcast. But the API and the gateway interact through Kafka; what does that mean?

Answer: The gateway nodes consume Kafka in cluster (consumer-group) mode. Under K8s, implementing a broadcast mode with Kafka is more troublesome, so the old gateway used Redis for Pub/Sub broadcast; to stay compatible with the old logic, Redis is still used for the broadcast. At the same time, we intend to interconnect the API and WS-Gateway nodes pairwise and broadcast through gRPC streams, which offers better scalability.

7 Technical Links

Kafka, Redis, MySQL client monitoring SDK: