What? Connection reset by peer?
We run Node.js web services behind an AWS Classic Load Balancer. After I migrated from the Classic Load Balancer to an Application Load Balancer, I noticed many 502 errors. To understand what was happening, I added Nginx in front of the Node.js web server, and then found more than 100 'connection reset' errors every day in the Nginx logs.
Here are some example logs:
2017/11/12 06:11:15 [error] 7#7: *2904 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:27 [error] 7#7: *2950 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:31 [error] 7#7: *2962 upstream prematurely closed connection while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:44 [error] 7#7: *3005 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:47 [error] 7#7: *3012 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
Analyzing the errors
The number of errors increased after I migrated from the Classic LB to the Application LB. One of the differences between them is that the Classic LB uses pre-opened connections, while the Application LB only uses the HTTP/1.1 Keep-Alive feature.
From the AWS Load Balancer documentation:
Possible causes:
- The load balancer received a TCP RST from the target when attempting to establish a connection.
- The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target.
- The target response is malformed or contains HTTP headers that are not valid.
- A new target group was used but no targets have passed an initial health check yet. A target must pass one health check to be considered healthy.
Because there were no errors on the Node.js side, I guessed it was caused by the keep-alive behaviour. The current Nginx version (1.13.6) has no upstream keep-alive timeout setting, so I tried Tengine, a Taobao fork of Nginx that supports an upstream keepalive timeout. After running it for a couple of days, there were no such errors any more. To understand what was causing the issue, I then tried to reproduce it on my local machine.
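The fix, in short, is to make Nginx close its idle upstream connections before the upstream server does. A sketch of the upstream block, assuming the Node.js server keeps idle connections for 5 s (its default); the `keepalive_timeout` directive inside `upstream` is supported by Tengine, and was later added to stock Nginx in 1.15.3:

```nginx
upstream nodejs {
  server chalice:8000;
  keepalive 16;
  # Close idle pooled connections before the upstream does; must be
  # shorter than the Node.js server's keepAliveTimeout (5 s by default).
  keepalive_timeout 3s;
}
```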
Capture network packages
In order to send a request at the exact moment Node.js closes the connection after the keep-alive timeout, I kept requesting the URL until the issue reproduced. Here are the settings for my test environment:
Upstream (Node.js server):
- Set keep-alive timeout to 500 ms
Test client:
- Keep sending requests at an interval
- The interval starts at 500 ms and decreases by 1 ms after each request
For normal requests, the upstream sends a [FIN, ACK] to Nginx after the keep-alive timeout (500 ms), Nginx replies with a [FIN, ACK], and the upstream then sends an [ACK] to close the connection completely.
No.  Time                        Source      Destination  Protocol  Length  Info
1    2017-11-12 17:11:04.299146  172.18.0.3  172.18.0.2   TCP       74      48528 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=0 WS=128
2    2017-11-12 17:11:04.299171  172.18.0.2  172.18.0.3   TCP       74      8000 → 48528 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=32031305 WS=128
3    2017-11-12 17:11:04.299194  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32031305 TSecr=32031305
4    2017-11-12 17:11:04.299259  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
5    2017-11-12 17:11:04.299267  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32031305 TSecr=32031305
6    2017-11-12 17:11:04.299809  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
7    2017-11-12 17:11:04.299852  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32031305 TSecr=32031305
8    2017-11-12 17:11:04.800805  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [FIN, ACK] Seq=206 Ack=176 Win=30080 Len=0 TSval=32031355 TSecr=32031305
9    2017-11-12 17:11:04.801120  172.18.0.3  172.18.0.2   TCP       66      48528 → 8000 [FIN, ACK] Seq=176 Ack=207 Win=30336 Len=0 TSval=32031355 TSecr=32031355
10   2017-11-12 17:11:04.801151  172.18.0.2  172.18.0.3   TCP       66      8000 → 48528 [ACK] Seq=207 Ack=177 Win=30080 Len=0 TSval=32031355 TSecr=32031355
For the failed requests, the upstream closed the connection after the keep-alive timeout (500 ms), but the client sent a new HTTP request before it had received and processed the [FIN] packet. Because the connection had already been closed from the upstream's perspective, the upstream responded to this request with a [RST]. This can happen in the following scenarios:
- The upstream hasn't sent the [FIN] packet yet (the packet is pending at the network layer)
- The upstream has sent the [FIN] packet, but the client hasn't received it yet
- The client has received the [FIN] packet, but hasn't processed it yet
Example of the first scenario, where the [FIN] hasn't been sent yet:
No.  Time                        Source      Destination  Protocol  Length  Info
433  2017-11-12 17:11:26.548449  172.18.0.3  172.18.0.2   TCP       74      48702 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=0 WS=128
434  2017-11-12 17:11:26.548476  172.18.0.2  172.18.0.3   TCP       74      8000 → 48702 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=32033530 WS=128
435  2017-11-12 17:11:26.548502  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32033530 TSecr=32033530
436  2017-11-12 17:11:26.548609  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
437  2017-11-12 17:11:26.548618  172.18.0.2  172.18.0.3   TCP       66      8000 → 48702 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32033530 TSecr=32033530
438  2017-11-12 17:11:26.549173  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
439  2017-11-12 17:11:26.549230  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32033530 TSecr=32033530
440  2017-11-12 17:11:27.049668  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
441  2017-11-12 17:11:27.050324  172.18.0.2  172.18.0.3   HTTP      271     HTTP/1.1 200 OK (text/html)
442  2017-11-12 17:11:27.050378  172.18.0.3  172.18.0.2   TCP       66      48702 → 8000 [ACK] Seq=351 Ack=411 Win=31360 Len=0 TSval=32033580 TSecr=32033580
443  2017-11-12 17:11:27.551182  172.18.0.3  172.18.0.2   HTTP      241     GET /_healthcheck HTTP/1.1
444  2017-11-12 17:11:27.551294  172.18.0.2  172.18.0.3   TCP       66      8000 → 48702 [RST, ACK] Seq=411 Ack=526 Win=32256 Len=0 TSval=32033630 TSecr=32033630
Example of the second scenario, where the [FIN] was sent at the same time a new request arrived:
No.  Time                        Source     Destination  Protocol  Length  Info
13   2018-06-15 21:40:00.522110  127.0.0.1  127.0.0.1    TCP       68      50678 > 8000 [SYN] Seq=0 Win=65535 Len=0 MSS=16344 WS=32 TSval=1503957438 TSecr=0 SACK_PERM=1
14   2018-06-15 21:40:00.522210  127.0.0.1  127.0.0.1    TCP       68      8000 > 50678 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=32 TSval=1503957438 TSecr=1503957438 SACK_PERM=1
15   2018-06-15 21:40:00.522219  127.0.0.1  127.0.0.1    TCP       56      50678 > 8000 [ACK] Seq=1 Ack=1 Win=408288 Len=0 TSval=1503957438 TSecr=1503957438
16   2018-06-15 21:40:00.522228  127.0.0.1  127.0.0.1    TCP       56      [TCP Window Update] 8000 > 50678 [ACK] Seq=1 Ack=1 Win=408288 Len=0 TSval=1503957438 TSecr=1503957438
17   2018-06-15 21:40:00.522315  127.0.0.1  127.0.0.1    HTTP      189     GET / HTTP/1.1
18   2018-06-15 21:40:00.522358  127.0.0.1  127.0.0.1    TCP       56      8000 > 50678 [ACK] Seq=1 Ack=134 Win=408160 Len=0 TSval=1503957438 TSecr=1503957438
19   2018-06-15 21:40:00.522727  127.0.0.1  127.0.0.1    HTTP      261     HTTP/1.1 200 OK (text/html)
20   2018-06-15 21:40:00.522773  127.0.0.1  127.0.0.1    TCP       56      50678 > 8000 [ACK] Seq=134 Ack=206 Win=408064 Len=0 TSval=1503957438 TSecr=1503957438
21   2018-06-15 21:40:01.025685  127.0.0.1  127.0.0.1    HTTP      189     GET / HTTP/1.1
22   2018-06-15 21:40:01.025687  127.0.0.1  127.0.0.1    TCP       56      8000 > 50678 [FIN, ACK] Seq=206 Ack=134 Win=408160 Len=0 TSval=1503957939 TSecr=1503957438
23   2018-06-15 21:40:01.025748  127.0.0.1  127.0.0.1    TCP       44      8000 > 50678 [RST] Seq=206 Win=0 Len=0
24   2018-06-15 21:40:01.025760  127.0.0.1  127.0.0.1    TCP       56      50678 > 8000 [ACK] Seq=267 Ack=207 Win=408064 Len=0 TSval=1503957939 TSecr=1503957939
25   2018-06-15 21:40:01.025769  127.0.0.1  127.0.0.1    TCP       44      8000 > 50678 [RST] Seq=207 Win=0 Len=0
When the client receives the [RST] packet, it logs a 'Connection reset' error.
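Because the request never reached the application, one common client-side mitigation is to treat a reset on a reused keep-alive socket as retryable, for idempotent requests such as this GET health check. A minimal sketch; `sendRequest` here is a hypothetical stand-in for any Promise-returning request function:

```javascript
// Error codes that suggest the pooled keep-alive socket died under us.
// Only safe to retry for idempotent requests (GET, HEAD, ...).
const RETRYABLE = new Set(['ECONNRESET', 'EPIPE']);

// Wrap a Promise-returning request function and retry it (once, by
// default) when the failure looks like a stale keep-alive socket.
function withRetry(sendRequest, attempts = 2) {
  return sendRequest().catch(err => {
    if (attempts > 1 && RETRYABLE.has(err.code)) {
      return withRetry(sendRequest, attempts - 1);
    }
    throw err;
  });
}
```

This only papers over the race; shrinking the window with proper keep-alive timeout ordering (below) is the real fix.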
Testing Code
This is a generic issue that occurs when the server closes a connection while HTTP keep-alive is enabled, so you can easily reproduce it by cloning the example code (Node.js) from https://github.com/weixu365/test-connection-reset
npm install
npm start

# In a separate terminal
npm run client
- Upstream Node.js Web Server
const express = require('express');
const app = express();

app.get('/', (req, res) => res.send('OK'));

const port = process.env.PORT || 8000;
const server = app.listen(port, () => {
  console.log(`Listening on http://localhost:${port}`);
});
server.keepAliveTimeout = 500;
- Test client in Node.js
const axios = require('axios');
const Bluebird = require('bluebird');
const HttpAgent = require('agentkeepalive');

const keepAliveOption = {
  freeSocketKeepAliveTimeout: 30 * 1000, // Should be less than the server keep-alive timeout
  socketActiveTTL: 50 * 1000, // Should be less than the DNS TTL
};
const httpAgent = new HttpAgent(keepAliveOption);

const host = 'http://localhost:8000';
const path = '/';

const httpClient = axios.create({
  baseURL: host,
  timeout: 5000,
});

const sendRequest = () =>
  httpClient.request({ url: path, httpAgent })
    .then(res => {
      console.log('Received response', res.status);
    })
    .catch(e => {
      console.error('Error occurred', e.message);
    });

let delay = 501;
const start = () =>
  sendRequest()
    .then(() => delay -= 1)
    .then(() => delay > 450 ? Bluebird.delay(delay).then(start) : 'Done');

start();
- Capture network packages
tcpdump -i eth0 tcp port 8000 -w /tmp/connection.pcap
- Nginx config (Optional)
upstream nodejs {
  least_conn;
  server chalice:8000 fail_timeout=1s max_fails=3;
  keepalive 16;
}

server_tokens off;

log_format detailed escape=json '{'
  '"timestamp": "$time_iso8601",'
  '"remote_addr": "$remote_addr",'
  '"upstream_addr": "$upstream_addr",'
  '"connection": "$connection",'
  '"connection_requests": "$connection_requests",'
  '"request_time": "$request_time",'
  '"upstream_response_time": "$upstream_response_time",'
  '"status": "$status",'
  '"upstream_status": "$upstream_status",'
  '"body_bytes_sent": "$body_bytes_sent",'
  '"request": "$request",'
  '"http_user_agent": "$http_user_agent"'
'}';

access_log /var/log/nginx/access.log detailed;

server {
  listen 80;
  server_name localhost;

  location / {
    proxy_http_version 1.1;
    proxy_redirect off;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_set_header X-NginX-Proxy true;
    proxy_set_header Connection "";
    proxy_pass http://nodejs;
  }
}