What? Connection reset by peer?
We are running Node.js web services behind an AWS Classic Load Balancer, and I noticed many 502 errors after migrating it to an Application Load Balancer. To understand what was happening, I added Nginx in front of the Node.js web server, and then found more than 100 'connection reset' errors every day in the Nginx logs.
Here are some example logs:
2017/11/12 06:11:15 [error] 7#7: *2904 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:27 [error] 7#7: *2950 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:31 [error] 7#7: *2962 upstream prematurely closed connection while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:44 [error] 7#7: *3005 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:47 [error] 7#7: *3012 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
Analyzing the errors
The number of errors increased after I migrated from the Classic LB to the Application LB. One of the differences between them is that the Classic LB uses pre-open connections, while the Application LB relies only on the HTTP/1.1 keep-alive feature.
From the AWS Load Balancer documentation:
Possible causes:
- The load balancer received a TCP RST from the target when attempting to establish a connection.
- The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target.
- The target response is malformed or contains HTTP headers that are not valid.
- A new target group was used but no targets have passed an initial health check yet. A target must pass one health check to be considered healthy.
Because there were no errors on the Node.js side, I guessed it was caused by the keep-alive behaviour. The current Nginx version (1.13.6) has no upstream keep-alive timeout setting, so I tried Tengine, a Taobao fork of Nginx that supports an upstream keepalive timeout. After running it for a couple of days, the errors were gone. To understand what was causing the issue, I tried to reproduce it on my local machine.
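For reference, a sketch of the Tengine upstream configuration, assuming its `keepalive_timeout` directive from the upstream keepalive module (newer mainline Nginx, 1.15.3+, later gained the same directive). The value must be shorter than the upstream server's own keep-alive timeout, so Nginx always closes the idle connection first:

```nginx
upstream nodejs {
    server chalice:8000 fail_timeout=1s max_fails=3;
    keepalive 16;
    # Tengine-specific here: close idle upstream connections before the
    # Node.js keep-alive timeout fires. E.g. 4s if Node.js uses its 5 s
    # default keepAliveTimeout.
    keepalive_timeout 4s;
}
```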
Capturing network packets
To send a request at the exact moment Node.js closes the connection after the keep-alive timeout, I needed to keep requesting the URL until the issue reproduced. Here are my settings for the testing environment:
Upstream (Node.js server):
- Set keep-alive timeout to 500 ms
Test client:
- Keep sending requests with an interval
- The interval starts at 500 ms and decreases by 1 ms after each request
For normal requests, the upstream sends a [FIN, ACK] to Nginx after the keep-alive timeout (500 ms), Nginx sends a [FIN, ACK] back, and then the upstream sends an [ACK] to close the connection completely.
No. Time Source Destination Protocol Length Info
1 2017-11-12 17:11:04.299146 172.18.0.3 172.18.0.2 TCP 74 48528 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=0 WS=128
2 2017-11-12 17:11:04.299171 172.18.0.2 172.18.0.3 TCP 74 8000 → 48528 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=32031305 WS=128
3 2017-11-12 17:11:04.299194 172.18.0.3 172.18.0.2 TCP 66 48528 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32031305 TSecr=32031305
4 2017-11-12 17:11:04.299259 172.18.0.3 172.18.0.2 HTTP 241 GET /_healthcheck HTTP/1.1
5 2017-11-12 17:11:04.299267 172.18.0.2 172.18.0.3 TCP 66 8000 → 48528 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32031305 TSecr=32031305
6 2017-11-12 17:11:04.299809 172.18.0.2 172.18.0.3 HTTP 271 HTTP/1.1 200 OK (text/html)
7 2017-11-12 17:11:04.299852 172.18.0.3 172.18.0.2 TCP 66 48528 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32031305 TSecr=32031305
8 2017-11-12 17:11:04.800805 172.18.0.2 172.18.0.3 TCP 66 8000 → 48528 [FIN, ACK] Seq=206 Ack=176 Win=30080 Len=0 TSval=32031355 TSecr=32031305
9 2017-11-12 17:11:04.801120 172.18.0.3 172.18.0.2 TCP 66 48528 → 8000 [FIN, ACK] Seq=176 Ack=207 Win=30336 Len=0 TSval=32031355 TSecr=32031355
10 2017-11-12 17:11:04.801151 172.18.0.2 172.18.0.3 TCP 66 8000 → 48528 [ACK] Seq=207 Ack=177 Win=30080 Len=0 TSval=32031355 TSecr=32031355
For the failed requests, the upstream closed the connection after the keep-alive timeout (500 ms), but the client sent a new HTTP request before it had received and processed the [FIN] packet. Because the connection had already been closed from the upstream's perspective, the upstream replied to this request with a [RST]. This can happen in the following scenarios:
- The upstream hasn't sent the [FIN] packet yet (the packet is pending at the network layer)
- The upstream has sent the [FIN] packet, but the client hasn't received it yet
- The client has received the [FIN] packet, but hasn't processed it yet
Example of the first scenario, where the upstream hasn't sent the [FIN] yet:
No. Time Source Destination Protocol Length Info
433 2017-11-12 17:11:26.548449 172.18.0.3 172.18.0.2 TCP 74 48702 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=0 WS=128
434 2017-11-12 17:11:26.548476 172.18.0.2 172.18.0.3 TCP 74 8000 → 48702 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=32033530 WS=128
435 2017-11-12 17:11:26.548502 172.18.0.3 172.18.0.2 TCP 66 48702 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32033530 TSecr=32033530
436 2017-11-12 17:11:26.548609 172.18.0.3 172.18.0.2 HTTP 241 GET /_healthcheck HTTP/1.1
437 2017-11-12 17:11:26.548618 172.18.0.2 172.18.0.3 TCP 66 8000 → 48702 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32033530 TSecr=32033530
438 2017-11-12 17:11:26.549173 172.18.0.2 172.18.0.3 HTTP 271 HTTP/1.1 200 OK (text/html)
439 2017-11-12 17:11:26.549230 172.18.0.3 172.18.0.2 TCP 66 48702 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32033530 TSecr=32033530
440 2017-11-12 17:11:27.049668 172.18.0.3 172.18.0.2 HTTP 241 GET /_healthcheck HTTP/1.1
441 2017-11-12 17:11:27.050324 172.18.0.2 172.18.0.3 HTTP 271 HTTP/1.1 200 OK (text/html)
442 2017-11-12 17:11:27.050378 172.18.0.3 172.18.0.2 TCP 66 48702 → 8000 [ACK] Seq=351 Ack=411 Win=31360 Len=0 TSval=32033580 TSecr=32033580
443 2017-11-12 17:11:27.551182 172.18.0.3 172.18.0.2 HTTP 241 GET /_healthcheck HTTP/1.1
444 2017-11-12 17:11:27.551294 172.18.0.2 172.18.0.3 TCP 66 8000 → 48702 [RST, ACK] Seq=411 Ack=526 Win=32256 Len=0 TSval=32033630 TSecr=32033630
Example of the second scenario, where the upstream sent the [FIN] at the same time it received a new request:
No. Time Source Destination Protocol Length Info
13 2018-06-15 21:40:00.522110 127.0.0.1 127.0.0.1 TCP 68 50678 > 8000 [SYN] Seq=0 Win=65535 Len=0 MSS=16344 WS=32 TSval=1503957438 TSecr=0 SACK_PERM=1
14 2018-06-15 21:40:00.522210 127.0.0.1 127.0.0.1 TCP 68 8000 > 50678 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=32 TSval=1503957438 TSecr=1503957438 SACK_PERM=1
15 2018-06-15 21:40:00.522219 127.0.0.1 127.0.0.1 TCP 56 50678 > 8000 [ACK] Seq=1 Ack=1 Win=408288 Len=0 TSval=1503957438 TSecr=1503957438
16 2018-06-15 21:40:00.522228 127.0.0.1 127.0.0.1 TCP 56 [TCP Window Update] 8000 > 50678 [ACK] Seq=1 Ack=1 Win=408288 Len=0 TSval=1503957438 TSecr=1503957438
17 2018-06-15 21:40:00.522315 127.0.0.1 127.0.0.1 HTTP 189 GET / HTTP/1.1
18 2018-06-15 21:40:00.522358 127.0.0.1 127.0.0.1 TCP 56 8000 > 50678 [ACK] Seq=1 Ack=134 Win=408160 Len=0 TSval=1503957438 TSecr=1503957438
19 2018-06-15 21:40:00.522727 127.0.0.1 127.0.0.1 HTTP 261 HTTP/1.1 200 OK (text/html)
20 2018-06-15 21:40:00.522773 127.0.0.1 127.0.0.1 TCP 56 50678 > 8000 [ACK] Seq=134 Ack=206 Win=408064 Len=0 TSval=1503957438 TSecr=1503957438
21 2018-06-15 21:40:01.025685 127.0.0.1 127.0.0.1 HTTP 189 GET / HTTP/1.1
22 2018-06-15 21:40:01.025687 127.0.0.1 127.0.0.1 TCP 56 8000 > 50678 [FIN, ACK] Seq=206 Ack=134 Win=408160 Len=0 TSval=1503957939 TSecr=1503957438
23 2018-06-15 21:40:01.025748 127.0.0.1 127.0.0.1 TCP 44 8000 > 50678 [RST] Seq=206 Win=0 Len=0
24 2018-06-15 21:40:01.025760 127.0.0.1 127.0.0.1 TCP 56 50678 > 8000 [ACK] Seq=267 Ack=207 Win=408064 Len=0 TSval=1503957939 TSecr=1503957939
25 2018-06-15 21:40:01.025769 127.0.0.1 127.0.0.1 TCP 44 8000 > 50678 [RST] Seq=207 Win=0 Len=0
When the client receives the [RST] packet, it logs a 'Connection reset' error.
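Since the rejected request never reached the application, a client-side mitigation is to retry idempotent requests once when the socket is reset. A minimal sketch (the helper name and retry policy are my own, not from the repro code):

```javascript
// Sketch: retry idempotent requests once on a connection reset.
// `sendRequest` is any function returning a promise, e.g. an axios call;
// the error codes below are the ones Node.js raises when the peer
// resets or closes the socket mid-request.
const RETRIABLE_CODES = new Set(['ECONNRESET', 'EPIPE']);

const withRetry = (sendRequest, retries = 1) =>
  sendRequest().catch(err => {
    if (retries > 0 && RETRIABLE_CODES.has(err.code)) {
      return withRetry(sendRequest, retries - 1);
    }
    throw err;
  });
```

Only retry requests that are safe to repeat (such as the GET health check here), since the server may have partially processed a request before the reset.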
Testing Code
This is a generic issue that occurs whenever the server side closes a connection while HTTP keep-alive is enabled, so you can easily reproduce it by cloning the example code (Node.js) from https://github.com/weixu365/test-connection-reset
npm install
npm start

# In a separate terminal
npm run client
- Upstream Node.js Web Server
const express = require('express');

const app = express();
app.get('/', (req, res) => res.send('OK'));

const port = process.env.PORT || 8000;
const server = app.listen(port, () => {
  console.log(`Listening on http://localhost:${port}`);
});
server.keepAliveTimeout = 500;
- Test client in Node.js
const axios = require('axios');
const Bluebird = require('bluebird');
const HttpAgent = require('agentkeepalive');
const keepAliveOption = {
  freeSocketKeepAliveTimeout: 30 * 1000, // Should be less than the server keep-alive timeout
  socketActiveTTL: 50 * 1000, // Should be less than the DNS TTL
};
const httpAgent = new HttpAgent(keepAliveOption);
let host = 'http://localhost:8000';
let path = '/';
const httpClient = axios.create({
baseURL: host,
timeout: 5000,
});
const sendRequest = () =>
  httpClient.request({
    url: path,
    httpAgent,
  })
    .then(res => {
      console.log('Received response', res.status);
    })
    .catch(e => {
      console.error('Error occurred', e.message);
    });
let delay = 501;

const start = () =>
  sendRequest()
    .then(() => delay -= 1)
    .then(() => (delay > 450 ? Bluebird.delay(delay).then(start) : 'Done'));

start();
- Capture network packets
tcpdump -i eth0 tcp port 8000 -w /tmp/connection.pcap
- Nginx config (Optional)
upstream nodejs {
least_conn;
server chalice:8000 fail_timeout=1s max_fails=3;
keepalive 16;
}
server_tokens off;
log_format detailed escape=json
'{'
'"timestamp": "$time_iso8601",'
'"remote_addr": "$remote_addr",'
'"upstream_addr": "$upstream_addr",'
'"connection": "$connection",'
'"connection_requests": "$connection_requests",'
'"request_time": "$request_time",'
'"upstream_response_time": "$upstream_response_time",'
'"status": "$status",'
'"upstream_status": "$upstream_status",'
'"body_bytes_sent": "$body_bytes_sent ",'
'"request": "$request",'
'"http_user_agent": "$http_user_agent"'
'}';
access_log /var/log/nginx/access.log detailed;
server {
listen 80;
server_name localhost;
location / {
proxy_http_version 1.1;
proxy_redirect off;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
proxy_set_header X-NginX-Proxy true;
proxy_set_header Connection "";
proxy_pass http://nodejs;
}
}