Analyze 'Connection reset' error in Nginx upstream with keep-alive enabled

What? Connection reset by peer?

We run Node.js web services behind an AWS Classic Load Balancer. After migrating from the Classic Load Balancer to an Application Load Balancer, I noticed many 502 errors. To understand what was happening, I added Nginx in front of the Node.js web server, and then found more than 100 'connection reset' errors every day in the Nginx logs.

Here are some example logs:

2017/11/12 06:11:15 [error] 7#7: *2904 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:27 [error] 7#7: *2950 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:31 [error] 7#7: *2962 upstream prematurely closed connection while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:44 [error] 7#7: *3005 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
2017/11/12 06:11:47 [error] 7#7: *3012 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"

Analyzing the errors

The number of errors increased after I migrated from the Classic LB to the Application LB. One of the differences between them is that the Classic LB uses pre-opened connections, while the Application LB only uses the HTTP/1.1 keep-alive feature.

From the AWS Load Balancer documentation:

Possible causes:
  • The load balancer received a TCP RST from the target when attempting to establish a connection.
  • The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target.
  • The target response is malformed or contains HTTP headers that are not valid.
  • A new target group was used but no targets have passed an initial health check yet. A target must pass one health check to be considered healthy.

Because there were no errors on the Node.js side, I guessed it was caused by the keep-alive behaviour. There is no upstream keep-alive timeout setting in the current Nginx version (1.13.6), so I tried Tengine, a Taobao fork of Nginx that supports an upstream keepalive timeout. After running it for a couple of days, the errors were gone. To understand what was causing the issue, I then tried to reproduce it on my local machine.
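For reference, here is a minimal sketch of what that looks like, assuming Tengine's keepalive_timeout directive inside the upstream block (stock Nginx only gained this directive later, in 1.15.3). The idea is to make Nginx drop idle upstream connections before the Node.js server does, so Nginx never reuses a connection that the server is about to close:

upstream nodejs {
  server chalice:8000 fail_timeout=1s max_fails=3;
  keepalive 16;

  # Assumption: close idle upstream connections after 300 ms, i.e. before
  # the Node.js keepAliveTimeout (500 ms in the test setup below), so the
  # reuse-vs-close race cannot happen.
  keepalive_timeout 300ms;
}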

Capture network packets

In order to send a request at the exact moment Node.js closes the connection after the keep-alive timeout, I kept requesting the URL until the issue was reproduced. Here are the settings for the testing environment:

Upstream (Node.js server):

  • Set keep-alive timeout to 500 ms

Test client:

  • Keep sending requests with an interval
  • Interval starts at 500 ms and decreases by 1 ms after each request

For normal requests, the upstream sends a [FIN, ACK] to Nginx after the keep-alive timeout (500 ms), Nginx sends a [FIN, ACK] back, and then the upstream sends an [ACK] to close the connection completely.

No.     Time                          Source                Destination           Protocol Length Info
      1 2017-11-12 17:11:04.299146    172.18.0.3            172.18.0.2            TCP      74     48528 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=0 WS=128
      2 2017-11-12 17:11:04.299171    172.18.0.2            172.18.0.3            TCP      74     8000 → 48528 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32031305 TSecr=32031305 WS=128
      3 2017-11-12 17:11:04.299194    172.18.0.3            172.18.0.2            TCP      66     48528 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32031305 TSecr=32031305
      4 2017-11-12 17:11:04.299259    172.18.0.3            172.18.0.2            HTTP     241    GET /_healthcheck HTTP/1.1 
      5 2017-11-12 17:11:04.299267    172.18.0.2            172.18.0.3            TCP      66     8000 → 48528 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32031305 TSecr=32031305
      6 2017-11-12 17:11:04.299809    172.18.0.2            172.18.0.3            HTTP     271    HTTP/1.1 200 OK  (text/html)
      7 2017-11-12 17:11:04.299852    172.18.0.3            172.18.0.2            TCP      66     48528 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32031305 TSecr=32031305
      8 2017-11-12 17:11:04.800805    172.18.0.2            172.18.0.3            TCP      66     8000 → 48528 [FIN, ACK] Seq=206 Ack=176 Win=30080 Len=0 TSval=32031355 TSecr=32031305
      9 2017-11-12 17:11:04.801120    172.18.0.3            172.18.0.2            TCP      66     48528 → 8000 [FIN, ACK] Seq=176 Ack=207 Win=30336 Len=0 TSval=32031355 TSecr=32031355
     10 2017-11-12 17:11:04.801151    172.18.0.2            172.18.0.3            TCP      66     8000 → 48528 [ACK] Seq=207 Ack=177 Win=30080 Len=0 TSval=32031355 TSecr=32031355

For the failed requests, the upstream closed the connection after the keep-alive timeout (500 ms), but the client sent a new HTTP request before it received and processed the [FIN] packet. Because the connection had already been closed from the upstream's perspective, the upstream responded to this request with a [RST]. This can happen in the following scenarios:

  • The upstream hasn't sent the [FIN] packet yet (it is still pending at the network layer)
  • The upstream has sent the [FIN] packet, but the client hasn't received it yet
  • The client has received the [FIN] packet, but hasn't processed it yet

Example of the first scenario ([FIN] not sent yet):

No.     Time                          Source                Destination           Protocol Length Info
    433 2017-11-12 17:11:26.548449    172.18.0.3            172.18.0.2            TCP      74     48702 → 8000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=0 WS=128
    434 2017-11-12 17:11:26.548476    172.18.0.2            172.18.0.3            TCP      74     8000 → 48702 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=32033530 TSecr=32033530 WS=128
    435 2017-11-12 17:11:26.548502    172.18.0.3            172.18.0.2            TCP      66     48702 → 8000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=32033530 TSecr=32033530
    436 2017-11-12 17:11:26.548609    172.18.0.3            172.18.0.2            HTTP     241    GET /_healthcheck HTTP/1.1 
    437 2017-11-12 17:11:26.548618    172.18.0.2            172.18.0.3            TCP      66     8000 → 48702 [ACK] Seq=1 Ack=176 Win=30080 Len=0 TSval=32033530 TSecr=32033530
    438 2017-11-12 17:11:26.549173    172.18.0.2            172.18.0.3            HTTP     271    HTTP/1.1 200 OK  (text/html)
    439 2017-11-12 17:11:26.549230    172.18.0.3            172.18.0.2            TCP      66     48702 → 8000 [ACK] Seq=176 Ack=206 Win=30336 Len=0 TSval=32033530 TSecr=32033530
    440 2017-11-12 17:11:27.049668    172.18.0.3            172.18.0.2            HTTP     241    GET /_healthcheck HTTP/1.1 
    441 2017-11-12 17:11:27.050324    172.18.0.2            172.18.0.3            HTTP     271    HTTP/1.1 200 OK  (text/html)
    442 2017-11-12 17:11:27.050378    172.18.0.3            172.18.0.2            TCP      66     48702 → 8000 [ACK] Seq=351 Ack=411 Win=31360 Len=0 TSval=32033580 TSecr=32033580
    443 2017-11-12 17:11:27.551182    172.18.0.3            172.18.0.2            HTTP     241    GET /_healthcheck HTTP/1.1 
    444 2017-11-12 17:11:27.551294    172.18.0.2            172.18.0.3            TCP      66     8000 → 48702 [RST, ACK] Seq=411 Ack=526 Win=32256 Len=0 TSval=32033630 TSecr=32033630

Example of the second scenario ([FIN] sent at the same time as a new request is received):

No. Time    Source  Destination Protocol    Length  Info
13  2018-06-15 21:40:00.522110  127.0.0.1   127.0.0.1   TCP 68  50678  >  8000 [SYN] Seq=0 Win=65535 Len=0 MSS=16344 WS=32 TSval=1503957438 TSecr=0 SACK_PERM=1
14  2018-06-15 21:40:00.522210  127.0.0.1   127.0.0.1   TCP 68  8000  >  50678 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=32 TSval=1503957438 TSecr=1503957438 SACK_PERM=1
15  2018-06-15 21:40:00.522219  127.0.0.1   127.0.0.1   TCP 56  50678  >  8000 [ACK] Seq=1 Ack=1 Win=408288 Len=0 TSval=1503957438 TSecr=1503957438
16  2018-06-15 21:40:00.522228  127.0.0.1   127.0.0.1   TCP 56  [TCP Window Update] 8000  >  50678 [ACK] Seq=1 Ack=1 Win=408288 Len=0 TSval=1503957438 TSecr=1503957438
17  2018-06-15 21:40:00.522315  127.0.0.1   127.0.0.1   HTTP    189 GET / HTTP/1.1 
18  2018-06-15 21:40:00.522358  127.0.0.1   127.0.0.1   TCP 56  8000  >  50678 [ACK] Seq=1 Ack=134 Win=408160 Len=0 TSval=1503957438 TSecr=1503957438
19  2018-06-15 21:40:00.522727  127.0.0.1   127.0.0.1   HTTP    261 HTTP/1.1 200 OK  (text/html)
20  2018-06-15 21:40:00.522773  127.0.0.1   127.0.0.1   TCP 56  50678  >  8000 [ACK] Seq=134 Ack=206 Win=408064 Len=0 TSval=1503957438 TSecr=1503957438
21  2018-06-15 21:40:01.025685  127.0.0.1   127.0.0.1   HTTP    189 GET / HTTP/1.1 
22  2018-06-15 21:40:01.025687  127.0.0.1   127.0.0.1   TCP 56  8000  >  50678 [FIN, ACK] Seq=206 Ack=134 Win=408160 Len=0 TSval=1503957939 TSecr=1503957438
23  2018-06-15 21:40:01.025748  127.0.0.1   127.0.0.1   TCP 44  8000  >  50678 [RST] Seq=206 Win=0 Len=0
24  2018-06-15 21:40:01.025760  127.0.0.1   127.0.0.1   TCP 56  50678  >  8000 [ACK] Seq=267 Ack=207 Win=408064 Len=0 TSval=1503957939 TSecr=1503957939
25  2018-06-15 21:40:01.025769  127.0.0.1   127.0.0.1   TCP 44  8000  >  50678 [RST] Seq=207 Win=0 Len=0

When the client receives the [RST] packet, it logs a 'Connection reset' error.
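On the client side, the same reset surfaces as an ECONNRESET socket error. Here is a minimal sketch of how a Node.js client could detect it (a hypothetical handler, not part of the test repository below):

const axios = require('axios');

axios.get('http://localhost:8000/_healthcheck')
  .catch(e => {
    // ECONNRESET is the client-side twin of Nginx's
    // "Connection reset by peer" upstream error.
    if (e.code === 'ECONNRESET') {
      // Safe to retry for idempotent requests such as a health check.
      console.error('Connection reset by peer:', e.message);
    }
  });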

Testing Code

This is a generic issue that occurs whenever the server side closes a connection while HTTP keep-alive is enabled, so you can easily reproduce it by cloning the example code (Node.js) from https://github.com/weixu365/test-connection-reset

npm install
npm start

# In a separate terminal
npm run client
  • Upstream Node.js Web Server
const express = require('express');

const app = express();

app.get('/', (req, res) => res.send('OK'));

const port = process.env.PORT || 8000;
const server = app.listen(port, () => {
  console.log(`Listening on http://localhost:${port}`);
});

// Close idle keep-alive connections after 500 ms so the race is easy to hit
server.keepAliveTimeout = 500;
  • Test client in Node.js
const axios = require('axios');
const Bluebird = require('bluebird');
const HttpAgent = require('agentkeepalive');

const keepAliveOption = {
  // In production this should be lower than the server keep-alive timeout;
  // here it is deliberately much larger (30 s vs the server's 500 ms) so the
  // client keeps reusing sockets that the server is about to close.
  freeSocketKeepAliveTimeout: 30 * 1000,
  socketActiveTTL: 50 * 1000, // Should be less than the DNS TTL
};
const httpAgent = new HttpAgent(keepAliveOption);

let host = 'http://localhost:8000';
let path = '/';

const httpClient = axios.create({
  baseURL: host,
  timeout: 5000,
});

const sendRequest = () =>
  httpClient.request({
    url: path,
    httpAgent,
  })
    .then(res => {
      console.log('Received response', res.status);
    })
    .catch(e => {
      console.error('Error occurred', e.message);
    });

// Start just above the server's 500 ms keep-alive timeout and shrink the
// interval by 1 ms after each request until the race condition is hit.
let delay = 501;
const start = () =>
  sendRequest()
    .then(() => delay -= 1)
    .then(() => delay > 450 ? Bluebird.delay(delay).then(start) : 'Done');

start();
  • Capture network packets
tcpdump -i eth0 tcp port 8000 -w /tmp/connection.pcap
  • Nginx config (Optional)
upstream nodejs {
  least_conn;
  server chalice:8000 fail_timeout=1s max_fails=3;
  keepalive 16;
}

server_tokens off;
log_format detailed escape=json
  '{' 
  '"timestamp": "$time_iso8601",'
  '"remote_addr": "$remote_addr",'
  '"upstream_addr": "$upstream_addr",'
  '"connection": "$connection",'
  '"connection_requests": "$connection_requests",'
  '"request_time": "$request_time",'
  '"upstream_response_time": "$upstream_response_time",'
  '"status": "$status",'
  '"upstream_status": "$upstream_status",'
  '"body_bytes_sent": "$body_bytes_sent ",'
  '"request": "$request",'
  '"http_user_agent": "$http_user_agent"'
  '}';
access_log /var/log/nginx/access.log detailed;

server {
  listen 80;
  server_name localhost;

  location / {
    proxy_http_version 1.1;
    proxy_redirect off;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_set_header X-NginX-Proxy true;
    proxy_set_header Connection "";

    proxy_pass http://nodejs;
  }
}
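
As an extra mitigation on the Nginx side (my own suggestion, not part of the setup above), you can make the retry behaviour explicit, so that a request that fails on a dead cached keep-alive connection is retried on a fresh connection. Note that error timeout is already the default value of proxy_next_upstream:

location / {
  # ... proxy settings as above ...

  # Retry on the next upstream connection when an error occurs while
  # connecting, sending the request, or reading the response header.
  # Since nginx 1.9.13, non-idempotent requests (POST, PATCH, ...) are
  # only retried if `non_idempotent` is added to this directive.
  proxy_next_upstream error timeout;
}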
