Featured

    Analyze 'Connection reset' error in Nginx upstream with keep-alive enabled

    What? Connection reset by peer?

    We run Node.js web services behind an AWS Classic Load Balancer. After migrating from the Classic Load Balancer to an Application Load Balancer, I noticed a large number of 502 errors. To understand what was happening, I added Nginx in front of the Node.js web server, and then found more than 100 'connection reset' errors every day in the Nginx logs.

    Here are some example logs:

    2017/11/12 06:11:15 [error] 7#7: *2904 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
    2017/11/12 06:11:27 [error] 7#7: *2950 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
    2017/11/12 06:11:31 [error] 7#7: *2962 upstream prematurely closed connection while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
    2017/11/12 06:11:44 [error] 7#7: *3005 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
    2017/11/12 06:11:47 [error] 7#7: *3012 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "GET /_healthcheck HTTP/1.1", upstream: "http://172.18.0.2:8000/_healthcheck", host: "localhost"
    

    Analyzing the errors

    The number of errors increased after I migrated from the Classic LB to the Application LB. One of the differences between them is that the Classic LB uses pre-open connections, while the Application LB relies only on the HTTP/1.1 keep-alive feature.

    From the AWS Load Balancer documentation:

    Possible causes:
    • The load balancer received a TCP RST from the target when attempting to establish a connection.
    • The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target.
    • The target response is malformed or contains HTTP headers that are not valid.
    • A new target group was used but no targets have passed an initial health check yet. A target must pass one health check to be considered healthy.

    Continue reading

    Featured

    Notes on playing with ptrace on 64-bit Ubuntu 12.10

    These are my notes from working through "Playing with ptrace" (http://www.linuxjournal.com/article/6100).

    The original examples were written for 32-bit machines and don't work on my 64-bit Ubuntu 12.10.

    Let's start from the first ptrace example:

    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <stdio.h>        /* For printf */
    #include <linux/user.h>   /* For constants
                                       ORIG_EAX etc */
    int main()
    {   pid_t child;
        long orig_eax;
        child = fork();
        if(child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execl("/bin/ls", "ls", NULL);
        }
        else {
            wait(NULL);
            orig_eax = ptrace(PTRACE_PEEKUSER,
                              child, 4 * ORIG_EAX,
                              NULL);
            printf("The child made a "
                   "system call %ld\n", orig_eax);
            ptrace(PTRACE_CONT, child, NULL, NULL);
        }
        return 0;
    }
    

    The compiler shows the following error:

    fatal error: 'linux/user.h' file not found
    #include <linux/user.h>
    

    A couple of things need to change:

    1. 'linux/user.h' no longer exists
    2. The 64-bit registers are R*X, so EAX becomes RAX

    There are two solutions to fix this: Continue reading

    Metrics Driven Development - What I did to reduce AWS EC2 costs to 27% and improve latency by 25%

    Recently, I did some work on auto-scaling and performance tuning. As a result, costs were reduced to 27% of the original and service latency improved by 25%.

    Overall Instance Count And Service Latency Change

    Takeaways

    • React server-side rendering does not perform well under the Node.js cluster module; consider using a reverse proxy such as Nginx instead
    • React v16 server-side rendering is much faster than v15 (40% faster in our case)
    • Use smaller instances to get better scaling granularity if possible, e.g. change c4.2xlarge to c4.large
    • AWS t2.large is 3 times slower than c4.large at React server-side rendering
    • AWS Lambda is 3 times slower than c4.large at React server-side rendering
    • There's a race condition in the Nginx http upstream keepalive module which generates 502 Bad Gateway errors (104: Connection reset by peer)

    Background

    Here's the background of the service before optimization:

    • Serving 6000 requests per minute
    • Using an AWS Classic Load Balancer
    • Running 25 c3.2xlarge EC2 instances, each with an 8-core CPU
    • Using PM2 as the process manager and cluster manager
    • Written in Node.js, using React 15 server-side rendering Continue reading

    Improve performance of React Server Side Render by warming up service

    Background

    We're using ElasticBeanstalk Blue/Green deployment by swapping DNS, and in front of the deployed services there's a company-level Nginx that forwards requests to the different services. The TTL of the DNS entry is set to 1 minute, and Nginx caches the resolved name entries until they expire. So each time we deploy, all requests hit the new environment once the DNS cache in Nginx expires.

    React Server Render time without warming up

    The response time increases a lot for the first couple of minutes and becomes stable after 5 minutes. Because the overall response time is also affected by upstream services, it's better to analyze and improve the React server render time, which is a synchronous method call that involves no IO.

    Here's the initial reactServerRender time for Home Page and Search Result Page:

    For the Home page, it took 2-3 minutes for the reactRender time to drop from 450 - 550 ms to 120 ms.
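One simple way to warm an instance up before it takes traffic is to call the render path repeatedly until it is fast. The sketch below is my own illustration, not the post's actual code; renderPage is a hypothetical stand-in for the service's synchronous render entry point:

```javascript
// Warm up a synchronous render function by calling it repeatedly
// until a single call finishes under the target threshold.
function warmUp(renderPage, { maxRounds = 500, targetMs = 120 } = {}) {
  for (let round = 1; round <= maxRounds; round++) {
    const start = process.hrtime.bigint();
    renderPage();
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    if (elapsedMs < targetMs) return round; // rounds needed to warm up
  }
  return maxRounds;
}
```

Once warmUp returns, the instance can be attached to the load balancer (or the DNS swapped) without serving the slow first renders to users.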

    Continue reading

    How To: Create Subscription Filter in CloudWatch using serverless

    Recently, I worked on a task that needed to collect all CloudWatch logs into a Kinesis stream. The project uses Serverless for deployment. There are some plugins that create CloudWatch Logs subscription filters, but none of them use Kinesis as the destination.

    Then, by using the serverless-scriptable-plugin, I was able to do this very easily. The following code finds all CloudWatch LogGroups and creates a SubscriptionFilter for each of them.

    Create a file at build/serverless/add-log-subscriptions.js Continue reading
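The actual script is behind the link; a minimal sketch of the idea, assuming we walk the compiled CloudFormation template's Resources map and add an AWS::Logs::SubscriptionFilter per AWS::Logs::LogGroup (the ARNs below are placeholders), might look like this:

```javascript
// For each AWS::Logs::LogGroup in a CloudFormation resources map,
// add an AWS::Logs::SubscriptionFilter pointing at a Kinesis stream.
function addSubscriptionFilters(resources, destinationArn, roleArn) {
  Object.keys(resources)
    .filter((key) => resources[key].Type === 'AWS::Logs::LogGroup')
    .forEach((key) => {
      resources[`${key}SubscriptionFilter`] = {
        Type: 'AWS::Logs::SubscriptionFilter',
        Properties: {
          DestinationArn: destinationArn,
          FilterPattern: '', // empty pattern forwards every log event
          LogGroupName: { Ref: key },
          RoleArn: roleArn,
        },
      };
    });
  return resources;
}
```

With serverless-scriptable-plugin this would run in a build hook against serverless.service.provider.compiledCloudFormationTemplate.Resources; check the plugin's documentation for the exact hook name.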

    Troubleshooting blocked requests when fetching messages from hundreds of SQS queues

    I'm working on a project that needs to fetch messages from hundreds of SQS queues. We're using SQS long polling to reduce the number of empty responses. At first, with only a dozen queues, responses came back quickly. As we added more and more queues, performance got worse and worse: with 300 queues and WaitTimeSeconds set to 10 seconds, it takes 60 seconds to get a response.

    We are using Node.js in single-threaded mode, and I believed it could handle ten thousand connections without any problem, because most of the work is IO. We also created an AWS support case, but nobody clearly answered the question.

    Using the AWS SDK to reproduce the issue

    I started troubleshooting; the first step was to reproduce the issue with a simple piece of code, which makes it easier to pinpoint the problem. Continue reading
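For what it's worth, the numbers above already fit a simple model in which long polls are serialized through a fixed-size connection pool (the Node.js aws-sdk's HTTP agent historically defaulted to around 50 sockets; that figure is my assumption, not from the post): 300 queues over 50 sockets means 6 sequential batches of 10-second long polls, i.e. 60 seconds.

```javascript
// If long polls queue behind a fixed-size socket pool, the total time
// is the number of sequential batches times the long-poll wait time.
function expectedLongPollSeconds(queueCount, maxSockets, waitTimeSeconds) {
  const batches = Math.ceil(queueCount / maxSockets);
  return batches * waitTimeSeconds;
}

// expectedLongPollSeconds(300, 50, 10) === 60, matching the observed delay
```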

    How to: convert asciidoc book to epub/mobi formats

    Many open source books are written in asciidoc; to read a book on Kindle, I have to convert it to a mobi file. Here's a quick note on how to convert the files.

    1. Use asciidoctor to convert the asciidoc book to DocBook
    2. Use pandoc to convert the DocBook file to epub
    3. Use Calibre to convert the epub to mobi

    asciidoctor -d book -b docbook5 book.asciidoc -o output.docbook
    pandoc -f docbook -t epub output.docbook -o book.epub
    

    Then use Calibre to convert the epub file to mobi.

    AWS Lambda retry behaviours on stream-based event sources

    According to the documentation, AWS Lambda retries a failed function on stream-based event sources.

    With Node.js, we can fail the function in many different ways, e.g. returning an error via the callback, throwing an exception directly, throwing an exception inside a Promise, or using Promise.reject. The question then is: what's the proper way to let AWS Lambda know it needs to retry?

    I did a quick test of the following scenarios by setting up a DynamoDB Stream and event mappings. It's fun to guess which ones will be retried and which won't.

    Different ways to end the function

    • On Exception
    module.exports.throwException = (event, context) => {
      console.log(JSON.stringify(event));
      throw new Error('something wrong');
    };
    

    Continue reading

    How to clean up branches in local Git repository

    The branches in a local git repository grow rapidly if you use branch-based development, e.g. always creating a new branch for each feature/story/bug-fix.

    A branch becomes useless after it is merged to master; here are some commands to clean up branches in the local repository.

    Remove remote-tracking branches that have been deleted on the remote repository

    Use either of the following commands to remove remote-tracking branches whose remote counterparts have been deleted, e.g. branches deleted after merging to master:

      git fetch --prune
      git pull --prune
    

    If you are on git 1.8.5+, you can make pruning happen automatically on every fetch (and therefore every pull):

    git config fetch.prune true
    

    Remove local branches whose remote branch is gone

    List the local branches whose remote tracking branch is gone:

    git branch -vv | grep '\[origin/[^ ]*: gone\]' \
      | tr -s ' ' | cut -d' ' -f2 | grep -v master
    

    Remove the branches listed above (use -D instead of -d to force-delete branches that git considers unmerged):

    git branch -vv | grep '\[origin/[^ ]*: gone\]' \
      | tr -s ' ' | cut -d' ' -f2 | grep -v master \
      | xargs git branch -d
    

    Capture console output when using child_process.execSync in node.js

    I've been working on a Node.js project recently and needed to execute a command using child_process.execSync().

    The ideal solution should:

    • Support colored output from the command, which generally means the child's stdout/stderr should be a tty
    • Be testable, which means I can capture the command output and verify it

    From the Node.js documentation, I can use options.stdio to configure the stdio file descriptors in the child process.

    For convenience, options.stdio may be one of the following strings:

    • 'pipe' - equivalent to ['pipe', 'pipe', 'pipe'] (the default)
    • 'ignore' - equivalent to ['ignore', 'ignore', 'ignore']
    • 'inherit' - equivalent to [process.stdin, process.stdout, process.stderr] or [0, 1, 2]

    I started with the default behaviour, 'pipe', which returns the console output after the child process finishes. But it doesn't support colours by default, because a piped stdout/stderr is not a tty. Continue reading