Performance-tuning your API Gateway

admin, 2025-01-09

As a client-facing engineer, I'm often asked "How fast is the API Gateway?". My answer is always: "damn fast, and feature packed too!". But how do you get the optimum performance from your installation?

In this post we’ll be walking through step-by-step how you can get some insanely low latency and high throughput out of a tiny commodity server. We will get down to the nitty gritty and show you some actual numbers to help effectively tune and size your Gateway installation.

I will be showing you how to:

  • Increase throughput from ~5000 requests per second to ~6400 requests per second
  • Reduce 95th %ile gateway latency by ~54% (9.2ms to 4.2ms)
  • Reduce CPU consumption by ~40%

In order to achieve all of the above, we will be tweaking just a single environment variable: GOGC.

We will be deploying Gateway, Redis Database, Upstream Server & Load Generator ALL to a single commodity 2-core DigitalOcean Droplet.

In client production deployments, we have seen Gateways performing TLS termination, Authentication, Rate-Limiting & Quota checking plus analytics recording alongside Statsd instrumentation – with the Gateway introducing just 1.5ms of latency at a sustained mission critical 60k transactions per second.

This performance blog is not supposed to be scientifically accurate – if you want to do that, feel free to re-create in a more sanitised environment and let me know your results over on the Community.

Does that sound like fun? Let’s get benching!

But first, how not to deploy

Showcase WordPress themes are almost always gorilla-slow: they are very large, with assets that run to megabytes by the time the browser has downloaded them. In real life, you will never use every single asset that the theme provides. You will likely minify, apply a few tweaks and enable only what you need for the production site.

With this in mind, I've seen a couple of benchmarks of the Gateway in the wild, and it is instantly clear that the deployment option chosen was a quick-start installation. While this is perfect for a few key use-cases (see next paragraph), it's important to remember that, much like a WordPress theme, the easiest & fastest way to deploy is never going to be optimally configured.

The installation I refer to is the Pro Docker Demo, which is an all-in-one installation of the professional edition. It is great at showing off the features & functionality of an enterprise API Management Platform that can be deployed pretty much anywhere Docker is available, but the amount of resource contention involved when benchmarking against this docker-compose micro-service architecture should not be surprising.

By design, the platform is deployed as a collection of cloud-native micro-services, all working harmoniously together. It is possibly the only API management solution in the wild that uses a micro-services architecture to manage your micro-services. As such, taking an out-of-the-box docker-compose file and benchmarking it, without taking the time to understand how the components interact with each other, will almost certainly give sub-optimal and inconsistent results.

Sizing & Configuration

This short section will provide some insights on how to go about sizing the Gateway & its dependencies.

The API Gateway is possibly the lightest and most performant solution in the industry, its only dependency being a small Redis Server.

That's right: there is no need for MongoDB, Cassandra, Postgres or any of these heavy, computationally & memory expensive databases for an Open Source installation. In AWS, you could simply use ElastiCache; in GCP, you could use Memorystore.

Redis

Redis requires enough RAM to hold your tokens and temporary analytics, plus fast networking.

A default Redis installation will persist to disk in a non-optimal fashion and, in doing so, is likely to choke the Gateway. For simplicity, we would recommend disabling AOF to reduce disk space usage. SAVE locks the database while it writes a snapshot, so use BGSAVE instead, which snapshots in a non-blocking fashion.
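As a rough sketch of the two recommendations above, the following helper issues the corresponding commands through redis-cli. The function name and the REDIS_CLI override variable are hypothetical and exist only for this illustration; `CONFIG SET appendonly no` and `BGSAVE` are standard Redis commands.

```shell
# REDIS_CLI is a hypothetical override hook: set it to "echo" for a dry run.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

relax_redis_persistence() {
    # Switch off the append-only file on the running server.
    "$REDIS_CLI" CONFIG SET appendonly no
    # Snapshot in a background child process rather than with a blocking SAVE.
    "$REDIS_CLI" BGSAVE
}
```

Calling `relax_redis_persistence` against a local Redis applies the change at runtime only. Settings made with CONFIG SET are lost on restart unless they are also written to redis.conf.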

In a real-world scenario, we recommend a Redis Cluster deployment, which allows for linear performance improvements by sharding the data. This is especially useful when you enable authentication and handle rate-limits and quotas. With Redis Cluster, you also get High Availability baked in, with automatic failover (see the Introduction to Redis Cluster). The Gateway can easily speak with a Redis Cluster or a single Redis instance.

For the purposes of this article, and to illustrate my point, we are simply deploying a single default configuration Redis master node on the same tiny Digital Ocean droplet as the gateway.

Open Source API Gateway

The open source API Gateway requires CPU, a small amount of RAM and fast networking. The Gateway is multi-threaded and, as such, will by default consume all the CPU made available to it by the host operating system when under load.

We would always recommend compute optimised nodes if deploying to a cloud provider like DO, Linode, AWS, Azure or GCP. Cloud providers usually provide a good balance of RAM to CPU, and RAM should not really be a concern for the Gateway to perform optimally, unless you intend to also run RAM heavy applications on the same node.

Once Gateways start consuming 60-70% CPU, we would recommend a scaling event. In the event that a node fails or is taken out for maintenance, then the remaining nodes in the cluster should have enough CPU remaining to handle the traffic from the failed node without suffering significant performance degradation.
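To see why 60-70% is a sensible trigger, here is a back-of-the-envelope sketch of the CPU each surviving node carries after one node is lost. It assumes traffic is spread evenly and that CPU scales linearly with traffic, neither of which is guaranteed in practice. The function name is made up for this example.

```shell
# utilisation_after_failure NODES CPU_PERCENT
# Prints the per-node CPU percentage once one of NODES nodes is lost,
# assuming even load balancing and CPU that scales linearly with traffic.
utilisation_after_failure() {
    local nodes=$1 cpu=$2
    if [ "$nodes" -lt 2 ]; then
        echo "need at least 2 nodes" >&2
        return 1
    fi
    # Total work (nodes * cpu) is shared by the survivors; round up.
    echo $(( (nodes * cpu + nodes - 2) / (nodes - 1) ))
}
```

Under these assumptions, `utilisation_after_failure 3 60` prints 90: a three-node cluster running at 60% survives the loss of one node, but only just. A result above 100 means the remaining nodes would be overloaded.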

Installation

We will deploy a commodity DigitalOcean server with 4GB RAM and 2 vCPUs for this benchmark. I will use a one-click-app with Docker deployed on Ubuntu 18.04 for speed of deployment.

Going against everything I have said above, I will be deploying upstream, load server, Gateway and Redis all on the same box.

This post is about how we can easily tune Gateway performance, so the fact that all services are deployed on the same machine (resource contention) only proves that the Gateway can go significantly faster if configured appropriately. You would get better numbers by moving the target upstream and load generating servers to their own boxes, or by setting appropriate resource limits if deploying inside Kubernetes, for example.

Infrastructure / Application Setup

First, we install & startup NGinX which will be our baseline upstream server – you could use any performant upstream server you like really.

# install nginx
apt install nginx
systemctl start nginx

# Install Redis
apt install redis-server
systemctl start redis-server

# Install Gateway
apt install curl gnupg apt-transport-https
curl -L https://packagecloud.io//-gateway/gpgkey | sudo apt-key add -
cat > /etc/apt/sources.list.d/_-gateway.list <<- SOURCES
deb https://packagecloud.io/app//-gateway/search?dist=ubuntu trusty main
deb-src https://packagecloud.io/app//-gateway/search?dist=ubuntu trusty main
SOURCES

apt update
apt install -gateway
systemctl start -gateway

# check it's up
curl -i localhost:8080/hello
HTTP/1.1 200 OK
Date: Sun, 03 Feb 2019 22:25:21 GMT
Content-Length: 10
Content-Type: text/plain; charset=utf-8

Hello Tiki

Gateway Proxy Configuration

# set env var to get ip address of eth1 easily
export privateip=$(ifconfig eth1 | grep "inet " | awk '{print $2}')

# configure a keyless API to override sample application
# (the default nginx install serves plain HTTP on port 80)
cat > /opt/-gateway/apps/app_sample.json <<- EOF
{
    "name": "nginx",
    "api_id": "nginx",
    "org_id": "default",
    "definition": {
        "location": "",
        "key": ""
    },
    "use_keyless": true,
    "auth": {
        "auth_header_name": ""
    },
    "version_data": {
        "not_versioned": true,
        "versions": {
            "Default": {
                "name": "Default",
                "expires": "3000-01-02 15:04",
                "use_extended_paths": true,
                "extended_paths": {
                    "ignored": [],
                    "white_list": [],
                    "black_list": []
                }
            }
        }
    },
    "proxy": {
        "listen_path": "/nginx/",
        "target_url": "http://$privateip/",
        "strip_listen_path": true
    },
    "do_not_track": true
}
EOF

# reload Gateway configurations via its control API
curl -H "x--authorization: 352d20ee67be67f6340b4c0605b044b7" \
  localhost:8080//reload/

# or just restart the service given that we are not in production
systemctl restart -gateway

# test it
curl -i localhost:8080/nginx/
HTTP/1.1 200 OK
Content-Type: text/html
Date: Sun, 03 Feb 2019 22:38:53 GMT
Etag: W/"5c5747c8-264"
Last-Modified: Sun, 03 Feb 2019 19:58:00 GMT
Server: nginx/1.14.0 (Ubuntu)
X-Ratelimit-Limit: 0
X-Ratelimit-Remaining: 0
X-Ratelimit-Reset: 0
Content-Length: 612

<!DOCTYPE html>
<html>
--- SNIP ---

As you can see, this is an out-of-the-box installation.

Setting up benchmarking

To benchmark the Gateway, we will use docker-hey (https://hub.docker.com/r/rcmorano/docker-hey).

# fix firewall rules to allow docker0 access to host machine
iptables -A INPUT -i docker0 -j ACCEPT
docker pull rcmorano/docker-hey

Baseline NginX Upstream

# benchmark nginx directly to act as a baseline
docker run --rm -it rcmorano/docker-hey -z 30s http://$privateip/
Summary:
  Total: 30.0029 secs
  Slowest: 0.0305 secs
  Fastest: 0.0001 secs
  Average: 0.0029 secs
  Requests/sec: 17244.7455
Response time histogram:
  0.000 [1] |
  0.003 [303947] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [193680] |■■■■■■■■■■■■■■■■■■■■■■■■■
  0.009 [16511] |■■
  0.012 [2364] |
  0.015 [562] |
  0.018 [202] |
  0.021 [38] |
  0.024 [66] |
  0.027 [20] |
  0.031 [1] |
Latency distribution:
  10% in 0.0007 secs
  25% in 0.0016 secs
  50% in 0.0027 secs
  75% in 0.0040 secs
  90% in 0.0050 secs
  95% in 0.0058 secs
  99% in 0.0083 secs
Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0001 secs, 0.0305 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write: 0.0000 secs, 0.0000 secs, 0.0225 secs
  resp wait: 0.0026 secs, 0.0000 secs, 0.0241 secs
  resp read: 0.0002 secs, 0.0000 secs, 0.0220 secs
Status code distribution:
  [200] 517392 responses

So, out of the box, NginX performs as follows – even though docker-hey is also consuming CPU and eating system resources.

Slowest 30.5ms
Average 2.9ms
Median 2.7ms
95%ile 5.8ms

Baseline Reverse Proxy to NginX

So now that we have our baseline, we should see how the Gateway performs out of the box as a transparent reverse proxy to NginX, even with some resource contention.

docker run --rm -it rcmorano/docker-hey -z 30s http://$privateip:8080/nginx/
Summary:
  Total: 30.0035 secs
  Slowest: 0.0768 secs
  Fastest: 0.0003 secs
  Average: 0.0099 secs
  Requests/sec: 5056.8813
Response time histogram:
  0.000 [1] |
  0.008 [67948] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.016 [58333] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.023 [22119] |■■■■■■■■■■■■■
  0.031 [2519] |■
  0.039 [544] |
  0.046 [198] |
  0.054 [33] |
  0.062 [18] |
  0.069 [6] |
  0.077 [5] |
Latency distribution:
  10% in 0.0034 secs
  25% in 0.0055 secs
  50% in 0.0087 secs
  75% in 0.0135 secs
  90% in 0.0177 secs
  95% in 0.0203 secs
  99% in 0.0268 secs
Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0003 secs, 0.0768 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write: 0.0000 secs, 0.0000 secs, 0.0022 secs
  resp wait: 0.0098 secs, 0.0002 secs, 0.0768 secs
  resp read: 0.0000 secs, 0.0000 secs, 0.0054 secs
Status code distribution:
  [200] 151724 responses

Metric    Baseline Latency (ms)   Total Latency (ms)   Introduced Latency (ms)
Slowest   30.5                    76.8                 +46.3
Average   2.9                     9.9                  +7.0
Median    2.7                     8.7                  +6.0
95%ile    5.8                     20.3                 +14.5
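The "introduced latency" figures in this post are simply the latency measured through the Gateway minus the NginX baseline. If you want to reproduce them from your own runs, a small sketch like the one below may help. Both helper names are made up for this example, and the parser assumes hey prints its percentile lines in the "95% in 0.0203 secs" form shown above.

```shell
# p95_ms: read hey output on stdin, print the 95th percentile in milliseconds.
p95_ms() {
    LC_ALL=C awk '$1 == "95%" && $2 == "in" { printf "%.1f\n", $3 * 1000 }'
}

# introduced_ms TOTAL_MS BASELINE_MS: latency attributable to the proxy hop.
introduced_ms() {
    LC_ALL=C awk -v total="$1" -v base="$2" 'BEGIN { printf "%.1f\n", total - base }'
}
```

For example, piping a saved hey report through `p95_ms` gives the total 95th percentile, and `introduced_ms 20.3 5.8` then yields the 14.5ms shown in the table.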

Not bad: the Gateway is able to handle 5000 rps out of the box despite the resource contention. This tiny machine is working very hard and all the CPU is being consumed.

The problem we have here is that our test is extremely aggressive. The median latency is acceptable, but there are some outliers that drag the average up. The slowest request introduces 46ms of latency, which for me is unacceptable.

Resource Contention

Now in a real world scenario, you simply wouldn’t hammer the Gateway – the added latency for some of the requests is just not acceptable. Let’s throttle our rps a little.

By default, docker-hey has 50 concurrent clients. The Gateway can handle 5k rps max in this setup. So by turning down the requests per second by just 5%, let’s see how latency improves, ensuring that requests come back in a timely and consistent fashion. We will set the rate limit per consumer to 95.

Rate limit per consumer = 5000/50 * 95% = 95
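The same arithmetic can be wrapped in a small helper for when you repeat the exercise with different numbers. This is only a sketch; the function name is hypothetical, and the result is rounded down because hey's -q flag is a per-worker rate.

```shell
# per_worker_qps MAX_RPS WORKERS PERCENT
# Prints the per-worker rate limit to pass to hey's -q flag, rounded down.
per_worker_qps() {
    local max_rps=$1 workers=$2 percent=$3
    echo $(( max_rps * percent / (workers * 100) ))
}
```

`per_worker_qps 5000 50 95` prints 95, matching the calculation above.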

docker run --rm -it rcmorano/docker-hey -z 30s -q 95 http://$privateip:8080/nginx/
Summary:
  Total: 30.0138 secs
  Slowest: 0.0506 secs
  Fastest: 0.0002 secs
  Average: 0.0069 secs
  Requests/sec: 4713.9667
Response time histogram:
  0.000 [1] |
  0.005 [58984] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.010 [57735] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.015 [18420] |■■■■■■■■■■■■
  0.020 [5468] |■■■■
  0.025 [740] |■
  0.030 [110] |
  0.036 [22] |
  0.041 [2] |
  0.046 [1] |
  0.051 [1] |
Latency distribution:
  10% in 0.0027 secs
  25% in 0.0040 secs
  50% in 0.0060 secs
  75% in 0.0090 secs
  90% in 0.0127 secs
  95% in 0.0150 secs
  99% in 0.0192 secs
Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0002 secs, 0.0506 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write: 0.0000 secs, 0.0000 secs, 0.0032 secs
  resp wait: 0.0069 secs, 0.0002 secs, 0.0506 secs
  resp read: 0.0000 secs, 0.0000 secs, 0.0121 secs
Status code distribution:
  [200] 141484 responses

Whilst we lost just shy of 350 requests per second, look what we gained!

Metric    Baseline Latency (ms)   Total Latency (ms)   Introduced Latency (ms)
Slowest   30.5                    50.6                 +20.1
Average   2.9                     6.9                  +4.0
Median    2.7                     6.0                  +3.3
95%ile    5.8                     15.0                 +9.2

Latency is now at a much more acceptable level – by not overloading the CPU. Notice how we have still not actually done any tuning, but have still managed to reduce CPU consumption by more than 10%.

Gateway is written in Go

Unlike various API Management solutions, the Gateway is written in pure Go. Go is a modern systems programming language, actively developed by Google to solve Google-scale problems. The Gateway does not rely on 3rd party software such as NginX to do all the heavy lifting. Go provides type safety and also offers a Garbage Collector.

The Go programming language has an enviably performant concurrent, tri-colour, mark-sweep garbage collector, based on an algorithm first proposed by Edsger W. Dijkstra in 1978. If you want the details, see https://dl.acm.org/citation.cfm?id=359655. At a very high level, stop-the-world GC pauses are now well below 10ms.

Despite the awesome GC, at such high throughput a noticeable share of compute is now being spent on scheduling the garbage collector and collecting that garbage.

At the expense of a little extra RAM usage, we can tweak the garbage collector so that it runs less often. The default value is 100%, and each time we double it, the collection frequency is halved. I'm going to set the value to 3200%. This can be achieved simply by setting a system environment variable, GOGC.

It's worth playing around with this figure for your specific use-case. Don't set it too high, or you will run out of RAM and crash the VM. There will also come a point where you don't get noticeable gains. I doubled the GOGC value on each test to find the optimum figure, and that's how I arrived at 3200%.
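To make the trade-off concrete, the sketch below prints the doubling schedule from the default of 100 up to 3200. It relies on a simplified reading of Go's GOGC semantics: a new collection starts once the heap has grown by GOGC percent over the live heap, so the heap ceiling is roughly (1 + GOGC/100) times the live heap. That approximation ignores GC roots and any memory limit, and the function name is invented for this example.

```shell
# gogc_schedule START MAX
# Prints one line per doubling: the GOGC value, the approximate heap ceiling as a
# multiple of the live heap (1 + GOGC/100), and the GC frequency relative to START.
gogc_schedule() {
    local gogc=$1 max=$2 divisor=1
    while [ "$gogc" -le "$max" ]; do
        echo "GOGC=$gogc heap ~$(( 1 + gogc / 100 ))x live, GC frequency 1/$divisor"
        gogc=$(( gogc * 2 ))
        divisor=$(( divisor * 2 ))
    done
}
```

Running `gogc_schedule 100 3200` ends at a heap of roughly 33 times the live heap with about 1/32 of the default collection frequency, which is why RAM, not CPU, becomes the thing to watch as the value climbs.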

Tuning the Gateway's Garbage Collector

vi /lib/systemd/system/-gateway.service

# add this line inside the [Service] section
Environment="GOGC=3200"

# save, quit and restart daemon and service
systemctl daemon-reload
systemctl restart -gateway
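An alternative to editing the packaged unit file, which is not what this post did, is a systemd drop-in. Drop-ins live under /etc/systemd/system and are not overwritten when the package is upgraded. The helper below is a sketch only: the function name, the gogc.conf file name and the optional directory argument are all assumptions made for this example, and the unit name is whatever your gateway service is called.

```shell
# write_gogc_dropin UNIT_NAME GOGC_VALUE [SYSTEMD_DIR]
# SYSTEMD_DIR defaults to /etc/systemd/system; it is a parameter only so the
# function can be tried out against a scratch directory first.
write_gogc_dropin() {
    local unit=$1 gogc=$2 dir=${3:-/etc/systemd/system}
    mkdir -p "$dir/$unit.service.d"
    printf '[Service]\nEnvironment="GOGC=%s"\n' "$gogc" > "$dir/$unit.service.d/gogc.conf"
}
```

After writing the drop-in you would still run `systemctl daemon-reload` and restart the service, exactly as above.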

 

Let’s re-run the benchmark

docker run --rm -it rcmorano/docker-hey -z 30s -q 95 http://$privateip:8080/nginx/
Summary:
  Total: 30.0065 secs
  Slowest: 0.0425 secs
  Fastest: 0.0003 secs
  Average: 0.0048 secs
  Requests/sec: 4739.7088
Response time histogram:
  0.000 [1] |
  0.004 [69149] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.009 [69266] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.013 [2489] |■
  0.017 [841] |
  0.021 [373] |
  0.026 [67] |
  0.030 [16] |
  0.034 [4] |
  0.038 [9] |
  0.043 [7] |
Latency distribution:
  10% in 0.0027 secs
  25% in 0.0035 secs
  50% in 0.0045 secs
  75% in 0.0056 secs
  90% in 0.0067 secs
  95% in 0.0077 secs
  99% in 0.0126 secs
Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0003 secs, 0.0425 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write: 0.0000 secs, 0.0000 secs, 0.0045 secs
  resp wait: 0.0047 secs, 0.0002 secs, 0.0425 secs
  resp read: 0.0000 secs, 0.0000 secs, 0.0052 secs
Status code distribution:
  [200] 142222 responses

Holy Cow – we just dropped from 82% CPU usage down to less than 60%! We are now using 26% less CPU than the previous test and about 40% less CPU than our first bench.
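The percentages quoted here are plain relative reductions, (before - after) / before. A one-line helper, sketched below with a made-up name, lets you check them against your own measurements. Treating the first, unthrottled run as 100% CPU is an assumption based on the earlier remark that all the CPU was being consumed.

```shell
# pct_reduction BEFORE AFTER: percentage drop from BEFORE to AFTER, one decimal place.
pct_reduction() {
    LC_ALL=C awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}
```

`pct_reduction 82 60` gives 26.8 and `pct_reduction 100 60` gives 40.0, in line with the figures above. The same helper works for latency: `pct_reduction 9.2 4.2` gives 54.3.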

The Gateway is still handling the same RPS, but rather than wasting compute on collecting garbage, it can spend that compute on proxying requests.

Metric    Baseline Latency (ms)   Total Latency (ms)   Introduced Latency (ms)
Slowest   30.5                    42.5                 +12.0
Average   2.9                     4.8                  +1.9
Median    2.7                     4.5                  +1.8
95%ile    5.8                     7.7                  +1.9

The slowest request is just 12ms slower than the slowest baseline request. The Gateway is now adding sub-2ms latency at the 95th %ile.

At 60-70% CPU usage we would recommend adding another Gateway node to the cluster. But this is a benchmarking blog, so let's up the RPS. Let's try full speed again to see what the Gateway can handle when maxing out.

docker run --rm -it rcmorano/docker-hey -z 30s http://$privateip:8080/nginx/
Summary:
  Total: 30.0050 secs
  Slowest: 0.0955 secs
  Fastest: 0.0003 secs
  Average: 0.0072 secs
  Requests/sec: 6897.5780
Response time histogram:
  0.000 [1] |
  0.010 [165543] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.019 [38325] |■■■■■■■■■
  0.029 [2665] |■
  0.038 [332] |
  0.048 [75] |
  0.057 [6] |
  0.067 [8] |
  0.076 [2] |
  0.086 [3] |
  0.096 [2] |
Latency distribution:
  10% in 0.0030 secs
  25% in 0.0046 secs
  50% in 0.0066 secs
  75% in 0.0091 secs
  90% in 0.0120 secs
  95% in 0.0144 secs
  99% in 0.0212 secs
Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0003 secs, 0.0955 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write: 0.0000 secs, 0.0000 secs, 0.0042 secs
  resp wait: 0.0072 secs, 0.0002 secs, 0.0955 secs
  resp read: 0.0000 secs, 0.0000 secs, 0.0059 secs
Status code distribution:
  [200] 206962 responses

So the Gateway is now handling 6900 rps rather than the original 5000 rps (roughly a 36% increase), and latency is still pretty good, adding just 8.6ms at the 95th %ile. That's a massive 40% improvement in latency over the out-of-the-box installation, which added 14.5ms.

Metric    Baseline Latency (ms)   Total Latency (ms)   Introduced Latency (ms)
Slowest   30.5                    95.5                 +65.0
Average   2.9                     7.2                  +4.3
Median    2.7                     6.6                  +3.9
95%ile    5.8                     14.4                 +8.6

Let's dial back the throughput once again by about 5% so that we are not overloading the CPUs: we set -q to 131 (6900/50*0.95).

docker run --rm -it rcmorano/docker-hey -z 30s -q 131 http://$privateip:8080/nginx/
Summary:
  Total: 30.0058 secs
  Slowest: 0.0473 secs
  Fastest: 0.0003 secs
  Average: 0.0052 secs
  Requests/sec: 6412.0916
Response time histogram:
  0.000 [1] |
  0.005 [105806] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.010 [76053] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.014 [7891] |■■■
  0.019 [1834] |■
  0.024 [563] |
  0.028 [194] |
  0.033 [52] |
  0.038 [4] |
  0.043 [1] |
  0.047 [1] |
Latency distribution:
  10% in 0.0027 secs
  25% in 0.0036 secs
  50% in 0.0047 secs
  75% in 0.0060 secs
  90% in 0.0079 secs
  95% in 0.0100 secs
  99% in 0.0158 secs
Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0003 secs, 0.0473 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
  req write: 0.0000 secs, 0.0000 secs, 0.0026 secs
  resp wait: 0.0051 secs, 0.0002 secs, 0.0473 secs
  resp read: 0.0000 secs, 0.0000 secs, 0.0034 secs
Status code distribution:
  [200] 192400 responses

Metric    Baseline Latency (ms)   Total Latency (ms)   Introduced Latency (ms)
Slowest   30.5                    47.3                 +16.8
Average   2.9                     5.2                  +2.3
Median    2.7                     4.7                  +2.0
95%ile    5.8                     10.0                 +4.2

CPU is at a much healthier 80%, we are still handling 6400 requests per second (1400 rps more than the original) and we are introducing just 2.3ms average latency, or 4.2ms at the 95th %ile. That is a 54% improvement on the throttled default installation, which introduced 9.2ms at the 95th %ile.

Summary

It is important to remember that everything has been deployed on a 2-virtual-core commodity server with just 4GB of RAM.

Because we still have significant resource contention from the upstream server, load server, Redis and the Gateway all deployed on the same box, there is no way of knowing the true performance of the gateway without splitting out the components onto their own machines, or dedicating compute to each component.

Despite all of this, by tweaking a single environment variable, the Gateway has been able to handle 6400 requests per second. 95% of requests come back within 10ms and, after subtracting NginX's baseline latency, the Gateway has introduced just 4.2ms of 95th %ile latency to the request.

To top that, all the services combined are only consuming 80% CPU and less than 600MB RAM. Further optimisations can be made to the configuration to tune it in accordance with your specific use-case.

Want to give it a go? We’d love to hear your results! Drop us a line via the Community to share your own benchmarks, and help us continue to improve performance.

Happy benching!
