Dockerized Redis performance on CentOS 7.5

7.2.2019 Research

ThreatMark AFS (Anti Fraud Suite) is a system that delivers real-time insights into user behavior and the risk associated with every user action within digital banking or similar applications. Like other enterprise systems, AFS uses several open source components. Using such components naturally brings challenges around their deployment, maintenance, and performance.

Redis is one of these components. AFS relies on Redis as a cache and as storage for session data, making it a critical piece of our infrastructure. Previously, Redis was deployed, along with the database, directly on the OS (virtualized or bare-metal).

In an effort to converge towards a cloud-native infrastructure, we decided that Redis is a good candidate for dockerization. The initial step is borderline trivial – Redis is already published on Docker Hub as a versioned image, which allows us to pull it, run it and be done with it. Or not? Before modifying the Redis deployment, we decided to run multiple tests to see whether the container overhead is low enough to be tolerable. And that is the point where things turn sour.
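
To put "pull it, run it" in concrete terms, trying the stock image is just a couple of commands (the image tag and container name below are illustrative, not our actual deployment):

docker pull redis:5.0
docker run -d --name redis -p 6379:6379 redis:5.0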

Our infrastructure runs on CentOS, but developers are free to use whatever OS they desire. As it happens, we had a laptop running Fedora 28 on which we could compare the performance of containerized Redis to the production environment. The main test we’ll be looking into is redis-benchmark, and the results of that test surprised everyone.

Check it yourself (the benchmark is limited to the GET command to keep the examples shorter, but this behavior is reproducible across all commands):

Fedora 28:

[root@localhost redis]# cat /etc/redhat-release
Fedora release 28 (Twenty Eight)
[root@localhost redis]# docker exec -it redis_redis_1 /usr/local/bin/redis-benchmark -t get -n 1000000

====== GET ======

1000000 requests completed in 13.01 seconds
50 parallel clients
3 bytes payload
keep alive: 1
99.10%   <= 1 milliseconds
99.96%   <= 2 milliseconds
100.00%  <= 3 milliseconds
100.00%  <= 4 milliseconds
76869.86 requests per second

CentOS 7.5:

 
[root@localhost redis]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[root@localhost redis]# docker exec -it redis_redis_1 /usr/local/bin/redis-benchmark -t get -n 1000000

====== GET ======

1000000 requests completed in 29.80 seconds
50 parallel clients
3 bytes payload
keep alive: 1

93.82%   <= 1 milliseconds
99.74%   <= 2 milliseconds
99.98%   <= 3 milliseconds
100.00%  <= 4 milliseconds
33559.30 requests per second

The Docker version is the same on both systems:

Fedora:

[root@localhost ~]# rpm -qa | grep docker
docker-ce-18.06.1.ce-3.fc28.x86_64

CentOS:

[root@localhost ~]# rpm -qa | grep docker
docker-ce-18.06.1.ce-3.el7.x86_64
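
The kernel versions, on the other hand, differ substantially (as we’ll see, this is what ends up mattering); checking them takes a single command on each machine:

uname -r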

Digging Deeper

Redis on CentOS could only deliver slightly over one third of the GETs per second that we would see on Fedora. Also consider that Fedora was running on an older laptop, whereas CentOS is running on a beefy machine in a datacenter. Let’s investigate what is going on within the system using the kernel’s perf framework. When using perf, the first step should be looking at the top subcommand.
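
A typical invocation might look like the following (attaching to the redis-server process is just one way to cut down on noise; a plain system-wide perf top works as well):

perf top -p $(pgrep -n redis-server)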

CentOS:


Samples: 81K of event 'cpu-clock', Event count (approx.): 14626844596

Overhead  Shared Object            Symbol
  24.87%  [kernel]                 [k] sk_run_filter
   6.79%  [kernel]                 [k] _raw_spin_unlock_irqrestore
   4.78%  [kernel]                 [k] system_call_after_swapgs
   4.50%  [kernel]                 [k] ipt_do_table
   3.49%  [kernel]                 [k] __do_softirq

Fedora:

Samples: 26K of event 'cpu-clock', 4000 Hz, Event count (approx.): 6212691323

Overhead  Shared Object            Symbol
   9.48%  [kernel]                 [k] ipt_do_table
   6.38%  [kernel]                 [k] _raw_spin_unlock_irqrestore
   5.35%  [kernel]                 [k] do_syscall_64
   2.40%  [kernel]                 [k] __softirqentry_text_start
   1.86%  libc-2.24.so             [.] epoll_ctl

Comparing the output of perf top on these machines, Redis on CentOS is clearly spending most of its CPU cycles in sk_run_filter. What even is sk_run_filter? Let’s head into the kernel. Luckily (at least this time) sk_run_filter isn’t a secret per se: it is the function that evaluates classic BPF filters, which is also the path taken when a process runs under seccomp with a filter attached (see more at lwn).
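
As a sanity check, we can confirm that the containerized redis-server really is running under seccomp by looking at the Seccomp field in its /proc status entry; a value of 2 means seccomp filter mode is active (the pgrep-based PID lookup below is just a convenience):

grep Seccomp /proc/$(pgrep -n redis-server)/status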

One possible solution at this point would be to switch the Redis server to Fedora and call it a day, but that comes with two drawbacks:

  • no one wants to manage Fedora in production
  • we still wouldn’t understand the cause of the problem

Resolution

Dismantling the docker source RPMs on both systems doesn’t reveal any difference in the seccomp profile used, which narrows our search down to the main culprit: the kernel. As the changelog between the 3.10 kernel used in CentOS and the 4.18 kernel used in Fedora 28 is seemingly endless, we decided against tracking down the specific change that hinders performance and opted for a simpler solution: disabling seccomp for the Redis containers.
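
For the curious, the profile comparison boils down to extracting both source RPMs and diffing the default seccomp profile bundled with the engine sources (the package file names and directory names below are illustrative; in the docker/moby source tree the profile lives at profiles/seccomp/default.json):

rpm2cpio docker-ce-18.06.1.ce-3.el7.src.rpm | cpio -idm
rpm2cpio docker-ce-18.06.1.ce-3.fc28.src.rpm | cpio -idm
# after unpacking the bundled source tarballs:
diff centos/profiles/seccomp/default.json fedora/profiles/seccomp/default.json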

Disabling seccomp is a matter of a single step. In the case of docker:

docker run ... --security-opt=seccomp:unconfined ...

or in the case of docker-compose:

services:
  redis:
    ...
    security_opt:
      - seccomp:unconfined
    ...

After disabling seccomp for the Redis container, the results look the way we initially expected:

Fedora:


[root@localhost redis]# cat /etc/redhat-release
Fedora release 28 (Twenty Eight)
[root@localhost redis]# docker exec -it redis_redis_1 /usr/local/bin/redis-benchmark -t get -n 1000000

====== GET ======

1000000 requests completed in 15.52 seconds
50 parallel clients
3 bytes payload
keep alive: 1

97.33%   <= 1 milliseconds
99.78%   <= 2 milliseconds
99.97%   <= 3 milliseconds
99.99%   <= 4 milliseconds
100.00%  <= 5 milliseconds
100.00%  <= 6 milliseconds
100.00%  <= 7 milliseconds
64437.14 requests per second

CentOS:


[root@localhost redis]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
[root@localhost redis]# docker exec -it redis_redis_1 /usr/local/bin/redis-benchmark -t get -n 1000000

====== GET ======

1000000 requests completed in 15.65 seconds
50 parallel clients
3 bytes payload
keep alive: 1

98.63%   <= 1 milliseconds
99.91%   <= 2 milliseconds
99.99%   <= 3 milliseconds
99.99%   <= 4 milliseconds
100.00%  <= 5 milliseconds
100.00%  <= 6 milliseconds
63918.18 requests per second

Would you like to work with us?

Would you be interested in solving multidisciplinary problems? We’re always on the lookout for problem solvers who want to get their hands dirty with the challenges that software and modern infrastructure bring us.

For available positions, see https://www.threatmark.com/career/ or contact us directly if your position doesn’t exist (yet).

Martin Polednik