Varnish – Nothing but 503s

I use Varnish on my production server without any issues – it works quite well, and I have come to consider it an essential component in my server stack. I have recently been having a bit of trouble with a new project of mine that I currently believe a misconfigured Varnish instance to be responsible for. As such, I wanted to reproduce the problem in my test environment, but ran into a completely different problem attempting to do so.

In its most basic form, I have Nginx listening on port 81 (all interfaces), and varnish listening on port 80 (all interfaces). Essentially, Varnish receives a request – checks its cache, and either fetches the request from cache if it is a hit, or from the backend if it is a miss.

The problem was that Varnish simply did not acknowledge the existence of the backend, continually returning ‘503 Service Unavailable’. The following is an account of my diagnostics and resolution to this problem. (Please note: this is a different problem than an intermittent 503 error – I was never once able to successfully load a page through Varnish before I fixed this problem).

Environment:

  • Virtualbox 4.1.8r75647
  • Host operating system: Windows 7 SP1 (64-bit)
  • Guest operating system: CentOS 6.2 (32-bit) (minimal install)
  • Networking: bridged adapter

As a basic setup, I used CentOS 6.2 – since this most closely resembles the Amazon Linux that I use on my production box. The official Nginx and Varnish repositories were added and nginx and varnish were installed:

yum localinstall http://nginx.org/packages/centos/6/noarch/RPMS/nginx-release-centos-6-0.el6.ngx.noarch.rpm
yum localinstall --nogpgcheck http://repo.varnish-cache.org/redhat/varnish-3.0/el5/noarch/varnish-release-3.0-1.noarch.rpm
yum install nginx varnish

Configure Nginx

I created a basic test site – with a single static file, and overrode the default Nginx config:

mkdir -p /var/www/example.com/web
echo "Testing 1, 2, 3..." > /var/www/example.com/web/test.txt
> /etc/nginx/conf.d/default.conf

/etc/nginx/conf.d/default.conf:

server {
    server_name example.com www.example.com;
    listen 81 default;
    root /var/www/example.com/web;
}

You’ll note here that I am listening on port 81 instead of port 80 – this is because I intend for Varnish to be the front end server. After these changes, I restarted Nginx:

service nginx start

Verify that Nginx is running and listening on the correct port:

ps -ef | grep nginx
root      1502     1  0 07:09 ?        00:00:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx     1504  1502  0 07:09 ?        00:00:00 nginx: worker process
root      1670  1203  0 07:15 pts/0    00:00:00 grep nginx
netstat -anp | grep nginx
tcp        0      0 0.0.0.0:81                  0.0.0.0:*                   LISTEN      1502/nginx
unix  3      [ ]         STREAM     CONNECTED     15289  1502/nginx
unix  3      [ ]         STREAM     CONNECTED     15288  1502/nginx

Finally see if I can retrive the page:

curl --header "Host: example.com" 127.0.0.1:81/test.txt
Testing 1, 2, 3...

So far, everything looks good – Nginx is working without issue, as expected.

Configure Varnish

I am going for a really minimal setup here – simply trying to illustrate the problem at hand – no optimizations at all in place. Essentially, I just change the Varnish port to 80 and use malloc instead of file backed storage:

Edit /etc/sysconfig/varnish, and change the following:

VARNISH_LISTEN_PORT=80
VARNISH_STORAGE_SIZE=50M
VARNISH_STORAGE="malloc,${VARNISH_STORAGE_SIZE}"

Now, we need to point Varnish to the backend that is running on port 81:

Edit /etc/varnish/default.vcl:

backend default {
    .host = "127.0.0.1";
    .port = "81";
}

Finally, we start Varnish:

service varnish start

Confirm that it is in fact running:

ps -ef | grep varnish
root      1571     1  0 07:11 ?        00:00:00 /usr/sbin/varnishd -P /var/run/varnish.pid -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 120 -w 1,1000,120 -u varnish -g varnish -S /etc/varnish/secret -s malloc,50M
varnish   1572  1571  0 07:11 ?        00:00:00 /usr/sbin/varnishd -P /var/run/varnish.pid -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 120 -w 1,1000,120 -u varnish -g varnish -S /etc/varnish/secret -s malloc,50M
root      1668  1203  0 07:15 pts/0    00:00:00 grep varnish

Verify that it is listening on the correct port:

netstat -anp | grep varnish
tcp        0      0 0.0.0.0:80                  0.0.0.0:*                   LISTEN      1572/varnishd
tcp        0      0 127.0.0.1:6082              0.0.0.0:*                   LISTEN      1571/varnishd
tcp        0      0 :::80                       :::*                        LISTEN      1572/varnishd
unix  2      [ ]         DGRAM                    15619  1571/varnishd

Trying to retrieve the same file from port 80 (Varnish) results in the following:

curl --header "Host: example.com" 127.0.0.1:80/test.txt

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>503 Service Unavailable</title>
  </head>
  <body>
    <h1>Error 503 Service Unavailable</h1>
    <p>Service Unavailable</p>
    <h3>Guru Meditation:</h3>
    <p>XID: 1041692010</p>
    <hr>
    <p>Varnish cache server</p>
  </body>
</html>

At the same time, looking at the output of varnishlog displays:

   11 SessionOpen  c 127.0.0.1 41777 :80
   11 ReqStart     c 127.0.0.1 41777 1041692011
   11 RxRequest    c GET
   11 RxURL        c /test.txt
   11 RxProtocol   c HTTP/1.1
   11 RxHeader     c User-Agent: curl/7.19.7 (i686-pc-linux-gnu) libcurl/7.19.7 NSS/3.12.7.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
   11 RxHeader     c Accept: */*
   11 RxHeader     c Host: example.com
   11 VCL_call     c recv lookup
   11 VCL_call     c hash
   11 Hash         c /test.txt
   11 Hash         c example.com
   11 VCL_return   c hash
   11 VCL_call     c miss fetch
   11 FetchError   c no backend connection
   11 VCL_call     c error deliver
   11 VCL_call     c deliver deliver
   11 TxProtocol   c HTTP/1.1
   11 TxStatus     c 503
   11 TxResponse   c Service Unavailable
   11 TxHeader     c Server: Varnish
   11 TxHeader     c Content-Type: text/html; charset=utf-8
   11 TxHeader     c Retry-After: 5
   11 TxHeader     c Content-Length: 419
   11 TxHeader     c Accept-Ranges: bytes
   11 TxHeader     c Date: Sun, 11 Mar 2012 11:16:54 GMT
   11 TxHeader     c X-Varnish: 1041692011
   11 TxHeader     c Age: 0
   11 TxHeader     c Via: 1.1 varnish
   11 TxHeader     c Connection: close
   11 Length       c 419
   11 ReqEnd       c 1041692011 1331464614.671160936 1331464614.671626806 0.000131845 0.000425816 0.000040054
   11 SessionClose c error
   11 StatSess     c 127.0.0.1 41777 0 1 1 0 0 0 257 419

Note the line:

11 FetchError c no backend connection

Obviously though, I have a valid backend – I can reach it, and it returns a valid page. As part of my diagostics, I tried the following:

  • Disabling iptables (service iptables stop)
  • Passing the backend directly to Varnish using the -b parameter (and removing the config file).
  • Trying a different backend
  • Restarting the VM and all servers more times than I can count
  • Rebuilding the entire VM from scratch
  • Using the external IP address for the backend instead of the loopback

None of these had any effect or revealed anything useful – and Varnish didn’t log any errors either.

SELinux

For some odd reason though, I thought to try setting SELinux to permissive – and that was a success.

If you are running auditd, the actions SELinux denies will be logged to /var/log/audit/audit.log

Running grep varnish /var/log/audit/audit.log, shows the blocks that occured:

type=AVC msg=audit(1331500393.450:25): avc: denied { name_connect } for pid=1276 comm="varnishd" dest=81 scontext=unconfined_u:system_r:varnishd_t:s0 tcontext=system_u:object_r:reserved_port_t:s0 tclass=tcp_socket

The current state of SELinux enforcement can be determined with:

cat /selinux/enforce
1

Setting SELinux to temporarily permissive and retesting shows that it was, indeed, the source of the problem:

echo 0 >/selinux/enforce
curl --header "Host: example.com" 127.0.0.1:80/test.txt
Testing 1, 2, 3...

Of course, this is only a temporary fix – as soon as the server is restarted, SELinux will go back to enforcing mode.

You can setup SELinux as permanently permissive, or generate a policy from the logged denial using audit2allow.

Install audit2allow:

yum install policycoreutils-python

Running audit2allow -a -w gives the following:

type=AVC msg=audit(1331500393.450:25): avc:  denied  { name_connect } for  pid=1276 comm="varnishd" dest=81 scontext=unconfined_u:system_r:varnishd_t:s0 tcontext=system_u:object_r:reserved_port_t:s0 tclass=tcp_socket
        Was caused by:
        One of the following booleans was set incorrectly.
        Description:
        Allow varnishd to connect to all ports, not just HTTP.

        Allow access by executing:
        # setsebool -P varnishd_connect_any 1
		Description:
        Allow system to run with NIS

        Allow access by executing:
        # setsebool -P allow_ypbind 1

Running setsebool -P varnishd_connect_any 1, and trying to fetch the page through Varnish, shows that the problem has indeed been fixed, although SELinux is still in enforcing mode (I didn’t need to run the second command). The change persists through a server restart.

By cyberx86

Just a random guy who dabbles with assorted technologies yet works in a completely unrelated field.

14 comments

  1. Hi there,

    thanks for the hint. I am experiencing the same odd thing. At some time varnish just rejects to open backend connections even if they’re alive. Nothing obvious in the logs.
    Any other idea?

    greetz
    Robert

    1. The thing with the SELinux cause is that it is constant – the response is always 503. Unfortunately, I don’t have a lot to go by, so a few guesses – if it happens during times of load, check and increase your timeouts (.connect_timeout; .first_byte_timeout; .between_bytes_timeout); Give saint mode a try if you find that the timeouts improve things. One other possibility that comes to mind is that a related set of requests all result in 50x errors on the backend, causing varnish to mark the backend as invalid for a short time.

  2. Hi, I had got this the same error and I found your site.
    So, I don’t read all contents. But my site had been fixed by check error code 500 from backend web-service nginx/apache. You can try it.

    1. The problem you describe is a different one than is outlined here. Thanks for commenting, good to know you got your site working.

  3. I’m just stopping by to say thank you for this post. I’ve experienced exactly the same problem on a server, I’ve gone through all the phases of despair you went through (iptables, another ip, pulling my hair 🙂 – and it turned out to be selinux. Thank you, it’s still relevant as of 2013-04 despite the warning on the top.

    1. You’re most welcome – glad you found it useful. I automatically add the notice at the top of posts on tech subjects, since I can’t always revisit them and update their content.

Leave a comment

Your email address will not be published. Required fields are marked *