Production Trenches: Pitfalls and Pratfalls

Bri Hatch Personal Work
Onsight, Inc
bri@ifokr.org
ExtraHop Networks
bri@extrahop.com

Copyright 2016, Bri Hatch, Creative Commons BY-NC-SA License

Welcome To

Audience

Who should be here?

A Fail Storm

"Apache was returning blank pages Sunday morning starting at 5:23 - what was wrong?"

A Fail Storm (cont)

Looking at the logs
06/Nov 05:23:59 "GET /index.html HTTP/1.1" 200 42331 ""
06/Nov 05:24:03 "GET /thing1/ HTTP/1.1" 200 76442 "http://example.com/index.html"
06/Nov 05:24:12 "GET /thing2/ HTTP/1.1" 200 65232 "http://example.com/thing1/"
....
But he was in Eastern, server is in Central!

A Fail Storm (cont)

Looking at the right logs
06/Nov 04:23:59 "GET /stuff/ HTTP/1.1" 200 61472 "http://example.com/"
06/Nov 04:24:03 "GET /thing0/ HTTP/1.1" 200 86442 "http://example.com/index.html"
06/Nov 04:24:12 "GET /about/index.html HTTP/1.1" 200 57774 ""
....
But the local clock is also wrong!

A Fail Storm (cont)

Looking at the right logs - really this time!
06/Nov 02:00:01 "GET /thing0/ HTTP/1.1" 200 55424 "http://www.google.com"
06/Nov 02:00:03 "GET /search/ HTTP/1.1" 200 92186 "http://example.com/about/index.html"
06/Nov 02:00:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/thing1/"
....
Still everything looks good - WTF?

A Fail Storm (cont)

06/Nov 02:00:01 "GET /thing0/ HTTP/1.1" 200 55424 "http://www.google.com"
Let's parse!

$  awk '{print $7, $2, $4}' access.log | sort -n
0 02:00:04 /thing3/ 0 02:00:09 /thing0/ 0 02:00:24 /checkout/ 0 02:00:32 /checkout/ ... 55424 02:00:01 /thing0/ ... 78505 02:00:42 /thing3/ 78505 02:59:42 /thing3/ ... 92186 02:00:03 /search/
But... But...

A Fail Storm (cont)

$ grep '06/Nov 02:.* /thing3/.* 0 ' access.log | wc -l
29

$ grep '06/Nov 02:.* /thing3/.* 0 ' access.log | head -1 06/Nov 02:00:04 "GET /thing3/ HTTP/1.1" 200 0 "http://example.com/thing1/"
$ grep '06/Nov 02:.* /thing3/.* ' access.log | head -1 06/Nov 02:00:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/thing1/"

A Fail Storm (cont)

So it has come to this...
$ less access.log ... 06/Nov 02:59:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/" 06/Nov 02:59:58 "GET /cart/ HTTP/1.1" 200 42331 "http://example.com/toolbox/"
06/Nov 02:00:04 "GET /thing3/ HTTP/1.1" 200 0 "http://example.com/thing1/" 06/Nov 02:00:09 "GET /thing0/ HTTP/1.1" 200 0 "http://example.com/" 06/Nov 02:00:24 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/cart/" 06/Nov 02:00:32 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/cart/" ...
DST change! You Bastard!

A Fail Storm (cont)

Why did our monitoring not catch this issue?

$ /usr/lib/nagios/plugins/check_http  -I example.com
HTTP OK: HTTP/1.1 200 OK - 0 bytes in 0.006 second response time

A Fail Storm (cont)

We happened to have dumps on disk - Wireshark Time!

$ ps -ef | grep tcpdump
tcpdump -n -s 9999 -w /bigdisk/dump.out -G 3600

Blame enough to go around.

Takeaways

Takeaways

Logs Lie

Logs Lie - WTF?

Examples:

Oncall

Typical Oncall Week

Alerting

Alerting

Pageworthy?

Pages must

Not pagewothy!

If you say any of these, it should not page.

Alert Criteria

Alert based on

Don't alert based on

SLAs

SLAs (cont)

Example SLO: Measure of uptime

SLAs (cont)

SLOs can be any sort of measure, such as

Waffly SLOs
  • Never down for more than 5 consecutive minutes in a one hour window
  • Less than 0.1% unscheduled downtime
  • Message guaranteed to be delivered no more than 72 hours after being sent

SLAs (cont)

What really is an SLA?

Measure of risk-taking potential!

When SLA violated...

There are many things you can do when SLA is not met

Oncall

Later Oncall Weeks

Alerting

Alerting

Oncall Tools

Oncall Tools

Monitoring Evolution

Monitoring is an ongoing process

Monitoring (cont)

Catch the unknown

Monitoring (cont)

Monitor the right things
define command {
    command_name http_port_grep_header  
    command_line /usr/lib/nagios/plugins/check_http -H '$HOSTADDRESS$'
                -I '$HOSTADDRESS$' -p '$ARG1$' -u '$ARG2$' -N
}
...
define service {
    check_command http_port_grep_header!3142!/acng-report.html
    hostgroup_name apt-cacher-ng
    service_description apt_cacher_ng
    use generic_service_template
}
...
define host {
    host_name vm-103
    address 192.168.253.288
    hostgroups apt-cacher-ng
    ...

Monitoring (cont)

Monitor the right things

Backups

Backups

Backups (cont)

From my pager conversation this am:

Backups (cont)

Pager conversation continued:

So... can you figure out what's wrong?

Testing

Testing

Limit yourself

Restrictions are freeing

Use the tools as designed

Use the tools as designed
git is in C so contributors have a high bar
git becoming svn w/ incrementing revision numbers, single commits only
team wrote a lot of work on new branch, could not push

Don't be clever

Don't be clever

Don't be clever (cont)

What does this do?

#!/bin/bash

set -e
set -u

cd $(dirname $0)
git pull --rebase
cd path/to/dnsstuff
export SSH_AUTH_SOCK=/path/to/ssh-agent

rsync data dns_server:/var/tinydns/data

And why is the new DNS server not getting updates?

Don't be clever (cont)

server$ rsync data new_dns_server:/var/tinydns/data
rsync: mkstemp "/var/tinydns/.tpmuc.lch0gC" failed: Permission denied (13)

new_dns_server$ ls -l /var/tinydns
ls: cannot access /var/tinydns: No such file or directory

Don't be clever (cont)

server$ rsync data new_dns_server:/var/tinydns/data
rsync: mkstemp "/var/tinydns/.tpmuc.lch0gC" failed: Permission denied (13)

new_dns_server$ ls -l /var/tinydns
ls: cannot access /var/tinydns: No such file or directory


server$ rsync data dns_server:/var/tinydns/data
New file uploaded
Merging local changes
Compiling
HAND!

dns_server$ ls -l /var/tinydns
ls: cannot access /var/tinydns: No such file or directory

Don't be clever (cont)

server$ rsync data dns_server:/var/tinydns/data
OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
...
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Offering RSA public key: dns_magic
debug1: Server accepts key: pkalg ssh-rsa blen 279
debug1: Entering interactive session.
debug1: Remote: Forced command.
debug1: Sending environment.
New file uploaded
Merging local changes
Compiling
HAND!
Connection to dns_server closed.
server$ 

Don't be clever (cont)

server$ rsync data dns_server:/var/tinydns/data
OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
...
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Offering RSA public key: dns_magic
debug1: Server accepts key: pkalg ssh-rsa blen 279
debug1: Entering interactive session.
debug1: Remote: Forced command.
debug1: Sending environment.
New file uploaded
Merging local changes
Compiling
HAND!
Connection to dns_server closed.
server$ 

Don't be clever (cont)

$ cat ~/.ssh/authorized_keys
command="/opt/bin/syncw" ssh-rsa AAAAB3NzaC1yc2EAAAAB....

Don't be clever (cont)

$ cat /opt/bin/syncw
#!/bin/sh
rsync --server . /var/dns/upstream_data
/home/bri/bin/makedns

Don't be clever (cont)

$ cat /home/bri/bin/makedns
#!/bin/sh
for dir in /var/dns/tinydns-[0-9]*/root
do
  cd $dir
  make
done

Don't be clever (cont)

$ cat /var/dns/tinydns-[0-9]*/root/Makefile
SRC_DATA=/var/dns/upstream_data
LOCAL_ZONES=data.local

data.cdb: $(LOCAL_ZONES) $(SRC_DATA)  $(DIRS)
    sort -u $(LOCAL_ZONES) $(SRC_DATA) >> data
    /usr/bin/tinydns-data

Don't be clever (cont)

This could have all been written explicitly
$ rsync data dns_server:/var/tinydns/data
$ ssh dns_server 'cd /var/tinydns && /usr/bin/tinydns-data'

Importance of Analogy

The Datacenter Upgrade

Find your own Fails!

Any questions?

Thanks!


LFNW Talk/Evaluation Page: http://lfnw.org/node/3898
Presentation Video on YouTube

PersonalWork

Bri Hatch
Onsight, Inc
bri@ifokr.org

Bri Hatch
ExtraHop Networks
bri@extrahop.com

Copyright 2016, Bri Hatch, Creative Commons BY-NC-SA License