Production Trenches: Pitfalls and Pratfalls

Personal
Bri Hatch
Onsight, Inc
bri@ifokr.org

Work
Bri Hatch
ExtraHop Networks
bri@extrahop.com

Copyright 2015, Bri Hatch, Creative Commons BY-NC-SA License

Audience

Who should be here?

Background

Who's this Bri guy?

Importance of Analogy

The Datacenter Upgrade

A Fail Storm

"Apache was returning blank pages Sunday morning starting at 5:23 - what was wrong?"

A Fail Storm (cont)

Looking at the logs
06/Nov 05:23:59 "GET /index.html HTTP/1.1" 200 42331 ""
06/Nov 05:24:03 "GET /thing1/ HTTP/1.1" 200 76442 "http://example.com/index.html"
06/Nov 05:24:12 "GET /thing2/ HTTP/1.1" 200 65232 "http://example.com/thing1/"
....
Looking at logs
Realized - guy is in Eastern, we're in Central

A Fail Storm (cont)

Wait - these look the same!
Looking at the right logs
06/Nov 04:23:59 "GET /stuff/ HTTP/1.1" 200 61472 "http://example.com/"
06/Nov 04:24:03 "GET /thing0/ HTTP/1.1" 200 86442 "http://example.com/index.html"
06/Nov 04:24:12 "GET /about/index.html HTTP/1.1" 200 57774 ""
....
Realize the local clock is also wrong!

A Fail Storm (cont)

Looking at the right logs - really this time!
Still everything looks good... :-(
06/Nov 02:00:01 "GET /thing0/ HTTP/1.1" 200 55424 "http://www.google.com"
06/Nov 02:00:03 "GET /search/ HTTP/1.1" 200 92186 "http://example.com/about/index.html"
06/Nov 02:00:42 "GET /thing3/ HTTP/1.1" 200 78505 "http://example.com/thing1/"
....

A Fail Storm (cont)

Looking at the right logs - fourth time's the charm!
Whoops - this was the DST change
06/Nov 02:59:58 "GET /cart/ HTTP/1.1" 200 42331 "http://example.com/toolbox/"
06/Nov 02:00:03 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/cart/"
06/Nov 02:00:03 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/checkout/"
06/Nov 02:00:05 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/checkout/"
06/Nov 02:00:22 "GET /checkout/ HTTP/1.1" 200 0 "http://example.com/checkout/"
....
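
The backwards jump from 02:59:58 to 02:00:03 is easy to miss by eye. A minimal sketch of a scan for such jumps, assuming the "DD/Mon HH:MM:SS ..." layout of the excerpts above (the function name find_time_jumps is made up for illustration):

```shell
#!/bin/sh
# find_time_jumps: read log lines on stdin and report every line whose
# time of day is earlier than the previous line on the same date.
# Assumes the "DD/Mon HH:MM:SS ..." layout used in the excerpts above.
find_time_jumps() {
    awk '
    {
        split($2, t, ":")
        secs = t[1] * 3600 + t[2] * 60 + t[3]
        if ($1 == prev_date && secs < prev_secs)
            printf "line %d: clock went backwards (%s -> %s)\n", NR, prev_time, $2
        prev_date = $1
        prev_secs = secs
        prev_time = $2
    }'
}
```

Feed it a log on stdin, e.g. find_time_jumps < access.log (hypothetical file name). It only compares neighbouring lines, so it flags a clock change but says nothing about which of the two hours a request belongs to.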

A Fail Storm (cont)

Why did our monitoring not catch this issue?
$ /usr/lib/nagios/plugins/check_http  -I example.com
HTTP OK: HTTP/1.1 200 OK - 0 bytes in 0.006 second response time
Serving errors is really fast!
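
A status-only check passes on a 200 with an empty body. check_http has options for this (-s for an expected string, -m for page size). The sketch below shows the same idea as a standalone function; check_body and the 1000-byte threshold in the usage note are assumptions for illustration, not part of any plugin.

```shell
#!/bin/sh
# check_body STATUS BYTES MIN_BYTES
# Treat a 200 response with a suspiciously small body as a failure.
# Exit codes follow the Nagios plugin convention: 0 = OK, 2 = CRITICAL.
check_body() {
    status=$1
    bytes=$2
    min_bytes=$3
    if [ "$status" -ne 200 ]; then
        echo "HTTP CRITICAL: status $status"
        return 2
    fi
    if [ "$bytes" -lt "$min_bytes" ]; then
        echo "HTTP CRITICAL: 200 OK but only $bytes bytes (expected at least $min_bytes)"
        return 2
    fi
    echo "HTTP OK: status $status, $bytes bytes"
    return 0
}
```

With the values from the check above, check_body 200 0 1000 reports CRITICAL and returns 2, while check_body 200 42331 1000 returns 0.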

A Fail Storm (cont)

Happened to have dumps on disk
$ ps -ef | grep tcpdump
tcpdump -n -s 9999 -w /bigdisk/dump.out -G 3600
Wireshark time!

Takeaways

Logs Lie

Logs Lie - WTF?
Handoff to kernel
Negotiated and remote dropped

Monitoring

Made very smart checks with WWW::Mechanize
Every push required new logic
Slowed down bip checks

Monitoring (cont)

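 check_http -H <vhost> | -I <IP-address> [-u <uri>] [-p <port>]
       [-J <client certificate file>] [-K <private key>]
       [-w <warn time>] [-c <critical time>] [-t <timeout>] [-L] [-E] [-a auth]
       [-b proxy_auth] [-f <ok|warning|critical|follow|sticky|stickyport>]
       [-e <expect>] [-d string] [-s string] [-l] [-r <regex> | -R <case-insensitive regex>]
       [-P string] [-m <min_pg_size>:<max_pg_size>] [-4|-6] [-N] [-M <age>]
       [-A string] [-k string] [-S <version>] [--sni] [-C <warn_age>[,<crit_age>]]
       [-T <content-type>] [-j method]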

Monitoring (cont)

Catch the known

Monitoring (cont)

Catch the unknown

Alerting

Alert based on

SLAs

SLA: Measure of uptime
9 8s
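
An uptime percentage is easier to reason about as a downtime budget. A rough sketch of the arithmetic, assuming a 365-day year (the function name downtime_minutes_per_year is made up for illustration):

```shell
#!/bin/sh
# downtime_minutes_per_year PERCENT
# Print the minutes of downtime per year that an availability target allows.
# Assumes a 365-day year: 365 * 24 * 60 = 525600 minutes.
downtime_minutes_per_year() {
    awk -v pct="$1" 'BEGIN { printf "%.1f\n", (100 - pct) / 100 * 365 * 24 * 60 }'
}
```

For example, downtime_minutes_per_year 99.9 prints 525.6 (under nine hours a year), and downtime_minutes_per_year 99.999 prints 5.3.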

SLAs (cont)

What really is an SLA?

Limit yourself

Restrictions are freeing
esxi, hyperv, kvm, virtualbox, aws, azure, openstack
gerrit, stash, gitolite

Use the tools as designed

git is written in C, so contributors face a high bar
git turned into svn, with incrementing revision numbers and single commits only
team did a lot of work on a new branch, then could not push

Don't be clever

Don't be clever (cont)

What does this do?

rsync data dns_server:/var/tinydns/data

Don't be clever (cont)

$ cat ~/.ssh/authorized_keys
command="/opt/bin/syncw" ssh-rsa AAAAB3NzaC1yc2EAAAAB....

Don't be clever (cont)

$ cat /opt/bin/syncw
#!/bin/sh
rsync --server . /var/dns/upstream_data
/home/bri/bin/makedns

Don't be clever (cont)

$ cat /home/bri/bin/makedns
#!/bin/sh
for dir in /var/dns/tinydns-[0-9]*/root
do
  cd $dir
  make
done

Don't be clever (cont)

$ cat /var/dns/tinydns-[0-9]*/root/Makefile
SRC_DATA=/var/dns/upstream_data
LOCAL_ZONES=data.local

data.cdb: $(LOCAL_ZONES) $(SRC_DATA)  $(DIRS)
    sort -u $(LOCAL_ZONES) $(SRC_DATA) >> data
    /usr/bin/tinydns-data

Don't be clever (cont)

This could have all been written as:
rsync data dns_server:/var/tinydns/data
ssh dns_server 'cd /var/tinydns/data && /usr/bin/tinydns-data'

Find your own Fails!

Any questions?

Thanks!

Presentation: http://www.ifokr.org/bri/presentations/seagl-2015-production-trenches/
