Welcome to my Reddit uptime monitor at thenarwhalbaconsatmidnight.com!

This project is one part an adventure in learning new skills and one part chasing after a dream position at Reddit. The impetus for this idea came from picturing myself as a Cloud Engineer for reddit.com - what technologies would I be involved in? What skills would I bring to the table, and how would I approach new and interesting challenges?

As an avid user of Reddit, I'm well aware of the challenges that the site has faced in providing a consistent, quality experience while supporting a rapidly growing user base. I have faced similar challenges in my roles at Alarm.com, particularly when it comes to monitoring and addressing the scalability of core services. One universal challenge our team faced was maintaining good awareness of the health of our system. Oftentimes, knowing that one particular website or service is "up" doesn't tell the full story of how well a feature is being delivered to our user base. Additionally, having a good view of a system's baseline behavior is key to understanding its health, and this baseline can only be seen across different windows of time, not just from a single snapshot.

I thought that it'd be a neat project to apply my experience of defining health metrics to the Reddit website itself. Since this is a small endeavor - limited in resources and time - I figured that I would limit the scope of what I'm testing to a handful of website resources. I would track some basic statistics for each resource, then build a handful of charts to create a rough picture of the current state of Reddit. As a stretch goal, I would explore gleaning additional information about uptime patterns by correlating the data I have.

Basic Design

I began by sketching out a rough design for the monitor: I would need some way to collect data on the Reddit sites, some way to transform and chart the data, and some way to display it publicly.

Next, I made a rough draft of the details that would need to go into each piece of the design. Below are some raw notes:

  • Reddit resources to test
    • www.reddit.com
    • m.reddit.com
    • old.reddit.com
  • Metrics to collect
    • Timestamp of test
    • HTTP code returned
    • Content length (bytes)
    • Duration of request
  • Goal for uptime monitor: dashboard of informative charts
    • Sparkline chart showing all resources over X hours/days
    • Up/degraded/down indicator (one aggregate, one for each resource)
    • Live chart tracking request times (each resource)
    • Chart tracking content length (each resource)
    • Table showing HTTP codes returned in past period
  • Collecting data
    • Simple Python script on cron to collect data
    • How often to run?
      • More frequent = finer-tuned data
      • With greater frequency comes greater risk of being blocked
    • Geographical collection?
      • Outsource this, with Pingdom, perhaps?
    • Where to run?
      • HostGator (this site)?
      • Locally? (Synology NAS)
  • Charting data
    • Timestamp of test
    • HTTP code returned
    • Content length (bytes)
    • Duration of request
  • Publishing work
    • Sparkline chart showing all resources over X hours/days
    • Up/degraded/down indicator (one aggregate, one for each resource)
    • Live chart tracking request times (each resource)
    • Chart tracking content length (each resource)
    • Table showing HTTP codes returned in past period
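
As one illustration of the notes above, the "table showing HTTP codes returned in past period" could be derived from the collected samples with a simple tally. This is only a sketch: the field names (`url`, `status_code`, `event_time`), the 24-hour default period, and the sample data are all assumptions made for the example, and the same table could just as well be built inside a charting platform.

```python
from collections import Counter
from datetime import datetime, timedelta


def status_code_table(samples, now, period=timedelta(hours=24)):
    """Count HTTP status codes per URL for samples taken within `period` of `now`."""
    cutoff = now - period
    table = {}
    for sample in samples:
        taken_at = datetime.fromisoformat(sample["event_time"])
        if cutoff <= taken_at <= now:
            table.setdefault(sample["url"], Counter())[sample["status_code"]] += 1
    return table


# Hypothetical samples, invented for this example
samples = [
    {"url": "www.reddit.com", "status_code": 200, "event_time": "2023-05-01T12:00:00"},
    {"url": "www.reddit.com", "status_code": 503, "event_time": "2023-05-01T12:05:00"},
    {"url": "old.reddit.com", "status_code": 200, "event_time": "2023-05-01T12:05:00"},
    {"url": "old.reddit.com", "status_code": 200, "event_time": "2023-04-29T12:00:00"},
]

table = status_code_table(samples, now=datetime(2023, 5, 1, 13, 0, 0))
for url, counts in sorted(table.items()):
    print(url, dict(counts))
```

The last sample is older than the 24-hour period, so it is left out of the tally.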

Step 1: Testing the websites

To be able to track the uptime of various Reddit sites, I would need a way to continuously poll the endpoints, then send the metrics to a data collector. Verifying the availability of each site could be done with a simple Python script. Below is the relevant code snippet:

        
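            import datetime

            import requests


            def poll(url):
                """Request one site over HTTPS and return the metrics for that check."""
                fullurl = "https://" + url
                # The 10-second timeout is an assumed value; without any timeout,
                # a hung connection would stall the whole run.
                r = requests.get(fullurl, timeout=10)

                ret = {}
                ret['url'] = url
                ret['status_code'] = r.status_code
                ret['content_length'] = len(r.content)  # size of the decoded body in bytes
                # elapsed runs from sending the request to parsing the response headers
                ret['request_duration'] = r.elapsed.total_seconds()
                ret['event_time'] = datetime.datetime.today().isoformat()  # local time

                return ret


            # Poll each site once and gather the results in a list
            urls = ["www.reddit.com", "m.reddit.com", "old.reddit.com"]
            polled_data = list(map(poll, urls))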
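
One gap worth noting for an uptime monitor: if a site is unreachable, the HTTP request raises an exception instead of returning a status code, so the outage itself would never be recorded. A minimal sketch of one way to handle this is to wrap the polling function so that a failure becomes a data point. The wrapper name `poll_safely`, the `error` field, and the choice to catch every `Exception` by default are assumptions for illustration, not part of the original script.

```python
import datetime


def poll_safely(poll_fn, url, errors=(Exception,)):
    """Call poll_fn(url); on failure, return a record marking the site unreachable."""
    try:
        return poll_fn(url)
    except errors as exc:
        return {
            "url": url,
            "status_code": None,       # no HTTP response was received
            "content_length": 0,
            "request_duration": None,
            "event_time": datetime.datetime.today().isoformat(),
            "error": type(exc).__name__,
        }


# Stand-in for a poll function whose request fails (simulated, no network needed)
def always_fails(url):
    raise ConnectionError("simulated outage")


record = poll_safely(always_fails, "www.reddit.com")
print(record["status_code"], record["error"])
```

In practice the `errors` argument could be narrowed to the HTTP library's own exception types, so that programming errors still surface.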
        
        
Step 2: Collecting the data

Next, I wanted to push the site uptime metrics to a data aggregator that I could use to analyze and chart the information. The platform I'm most familiar with is Wavefront (now VMware Aria Operations for Applications). Wavefront provides an ingestion point for a variety of event data streams, as well as a customizable set of charts and dashboards that can be used to visualize the event data in real time.

I could leverage my familiarity with Wavefront to build a demonstration, but since this was an application geared towards Reddit, I was curious about which metric analytics platform the Reddit infrastructure itself uses. I stumbled upon this blog post, https://www.redditinc.com/blog/scaling-reporting-at-reddit/, which led me to take a look at Imply Polaris. After reading through the post and Imply's documentation, I concluded that I could use Imply to collect and chart the data from my monitor script - it would be an interesting challenge to apply my previous experience with metric analysis in Wavefront to this new platform!

Imply offers a 30-day trial of their service. I signed up for the trial and began configuring my account to accept push events from my script.
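
A minimal sketch of what the push side of the script might look like is shown here. It does not reproduce Imply's actual API: the endpoint URL, the authorization header value, and the assumption that the ingestion endpoint accepts newline-delimited JSON over HTTP POST are all placeholders that would need to be checked against Imply's documentation. The function names `to_ndjson` and `push_events` are hypothetical.

```python
import json
import urllib.request


def to_ndjson(records):
    """Serialize a list of dicts as newline-delimited JSON, one event per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


def push_events(endpoint, auth_header, records, timeout=10):
    """POST the records to `endpoint`; URL and auth format are placeholders."""
    request = urllib.request.Request(
        endpoint,
        data=to_ndjson(records).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": auth_header,  # use whatever scheme the platform documents
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical record in the same shape as the polling script's output
records = [
    {"url": "www.reddit.com", "status_code": 200, "content_length": 1024,
     "request_duration": 0.42, "event_time": "2023-05-01T12:00:00"},
]
payload = to_ndjson(records)
print(payload)
```

Only `to_ndjson` runs in this example; `push_events` is left uncalled because it needs a real endpoint and credentials.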

Step 3: Analyzing the data

Now that I had data from the site monitor flowing into an aggregator, I could create the charts I wanted to visualize the health of the Reddit websites I was checking. I created a dashboard to house the charts:

The top section displays the current status of all sites. Each subsequent section will contain more detailed information about each site being monitored.
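
The logic behind a current-status indicator can be sketched independently of the charting platform. The rule below is an assumption made for illustration, not the rule used on the dashboard: a check counts as "down" when there is no response or a 5xx code, "degraded" on a 4xx code or a response slower than a hypothetical 2-second threshold, and "up" otherwise. The aggregate status is simply the worst status among the sites.

```python
SEVERITY = {"up": 0, "degraded": 1, "down": 2}


def classify(sample, slow_after=2.0):
    """Label one check as 'up', 'degraded', or 'down' (thresholds are assumptions)."""
    code = sample.get("status_code")
    duration = sample.get("request_duration") or 0.0
    if code is None or code >= 500:
        return "down"
    if code >= 400 or duration > slow_after:
        return "degraded"
    return "up"


def aggregate(statuses):
    """The overall status is the worst status among the individual sites."""
    return max(statuses, key=SEVERITY.__getitem__)


# Hypothetical latest sample per site
latest = {
    "www.reddit.com": {"status_code": 200, "request_duration": 0.4},
    "m.reddit.com": {"status_code": 200, "request_duration": 3.1},
    "old.reddit.com": {"status_code": 503, "request_duration": 0.2},
}
per_site = {url: classify(sample) for url, sample in latest.items()}
overall = aggregate(per_site.values())
print(per_site, overall)
```

With these invented samples, the three sites come out as up, degraded, and down respectively, so the aggregate indicator reads "down".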

Step 4: Future Work

This endeavor is truly a work in progress. I am currently spending time fleshing out the remaining charts and cleaning up the look of both the visualization and this site. Please check back often!

In addition to the above, there are several directions I would like to go in to make a more complete demonstration of site monitoring:

  • Deriving info
    • Chart to compare current metric performance against past performance
    • I.e., show current + last week + last 2 weeks, etc.
  • Letting the data soak
    • Seeing what patterns develop over the course of a week
    • I.e., what are daily trends? Weekly trends?
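
The comparison against past performance could be prototyped along these lines: take the average request duration over the most recent window, then compute the same average for the matching window one and two weeks earlier. This is a sketch under assumptions; the one-hour window, the field names, and the sample data are invented for the example.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_duration(samples, start, end):
    """Average request duration for samples with start <= event_time < end, or None."""
    durations = [
        s["request_duration"]
        for s in samples
        if start <= datetime.fromisoformat(s["event_time"]) < end
    ]
    return mean(durations) if durations else None


def compare_with_past(samples, now, window=timedelta(hours=1), weeks_back=(0, 1, 2)):
    """Return {weeks_ago: mean duration} for the same window in each earlier week."""
    result = {}
    for weeks in weeks_back:
        end = now - timedelta(weeks=weeks)
        result[weeks] = mean_duration(samples, end - window, end)
    return result


# Hypothetical samples: two from the current hour, one from a week earlier
samples = [
    {"event_time": "2023-05-15T11:30:00", "request_duration": 0.5},
    {"event_time": "2023-05-15T11:45:00", "request_duration": 1.0},
    {"event_time": "2023-05-08T11:30:00", "request_duration": 0.25},
]
comparison = compare_with_past(samples, now=datetime(2023, 5, 15, 12, 0, 0))
print(comparison)
```

A window with no samples yields `None` rather than zero, so a gap in collection is not mistaken for a fast response.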