Written by: Ed Orlando, Data Scientist
Project Description
Anomaly detection is valuable in many contexts. For example, anomalies can reveal fraud, surface problems in system health monitoring, or help data engineers identify spikes in website traffic. Anomaly detection can also be used to remove extreme outliers from datasets before modeling.
In this project, webpage traffic outliers were identified using data from the 2016 calendar year. The project focused on popular United States sports, entertainment, and political events and people.
The anomalize package was applied to identify the outliers.
Interactive Tableau Public Viz
An interactive Tableau viz was built from the final output of this project. It can be viewed by clicking the link below.
Load Libraries
To get started, install/load the libraries listed below.
# 1.0 LIBRARIES ----
library(vroom)      # fast reading of delimited files
library(tidyverse)  # data manipulation & plotting
library(tidyquant)  # ggplot themes & time-series helpers
library(lubridate)  # date handling
library(rsample)    # resampling utilities
library(anomalize)  # anomaly detection
library(fuzzyjoin)  # fuzzy-matching joins
library(readxl)     # reading Excel files
Load Data
The original raw web data file used in the project can be downloaded here. Because the original file was very large (271 MB), it was trimmed down in a prior project I have offline. I also created a couple of lookup tables that add valuable features later in the pipeline.
To follow along, click here to download the sample files.
# 2.0 LOAD DATA ----
websites_sample_tbl <- vroom::vroom("Data_Sources/2020_06_15_Anomalize/websites_sample_tbl.csv", delim = ",")
page_summary_lkp_tbl <- read_xlsx("Data_Sources/2020_06_15_Anomalize/Topics_Dates_LKP.xlsx", sheet = "Sheet1")
topics_dates_lkp_tbl <- read_xlsx("Data_Sources/2020_06_15_Anomalize/Topics_Dates_LKP.xlsx", sheet = "Sheet2")
Subsets of the three tibbles are shown below. Together they contain the website traffic visits, the topic summaries, details about the outlier events, and the sources and links for those details.
head(websites_sample_tbl)
## # A tibble: 6 x 4
## ...1 date visits Page_Summary
## <dbl> <date> <dbl> <chr>
## 1 1 2015-10-01 683 1999_
## 2 2 2015-10-02 836 1999_
## 3 3 2015-10-03 742 1999_
## 4 4 2015-10-04 934 1999_
## 5 5 2015-10-05 778 1999_
## 6 6 2015-10-06 824 1999_
head(page_summary_lkp_tbl)
## # A tibble: 6 x 3
## Category Page_Summary Page_Summary_Formatted
## <chr> <chr> <chr>
## 1 Music & Entertainment The_Force_Awakens The Force Awakens
## 2 Music & Entertainment Leonardo_DiCaprio Leonardo DiCaprio
## 3 Music & Entertainment Matt_Damon Matt Damon
## 4 Music & Entertainment Brie_Larson Brie Larson
## 5 Music & Entertainment Room_ Room
## 6 Music & Entertainment Joy_ Joy
head(topics_dates_lkp_tbl)
## # A tibble: 6 x 4
## Page_Summary date Description Source
## <chr> <dttm> <chr> <chr>
## 1 1999_ 2016-04-21 00:00:00 On Apr 21, 2016, music le~ https://en.wikipe~
## 2 1999_ 2016-04-22 00:00:00 On Apr 21, 2016, music le~ https://en.wikipe~
## 3 1999_ 2016-04-23 00:00:00 On Apr 21, 2016, music le~ https://en.wikipe~
## 4 1999_ 2016-04-24 00:00:00 On Apr 21, 2016, music le~ https://en.wikipe~
## 5 1999_ 2016-04-25 00:00:00 On Apr 21, 2016, music le~ https://en.wikipe~
## 6 1999_ 2016-04-26 00:00:00 On Apr 21, 2016, music le~ https://en.wikipe~
Finding Anomalies
A few handy functions from the anomalize package were used in the analysis. The official descriptions from the site above are listed below.
- time_decompose(): separates the time series into seasonal, trend, and remainder components.
- anomalize(): applies anomaly detection methods to the remainder component.
- recomposed_l1 & recomposed_l2: not anomalize functions, but columns added in the code below to calculate the limits that separate the “normal” data from the anomalies.
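To build intuition for the IQR method that anomalize() applies to the remainder, here is a hand-rolled sketch in base R. This is illustrative only, not the package's exact implementation: the toy remainder values and the plain 1.5x IQR multiplier are assumptions (anomalize derives its multiplier from the alpha argument instead).

```r
# Toy remainder series with one obvious outlier (illustrative values).
remainder <- c(-5, 2, -1, 3, 0, 48, -2, 1, -3, 2)

# Quartiles and interquartile range of the remainder.
q   <- quantile(remainder, c(0.25, 0.75))
iqr <- q[2] - q[1]

# A plain 1.5x multiplier stands in for anomalize's alpha-based scaling.
remainder_l1 <- unname(q[1] - 1.5 * iqr)  # lower limit
remainder_l2 <- unname(q[2] + 1.5 * iqr)  # upper limit

# Points outside the limits are flagged as anomalies.
anomaly <- ifelse(remainder < remainder_l1 | remainder > remainder_l2,
                  "Yes", "No")
anomaly
```

Only the sixth value (48) falls outside the limits, so it is the lone "Yes". Lowering alpha in anomalize() widens the band and flags fewer points; raising it does the opposite.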
# 3.0 ANOMALY DETECTION ----
websites_anomalies_tbl <- websites_sample_tbl %>%
  group_by(Page_Summary) %>%
  filter(!is.na(date)) %>%                                  # drop rows with missing dates
  time_decompose(visits, method = "stl") %>%                # split into season, trend, remainder
  anomalize(remainder, method = "iqr", alpha = 0.014) %>%   # flag remainder outliers via IQR
  mutate(recomposed_l1 = season + trend + remainder_l1) %>% # lower "normal" limit
  mutate(recomposed_l2 = season + trend + remainder_l2)     # upper "normal" limit
websites_anomalies_tbl %>% glimpse()
## Rows: 16,488
## Columns: 11
## Groups: Page_Summary [36]
## $ Page_Summary <chr> "1999_", "1999_", "1999_", "1999_", "1999_", "1999_",...
## $ date <date> 2015-10-01, 2015-10-02, 2015-10-03, 2015-10-04, 2015...
## $ observed <dbl> 683, 836, 742, 934, 778, 824, 892, 803, 884, 901, 923...
## $ season <dbl> -11.146825, 4.998995, 28.794452, 41.643661, -11.54451...
## $ trend <dbl> 816.7495, 815.6226, 814.4956, 813.3686, 812.2417, 811...
## $ remainder <dbl> -122.6027225, 15.3784293, -101.2900551, 78.9877085, -...
## $ remainder_l1 <dbl> -1940.961, -1940.961, -1940.961, -1940.961, -1940.961...
## $ remainder_l2 <dbl> 1979.766, 1979.766, 1979.766, 1979.766, 1979.766, 197...
## $ anomaly <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",...
## $ recomposed_l1 <dbl> -1135.359, -1120.340, -1097.671, -1085.949, -1140.264...
## $ recomposed_l2 <dbl> 2785.368, 2800.387, 2823.056, 2834.778, 2780.463, 276...
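The recomposed limits are simply the seasonal and trend components added back onto the remainder limits, so an observation is anomalous exactly when it falls outside [recomposed_l1, recomposed_l2]. A minimal base-R check of that equivalence, using made-up toy numbers rather than the project's data:

```r
# Toy decomposition values for three observations (illustrative numbers).
observed  <- c(683, 3100, 778)
season    <- c(-11.1, 5.0, -11.5)
trend     <- c(816.7, 815.6, 812.2)
remainder <- observed - season - trend

# Fixed remainder limits for the whole group, as anomalize() would produce.
remainder_l1 <- -1940.0
remainder_l2 <-  1979.0

# Recomposing shifts the band around the season + trend baseline.
recomposed_l1 <- season + trend + remainder_l1
recomposed_l2 <- season + trend + remainder_l2

# Flagging on the remainder and flagging on the observed value agree.
flag_remainder <- remainder < remainder_l1 | remainder > remainder_l2
flag_observed  <- observed < recomposed_l1 | observed > recomposed_l2
identical(flag_remainder, flag_observed)
```

This is why the recomposed bands can be drawn directly on the observed series: anything poking outside them is exactly what anomalize() flagged.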
Finally, an example of the anomalies viewed in R is shown below. The anomalize package includes a plot_anomalies() function that makes the outliers easy to spot.
In 2016, the Chicago Cubs made the playoffs and eventually won the World Series. The popularity of that event is apparent in the visual below.
websites_anomalies_tbl %>%
filter(Page_Summary == "Chicago_Cubs") %>%
filter(date >= as.Date("2016-01-01")) %>%
plot_anomalies(ncol = 2, time_recomposed = TRUE)

To view the fully interactive Tableau Public viz with all 2016 popular topics listed, please click here.
Final Tidy Tibble With Proper Titles & Descriptions
Finally, we can add the web page Category, the properly formatted title, the description of the outlier event, and the source. This tibble is exported and loaded into the Tableau Public viz.
websites_tidy_tbl <- websites_anomalies_tbl %>%
  left_join(page_summary_lkp_tbl, by = "Page_Summary") %>%
  left_join(topics_dates_lkp_tbl, by = c("Page_Summary", "date"))
websites_tidy_tbl %>% glimpse()
## Rows: 16,490
## Columns: 15
## Groups: Page_Summary [36]
## $ Page_Summary <chr> "1999_", "1999_", "1999_", "1999_", "1999_",...
## $ date <dttm> 2015-10-01, 2015-10-02, 2015-10-03, 2015-10...
## $ observed <dbl> 683, 836, 742, 934, 778, 824, 892, 803, 884,...
## $ season <dbl> -11.146825, 4.998995, 28.794452, 41.643661, ...
## $ trend <dbl> 816.7495, 815.6226, 814.4956, 813.3686, 812....
## $ remainder <dbl> -122.6027225, 15.3784293, -101.2900551, 78.9...
## $ remainder_l1 <dbl> -1940.961, -1940.961, -1940.961, -1940.961, ...
## $ remainder_l2 <dbl> 1979.766, 1979.766, 1979.766, 1979.766, 1979...
## $ anomaly <chr> "No", "No", "No", "No", "No", "No", "No", "N...
## $ recomposed_l1 <dbl> -1135.359, -1120.340, -1097.671, -1085.949, ...
## $ recomposed_l2 <dbl> 2785.368, 2800.387, 2823.056, 2834.778, 2780...
## $ Category <chr> "Music & Entertainment", "Music & Entertainm...
## $ Page_Summary_Formatted <chr> "Prince", "Prince", "Prince", "Prince", "Pri...
## $ Description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ Source <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
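The lookup joins behave like a base-R merge with all.x = TRUE: every traffic row is kept, and rows with no matching outlier date get NA in Description and Source, which is why most rows in the glimpse above show NA. A small sketch with made-up rows (the values below are illustrative, not from the project's tables):

```r
# Made-up traffic rows and a sparse outlier lookup (illustrative only).
traffic <- data.frame(
  Page_Summary = c("1999_", "1999_", "Chicago_Cubs"),
  date = as.Date(c("2016-04-20", "2016-04-21", "2016-11-02"))
)
lookup <- data.frame(
  Page_Summary = "1999_",
  date = as.Date("2016-04-21"),
  Description = "Outlier event description"
)

# Base-R equivalent of left_join(): keep every traffic row,
# filling NA where the lookup has no matching key.
joined <- merge(traffic, lookup,
                by = c("Page_Summary", "date"), all.x = TRUE)
joined$Description
```

Only the one matched row carries a Description; the other two come through as NA, exactly as in the final tidy tibble.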
For questions related to this analysis, please message me on LinkedIn.
For access to more of my articles, please check out my blog.