Cross-correlation measures how similar two signals are as one is shifted in time relative to the other. For two real-valued signals f and g, the discrete formula for cross-correlation is:

$$(f \star g)[t] = \sum_{m=-\infty}^{\infty} f[m]\, g[m+t]$$

Where t is the lag value applied to the time series. At t = 0 the signals are compared with no lag between them. Looking at the equation, the two signals are multiplied point by point. If f[m] and g[m+t] both have a high magnitude in the same direction (positive or negative), their product is large and positive, and each product is then summed. The higher the output, the more correlated the signals are at that lag. The continuous version of the formula replaces the summation with an integral.
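To make the sum of products concrete, here is a minimal sketch in R that computes the raw (unnormalized) cross-correlation of two short vectors at a few lags. The helper name `raw_cross_corr` and the toy vectors are assumptions for illustration and are not part of the original analysis; terms that would fall outside the vectors are simply dropped.

```r
# Raw cross-correlation at a single lag: sum over m of f[m] * g[m + t_lag].
# Hypothetical helper for illustration; out-of-range terms are skipped.
raw_cross_corr <- function(f, g, t_lag) {
  m <- seq_along(f)
  keep <- (m + t_lag) >= 1 & (m + t_lag) <= length(g)
  sum(f[m[keep]] * g[m[keep] + t_lag])
}

f <- c(0, 1, 2, 1, 0)
g <- c(0, 0, 1, 2, 1)   # the same bump as f, shifted one step later

lags <- -2:2
scores <- sapply(lags, function(k) raw_cross_corr(f, g, k))
best_lag <- lags[which.max(scores)]
print(data.frame(lag = lags, score = scores))
```

The sum is largest at a lag of 1, which matches the one-step shift between the two toy vectors.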

The `ccf` function in R performs a cross-correlation. It takes two signals, `x` and `y`; in the formula above, `x` corresponds to g and `y` corresponds to f.
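The sign convention of the lags is easy to get backwards, so here is a small sketch on synthetic data (the signals, the seed, and the two-step delay are made up for illustration). Per the R documentation, the lag k value returned by `ccf(x, y)` estimates the correlation between `x[t+k]` and `y[t]`, so when `y` is a delayed copy of `x` the peak shows up at a negative lag.

```r
set.seed(1)
n <- 200
x <- rnorm(n)                                     # leading signal
y <- c(0, 0, x[1:(n - 2)]) + rnorm(n, sd = 0.1)   # x delayed by two steps, plus noise

cc <- ccf(x, y, lag.max = 10, plot = FALSE)
peak_lag <- cc$lag[which.max(cc$acf)]             # lag with the largest correlation
print(peak_lag)
```

Unlike the raw sum in the formula, `ccf` reports correlations by default, so its values are scaled to lie between -1 and 1.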

**Case Study**

It’s been raining a lot lately and the lakes are getting full. I want to know the lag between when rain falls and when there is a change in the lake water levels.

## Getting the Data

Let’s be good data scientists and go get our data first.

### Getting Precipitation (fancy word for rain) data

Rain data for the US can be obtained from http://www.ncdc.noaa.gov/cdo-web/. NOAA processes data requests as on-demand orders, so it can take a few minutes to get your data set.

Here are the search parameters used to get the rainfall in 76126. This brings up a neat tool that shows the geographic location.

There is an options menu in the top left where you can select the data to download.

Once everything is set, you can add the data to your cart and check out. It asks for an e-mail address, which can also be used to download all your past orders. Each order is processed in a few minutes and a link is generated to download that dataset.

### Getting Lake Level Data

This one is a lot easier. Lake data can be downloaded using the web services of http://waterdatafortexas.org/. To get the entire history of a lake, the following URL pattern will download a CSV file: `http://waterdatafortexas.org/reservoirs/individual/<lake_name>.csv`. If the lake name contains spaces, replace them with underscores.
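As a sketch of that naming rule, a small helper can build the download URL from a lake name. The function name `lake_csv_url` is hypothetical, and lower-casing the name is an assumption based on the `benbrook.csv` example rather than something the site documents here.

```r
# Build the CSV URL for a reservoir (hypothetical helper, not part of any package).
# Assumption: names are lower-case in the URL, as in the "benbrook" example.
lake_csv_url <- function(lake_name) {
  slug <- gsub(" ", "_", tolower(trimws(lake_name)), fixed = TRUE)
  paste0("http://waterdatafortexas.org/reservoirs/individual/", slug, ".csv")
}

print(lake_csv_url("Benbrook"))
print(lake_csv_url("Some Lake"))   # made-up name, only to show the underscore rule
```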

## Load and Explore the Data

```r
library(dplyr)      # my go-to for data transformations
library(lubridate)  # fancy date package

# Benbrook lake URL found at http://www.waterdatafortexas.org/reservoirs/download
benbrook_lake_url <- "http://waterdatafortexas.org/reservoirs/individual/benbrook.csv"
# skip the first 42 lines of the file before reading the header row
lake_data <- read.csv(benbrook_lake_url, header = TRUE, skip = 42, stringsAsFactors = FALSE)
lake_data$date <- ymd(lake_data$date)  # convert date field to Date
# keep only 1990 onward because there is no lake level data before 1990
lake_data <- subset(lake_data, year(date) > 1989)

# saved the downloaded NOAA dataset as rain_fall.csv
rain_fall <- read.csv("rain_fall.csv", header = TRUE, stringsAsFactors = FALSE)
rain_fall$DATE <- ymd(rain_fall$DATE)
rain_fall <- rain_fall[, c("DATE", "PRCP")]
names(rain_fall) <- c("date", "prcp")

# keep one row per date; .keep_all = TRUE retains prcp in current dplyr versions
rain_fall <- rain_fall %>%
  distinct(date, .keep_all = TRUE) %>%
  arrange(date)

# create a consistent dataset as some dates are missing in each dataset;
# all.y = TRUE keeps exactly the dates of interest and adds NA rows for missing days
dates_of_interest <- data.frame(date = seq(ymd("1990-01-01"), ymd("2016-05-20"), by = "1 day"))
rain_fall <- merge(rain_fall, dates_of_interest, by = "date", all.y = TRUE)
lake_data <- merge(lake_data, dates_of_interest, by = "date", all.y = TRUE)
```
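The last step, merging each dataset against a complete sequence of dates, is what turns silently missing days into explicit `NA` rows. A minimal base-R sketch on made-up readings shows the effect; the data frame names and values are illustrative only, and `all.y = TRUE` is used here so that only the days in the full sequence are kept.

```r
# Toy rainfall readings with two days missing (values are made up)
rain <- data.frame(
  date = as.Date(c("1990-01-01", "1990-01-02", "1990-01-05")),
  prcp = c(0.0, 0.3, 1.2)
)

# Every day in the period of interest
all_days <- data.frame(
  date = seq(as.Date("1990-01-01"), as.Date("1990-01-05"), by = "1 day")
)

# Keep every row of all_days; days without a reading get NA in prcp
filled <- merge(rain, all_days, by = "date", all.y = TRUE)
print(filled)
```

After the merge there is one row per day, which is what lets the series be treated as evenly spaced later on.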

R has a built-in time series class, `ts`. Using this data type makes time series analysis much easier.

```r
# convert to time series (daily data starting on the first day of 1990)
rain_fall_ts <- ts(rain_fall$prcp, start = c(1990, 1), frequency = 365.25)
lake_data_ts <- ts(lake_data$water_level, start = c(1990, 1), frequency = 365.25)

# poor man's imputation:
# replaces each missing value with the median of the surrounding window
localMedianImputation <- function(x, window = 10) {
  half <- window %/% 2
  for (idx in which(is.na(x))) {
    # clamp the window to the ends of the series to avoid invalid indices
    lo <- max(1, idx - half)
    hi <- min(length(x), idx + half)
    x[idx] <- median(x[lo:hi], na.rm = TRUE)
  }
  x
}

# impute missing data for both datasets
lake_data_ts <- localMedianImputation(lake_data_ts)
rain_fall_ts <- localMedianImputation(rain_fall_ts)
```
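To see what the local median imputation does, here is a sketch on a short made-up vector. The helper `local_median_fill` is a hypothetical stand-in for the function above, written so that the window is clamped at the ends of the vector; gaps are filled in order, so an earlier fill can feed into a later one.

```r
# Fill each NA with the median of its neighbours within +/- window/2 positions.
# Hypothetical helper for illustration; the window is clamped to the vector ends.
local_median_fill <- function(x, window = 10) {
  half <- window %/% 2
  for (idx in which(is.na(x))) {
    lo <- max(1, idx - half)
    hi <- min(length(x), idx + half)
    x[idx] <- median(x[lo:hi], na.rm = TRUE)
  }
  x
}

readings <- c(NA, 2, 3, NA, 5, 6, NA)   # made-up readings with three gaps
filled <- local_median_fill(readings, window = 2)
print(filled)
```

With a window of 2, each gap is filled from its immediate neighbours, so the gap between 3 and 5 becomes 4 and the gaps at either end copy the nearest reading.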

## Cross-Correlation

At first one might try `ccf(rain_fall_ts, lake_data_ts)`, but this doesn’t really tell us much.

The dotted lines show the significance level. This doesn’t tell us much because the cross-correlation is looking at the lake level itself rather than the change in the lake level, so it doesn’t say anything meaningful about how the lake responds to rain.
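For reference, the lines that the default `ccf` plot draws sit at `qnorm(0.975) / sqrt(n)`, roughly ±1.96/√n, which are approximate 95% bounds for the correlation of two unrelated series. A sketch on synthetic data (signals and seed are made up) computes the same threshold by hand and lists the lags that exceed it:

```r
set.seed(7)
n <- 300
x <- rnorm(n)
y <- c(0, 0, x[1:(n - 2)]) + rnorm(n, sd = 0.5)   # x delayed by two steps, plus noise

cc <- ccf(x, y, lag.max = 5, plot = FALSE)

# Same bound the default plot draws: qnorm(0.975) / sqrt(number of observations)
threshold <- qnorm(0.975) / sqrt(cc$n.used)
significant_lags <- cc$lag[abs(cc$acf) > threshold]
print(threshold)
print(significant_lags)
```

Because the bound is a 95% interval, a lag can occasionally cross it by chance, so isolated small spikes deserve less weight than a clear peak.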

This time I am going to look at the changes in the lake level. This can be done by taking the `diff` of the time series: `ccf(rain_fall_ts, diff(lake_data_ts))`.

Now this is cool. There is a significant correlation between the change in the lake level and the rainfall on the same day and up to three days before, with the highest correlation for rain that fell the day before.
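The effect of differencing can be reproduced on synthetic data. The sketch below assumes a toy lake (all numbers are made up) whose level rises by yesterday's rainfall and falls by a constant outflow; it is not a model of Benbrook. Cross-correlating the rain against the differenced level recovers the one-day delay as a peak at a lag of -1.

```r
set.seed(42)
n <- 500
rain <- rexp(n, rate = 2)             # synthetic daily rainfall
inflow <- c(0, rain[-n])              # rain reaches the lake one day later
level <- 100 + cumsum(inflow - 0.5)   # level accumulates inflow minus a constant outflow

rain_ts <- ts(rain)
level_ts <- ts(level)

# diff() turns the level into day-over-day changes
cc_diff <- ccf(rain_ts, diff(level_ts), lag.max = 10, plot = FALSE)
peak_lag <- cc_diff$lag[which.max(cc_diff$acf)]
print(peak_lag)
```

Using `ts` objects matters here: `diff` on a `ts` keeps track of the time index, so `ccf` aligns the shortened series with the rain correctly.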

## Summary

It appears that at this location (76126) there is a positive correlation between precipitation and changes in the Benbrook lake level, and the correlation is strongest for rain that fell the day before.

