Cross Correlation in R on Time Series Data

Cross-correlation measures how similar two signals are as one of them is shifted in time relative to the other. The discrete formula for cross-correlation is:

(f \star g)(t) = \sum_{x=-\infty}^{\infty} f(x) \, g(x - t)

Here t is the lag applied to the time series. At t = 0 the signals are compared with no lag between them. For each x, the equation multiplies the two function values, and the products are then summed. Wherever f and g both have a large magnitude in the same direction (both positive or both negative), the product is large and positive, which increases the sum. The higher the output, the more correlated the signals are at that lag. The continuous version of the formula replaces the summation with an integral.
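
To make the sum concrete, here is a minimal sketch in R that evaluates it directly for two short vectors. The helper name cross_corr and the toy vectors are made up for illustration, and the sketch assumes that terms whose index falls outside the vectors count as zero.

```r
# Raw (unnormalised) cross-correlation at lag t for two equal-length vectors,
# following the sum over x of f(x) * g(x - t).
# Assumption: terms whose index x - t falls outside 1..n are treated as zero.
cross_corr <- function(f, g, t) {
  n <- length(f)
  x <- seq_len(n)
  keep <- (x - t) >= 1 & (x - t) <= n
  sum(f[x[keep]] * g[x[keep] - t])
}

f <- c(0, 0, 1, 2, 1, 0, 0)   # bump centred on index 4
g <- c(1, 2, 1, 0, 0, 0, 0)   # same bump, two samples earlier

lags <- -3:3
scores <- sapply(lags, function(t) cross_corr(f, g, t))
names(scores) <- lags
print(scores)
```

The largest value appears at t = 2, the shift at which the two bumps line up.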

The ccf function in R performs a cross-correlation. It takes two signals, x and y, and the value it returns at lag k estimates the correlation between x[t+k] and y[t]. In terms of the formula above, x plays the role of f and y the role of g.
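
The sign of the lag is easy to get backwards, so here is a small sketch on simulated data (not the rain or lake data). R's documentation for ccf states that lag k estimates the correlation between x[t+k] and y[t]. By default ccf also removes the means and returns correlations between -1 and 1 rather than the raw sum.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
# y repeats x two steps later, plus a little noise
y <- c(rep(0, 2), x[1:(n - 2)]) + rnorm(n, sd = 0.1)

# plot = FALSE returns the numbers instead of drawing the chart
r <- ccf(x, y, lag.max = 10, plot = FALSE)

# Lag with the highest correlation
peak_lag <- r$lag[which.max(r$acf)]
print(peak_lag)
```

Because y trails x by two steps, the peak lands at lag -2: when x leads y, the interesting lags are the negative ones.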

Case Study

It’s been raining a lot lately and the lakes are getting full. I want to know the lag between when rain falls and when there is a change in the lake water levels.

Getting the Data

Let’s be good data scientists and go get our data first.

Getting Precipitation (fancy word for rain) data

Rain data for the US can be obtained from http://www.ncdc.noaa.gov/cdo-web/. NOAA processes data on demand as orders, so it can take a few minutes to get your data set.

Here are the search parameters used to get the rainfall in 76126.

Screenshot from 2016-05-22 21:27:25

This brings up a neat tool that shows the geographic location.

Screenshot from 2016-05-22 21:37:53

There is an options menu in the top left where you can select the data to download.

Screenshot from 2016-05-22 21:38:01

Once everything is set, you can add the data to your cart and check out. The site asks for an e-mail address, which can later be used to download all your past orders. Each order is processed in a few minutes, and a link is generated to download that dataset.

Screenshot from 2016-05-22 21:26:30

Getting Lake Level Data

This one is a lot easier. Lake data can be downloaded using the web services of http://waterdatafortexas.org/. To get the entire lake history, the following URL pattern will download a CSV file: “http://waterdatafortexas.org/reservoirs/individual/.csv”. If the lake name contains spaces, they must be replaced with underscores.

Load and Explore the Data

R has a built-in time series class, ts. Using this data type makes time series analysis much easier.
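
As a rough sketch of this step, the snippet below builds two ts objects from small inline stand-ins for the downloaded CSV files. The numbers are toy values, and the column names (DATE, PRCP, date, water_level) are assumptions; check the headers of your own files before reusing it.

```r
# Toy stand-ins for the two downloaded files. The column names (DATE, PRCP,
# date, water_level) are assumptions, and the values are made up.
rain_csv <- "DATE,PRCP
2016-05-01,0.00
2016-05-02,1.20
2016-05-03,0.35
2016-05-04,0.00
2016-05-05,0.00"

lake_csv <- "date,water_level
2016-05-01,694.10
2016-05-02,694.15
2016-05-03,694.60
2016-05-04,694.72
2016-05-05,694.70"

# With real files, replace text = ... with the path to each CSV
rain_fall <- read.csv(text = rain_csv, stringsAsFactors = FALSE)
lake_data <- read.csv(text = lake_csv, stringsAsFactors = FALSE)

# Keep only dates present in both data sets, in chronological order
both <- merge(rain_fall, lake_data, by.x = "DATE", by.y = "date")
both <- both[order(as.Date(both$DATE)), ]

# Daily series on a shared index of 1, 2, 3, ...
rain_fall_ts <- ts(both$PRCP)
lake_data_ts <- ts(both$water_level)
print(diff(lake_data_ts))
```

A ts object assumes evenly spaced observations, so any missing days would need to be filled in before this step.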

Cross-Correlation

At first one might try  ccf(rain_fall_ts, lake_data_ts) , but this doesn’t really tell us much.

cff_bad

The dotted lines show the significance level. This plot doesn’t tell us much because the cross-correlation is looking at the lake level itself and not the change in the lake level.

This time I am going to look at the changes in the lake level. This can be done by taking the diff of the time series: ccf(rain_fall_ts, diff(lake_data_ts)).
ccf_good
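
The effect of differencing can be reproduced on simulated data. The sketch below is not the Benbrook data: it assumes a made-up lake whose level rises mostly in response to the previous day's rain and otherwise drains slowly, then compares the raw level with its day-to-day change.

```r
set.seed(1)
n <- 365
# Simulated daily rain: mostly dry days with occasional showers
rain <- rbinom(n, 1, 0.2) * rexp(n)
# Simulated daily change in level: some same-day response, more from the
# previous day's rain, a steady drain, and a little noise
change <- 0.2 * rain + 0.5 * c(0, rain[-n]) - 0.05 + rnorm(n, sd = 0.02)

rain_ts  <- ts(rain)
level_ts <- ts(100 + cumsum(change))

# Raw level versus day-to-day change in level
raw   <- ccf(rain_ts, level_ts, lag.max = 10, plot = FALSE)
delta <- ccf(rain_ts, diff(level_ts), lag.max = 10, plot = FALSE)

# Lag with the strongest correlation once the level is differenced
print(delta$lag[which.max(delta$acf)])
```

The differenced series is one observation shorter than the original. Because both inputs are ts objects, ccf aligns them on their time index before correlating, which is one more reason to use the ts class here.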

Now this is cool. Changes in the lake level correlate significantly with rain that fell on the same day and up to three days earlier, with the highest correlation for rain from the day before.
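
The significant lags can also be read off numerically rather than from the plot. The sketch below uses simulated data again and assumes the default 95% confidence level, for which the dashed lines sit at plus or minus qnorm(0.975) / sqrt(n).

```r
set.seed(7)
n <- 365
rain <- rbinom(n, 1, 0.2) * rexp(n)
# Simulated change in level that responds to rain one day late
change <- 0.5 * c(0, rain[-n]) + rnorm(n, sd = 0.05)

r <- ccf(ts(rain), ts(change), lag.max = 7, plot = FALSE)

# Approximate 95% bounds, the same quantity plot() draws as dashed lines
threshold <- qnorm(0.975) / sqrt(r$n.used)

# Keep only the lags whose correlation falls outside the bounds
result <- data.frame(lag = as.vector(r$lag), acf = as.vector(r$acf))
significant <- result[abs(result$acf) > threshold, ]
print(significant)
```

With 15 lags tested at the 95% level, an occasional lag can cross the line by chance, so a lone bar just outside the bounds is weaker evidence than a run of adjacent ones.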

Summary

It appears that at this location (76126) there is a positive correlation between precipitation and changes in the Benbrook Lake levels.

Data Files 

rain_fall

benbrook