The case, sometimes called a pair, is a response variable paired with its predictor variables.
First fit a model of the form log(surv) = β₀ + β₁ · dose + ε, where β₀ and β₁ are the regression coefficients and ε is the error term estimated by the residuals.
library(boot)
data(survival)  # load survival dataset (from the boot package) into memory

# fit linear regression model to a bootstrap sample (y*, x*)
fit.boot <- function(data, i){
  fit_b <- lm(log(surv) ~ dose, data[i, ])
  coef(fit_b)
}

# bootstrap fit.boot R times
surv.boot <- boot(survival, fit.boot, R = 1000, ncpus = 4, parallel = "snow")
surv.boot
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = survival, statistic = fit.boot, R = 1000, parallel = "snow",
    ncpus = 4)

Bootstrap Statistics :
      original          bias    std. error
t1*  3.82364794  0.1074865998  0.73254328
t2* -0.00591454 -0.0002720479  0.00157819
# fit linear regression model to a bootstrap sample of residuals
fit.boot <- function(res, i, x){
  res_b <- res[i]  # residual bootstrap
  fit_b <- lm((log(surv) + res_b) ~ dose, x)
  coef(fit_b)
}

# bootstrap fit.boot R times
fit <- lm(log(surv) ~ dose, survival)
surv.boot <- boot(fit$residuals, fit.boot, R = 1000,
                  ncpus = 4, parallel = "snow", x = survival)
surv.boot
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = fit$residuals, statistic = fit.boot, R = 1000, x = survival,
    parallel = "snow", ncpus = 4)

Bootstrap Statistics :
      original          bias    std. error
t1*  3.82364794 -5.742551e-03 0.780623113
t2* -0.00591454  9.932477e-06 0.001006054
I didn’t set a seed, so other runs may give slightly different values due to random sampling.
t1* and t2* are the bootstrap estimates for the intercept and the dose coefficient, respectively.
These methods are useful when there are few samples. The case bootstrap is robust to outliers because some bootstrap samples will contain an outlier and others will not, which captures the uncertainty due to the outliers.
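As a quick illustration of what can be done with the replicates (my addition, not from the original analysis; run serially with a seed for reproducibility), the sketch below turns case-bootstrap replicates into a percentile confidence interval using boot.ci:

```r
library(boot)
data(survival)  # survival dataset from the boot package

# same case-bootstrap statistic as above
fit.boot <- function(data, i) coef(lm(log(surv) ~ dose, data[i, ]))

set.seed(1)
surv.boot <- boot(survival, fit.boot, R = 1000)

# percentile confidence interval for the dose coefficient (index = 2)
boot.ci(surv.boot, type = "perc", index = 2)
```

Other interval types ("basic", "bca", "norm") are available through the type argument.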
In R base graphics there are two ways to overlay a theoretical density curve on a histogram.
y <- rnorm(100)
hist(y, prob = TRUE)
x <- seq(min(y), max(y), by = 0.01)
lines(x, dnorm(x), col = "purple")
There is a quicker way using the curve function. The expression passed to curve must contain a variable x, and curve evaluates it over the x-range of the existing plot.
y <- rnorm(100)
hist(y, prob = TRUE)
curve(dnorm(x), col = "purple", add = TRUE)
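A related option, not covered above, is overlaying a kernel density estimate instead of a theoretical density; base R's density function makes this a one-liner:

```r
set.seed(1)
y <- rnorm(100)
hist(y, prob = TRUE)
# kernel density estimate of y; makes no normality assumption
lines(density(y), col = "forestgreen")
```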
I have seen statistics defined as either the science of uncertainty or the science of extracting information from data. I found that people from a mathematical background tend to gravitate toward the science of uncertainty, while those with an engineering background lean toward extracting information from data.
The Merriam-Webster online dictionary gives these definitions:
a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data
a collection of quantitative data
“Statistics.” Merriam-Webster.com. Merriam-Webster, n.d. Web. 26 Jan. 2017.
I am not saying any of these are right or wrong, but I did find it interesting nonetheless. I personally think a rose by any other name would smell just as sweet.
Cheers!
The discrete cross-correlation of two signals f and g is (f ⋆ g)[t] = Σ_n f[n] · g[n + t], where t is the lag applied to the time series. At t = 0 the signals are compared with no lag between them. Looking at the equation, the two functions' outputs are multiplied point by point: if f and g both have a high magnitude in the same direction (positive or negative), each product is large, and the products are then summed. The higher the output, the more "correlated" the signals are. The continuous version of the formula replaces the summation with an integral.
The ccf function in R performs cross-correlation. It takes two signals, x and y, where x corresponds to g and y corresponds to f in the formula above.
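As a sanity check of the lag convention (my own toy example, not from the original post), a signal delayed by three samples should produce a cross-correlation peak at lag -3:

```r
# Toy check of the ccf lag convention: y is x delayed by 3 samples,
# so ccf(x, y) should peak at lag -3 (x leads y)
set.seed(7)
x <- rnorm(200)
y <- c(rep(0, 3), head(x, -3)) + rnorm(200, sd = 0.1)
cc <- ccf(x, y, lag.max = 10, plot = FALSE)
cc$lag[which.max(cc$acf)]  # -3
```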
It’s been raining a lot lately and the lakes are getting full. I want to know the lag between when rain falls and when there is a change in the lake water levels.
Let’s be good data scientists and go get our data first.
Rain data for the US can be obtained from http://www.ncdc.noaa.gov/cdo-web/. Data is processed on demand in orders from NOAA, so it can take a few minutes to get your dataset.
Here are the search parameters used to get the rainfall in 76126. This brings up a neat tool that shows the geographic location.
There is an options menu in the top left where you can select the data to download.
Once everything is set you can add the data to your cart and check out. It asks for an e-mail address, which can later be used to download all your past orders. Each order is processed in a few minutes, and a link is generated to download that dataset.
This one is a lot easier. Lake data can be downloaded using the web services of http://waterdatafortexas.org/. To get the entire history of a lake, the URL pattern "http://waterdatafortexas.org/reservoirs/individual/<lake name>.csv" will download a CSV file. If the lake name contains a space, underscores must replace the spaces.
library(dplyr)     # my go-to for data transformations
library(lubridate) # fancy date package

# benbrook lake url found at http://www.waterdatafortexas.org/reservoirs/download
benbrook_lake_url <- "http://waterdatafortexas.org/reservoirs/individual/benbrook.csv"
lake_data <- read.csv(benbrook_lake_url, header = TRUE, skip = 42,
                      stringsAsFactors = FALSE)
lake_data$date <- ymd(lake_data$date)  # convert date field to date datatype

# look at only 1990 and on because there is no lake level data before 1990
lake_data <- subset(lake_data, year(date) > 1989)

# saved the downloaded NOAA dataset as rain_fall.csv
rain_fall <- read.csv("rain_fall.csv", header = TRUE, stringsAsFactors = FALSE)
rain_fall$DATE <- ymd(rain_fall$DATE)
rain_fall <- rain_fall[, c("DATE", "PRCP")]
names(rain_fall) <- c("date", "prcp")

# select only distinct dates (.keep_all = TRUE keeps the prcp column)
rain_fall <- rain_fall %>% distinct(date, .keep_all = TRUE) %>% arrange(date)

# create a consistent dataset as some dates are missing in each dataset
dates_of_interest <- data.frame(date = seq(ymd('1990-01-01'), ymd('2016-05-20'),
                                           by = '1 day'))
rain_fall <- merge(rain_fall, dates_of_interest, all = TRUE)
lake_data <- merge(lake_data, dates_of_interest, all = TRUE)
R has a built-in time series class, and using this data type makes time series analysis much easier.
# convert to time series
rain_fall_ts <- ts(rain_fall$prcp, start = c(1990, 1, 1), frequency = 365.25)
lake_data_ts <- ts(lake_data$water_level, start = c(1990, 1, 1), frequency = 365.25)

# poor man's imputation:
# imputes missing values with the median over a local window
localMedianImputation <- function(x, window = 10){
  for(idx in which(is.na(x))){
    x[idx] <- median(x[(idx - window/2):(idx + window/2)], na.rm = TRUE)
  }
  x
}

# impute missing data for both datasets
lake_data_ts <- localMedianImputation(lake_data_ts)
rain_fall_ts <- localMedianImputation(rain_fall_ts)
At first one might try ccf(rain_fall_ts, lake_data_ts), but this doesn’t really tell us much.
The dotted lines show the significance level. This doesn’t tell us much because the cross-correlation is looking at the lake level itself rather than the change in the lake level.
This time I am going to look at the changes in the lake level, which can be done by taking the diff of the time series: ccf(rain_fall_ts, diff(lake_data_ts)).
Now this is cool. There is a significant correlation on the day of rainfall and up to three days before, with the highest correlation the day before.
It appears that at this location (76126) there is a positive correlation between precipitation and changes in the Benbrook Lake level.
Data Files
Below is the code used to create Figure 1.
# used for handy data splitting
library(caret)

# grab a small dataset
folds <- createFolds(mtcars$mpg, k = length(mtcars$mpg))

# create a stacked stripchart where each row is a fold
pltLoc <- 1.4
for(fold in folds){
  foldx <- rep(1, length(mtcars$mpg))
  foldx[fold] <- 0
  if(pltLoc == 1.4){
    stripchart(1:length(mtcars$mpg), main = "LOO Cross Validation",
               xlab = "index", ylab = "fold",
               col = foldx + 2, pch = 22, bg = foldx + 2, at = pltLoc)
  } else {
    stripchart(1:length(mtcars$mpg), col = foldx + 2, pch = 22,
               bg = foldx + 2, add = TRUE, at = pltLoc)
  }
  pltLoc <- pltLoc - 0.025
}
To begin, I started a local node on my personal computer with a max memory allocation of 24 GB. H2O doesn’t allocate all the memory right away, but it does expect the memory to be there when it needs it.
# Load H2O library
library(h2o)

# create local node with max 24 gigabytes of RAM
h2o <- h2o.init(max_mem_size = '24g')

# Load training and test datasets
train <- read.csv("train.csv")
train$label <- as.factor(train$label)
test <- read.csv("test.csv")

# move the data into h2o frames
h2o.train <- as.h2o(train, destination_frame = "training_data")
h2o.test <- as.h2o(test, destination_frame = "testing_data")

# label is column 1; the 784 pixel columns are 2:785
model.dl <- h2o.deeplearning(2:785, 1, training_frame = h2o.train,
                             nfolds = 2, hidden = c(750, 750))

# Plot learning curve
plot(model.dl)

# get predictions on the test set from the h2o deep learning model
h2o.predictions <- h2o.predict(model.dl, h2o.test)

# create prediction data frame for submission
predictions <- data.frame(ImageId = 1:nrow(h2o.predictions),
                          Label = as.vector(h2o.predictions[, 1]))
write.csv(predictions, "h2odl_MINST_submission.csv", row.names = FALSE)
All functions in h2o typically start with the h2o. prefix. The h2o deep learning model is called with h2o.deeplearning. As with all h2o models, nfolds sets the number of folds for k-fold cross-validation. For this exercise I used 2 folds; however, the recommended number of folds is between 5 and 10 to reduce bias. In lieu of k-fold cross-validation, a validation frame can be set with validation_frame = "h2o validation data frame".
Using this script the submission accuracy was 96%, which is pretty good out of the box with default parameters. One feature of h2o I hope to explore later is grid search for optimizing a model’s hyperparameters.
I am still impressed with how easy h2o is to use out of the box and look forward to learning how to leverage all h2o has to offer.
I have always wanted to dig deeper into where the standard error formula, SE = σ/√n, comes from.
The mathematical derivation is pretty straightforward. First, note that the mean in question is the sample mean of n draws from a population: X̄ = (1/n) Σ Xᵢ.

We know that the variance is equal to the expected value of the squared difference from the mean: Var(X) = E[(X − μ)²].

Replacing X with X̄ yields Var(X̄) = Var((1/n) Σ Xᵢ) = (1/n²) Σ Var(Xᵢ) = (1/n²) · nσ² = σ²/n, since the Xᵢ are independent, so the variance of their sum is the sum of their variances.

Taking the square root gives SE = σ/√n. And there we have it.
Most commonly, the standard error is calculated from a sample, using the sample standard deviation s, for the sample mean: SE = s/√n.
For example, a 95% confidence interval (alpha of 5%) for a normal distribution would be the sample mean ± 1.96 · s/√n.
I found this a helpful exercise in gaining confidence in the formulas I am using. The key was to understand that the SE describes the sample mean and not an individual sample value.
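A quick simulation can back this up; the sketch below (my addition) checks that the spread of many sample means matches σ/√n:

```r
# Empirical check: the standard deviation of many sample means
# should match sigma / sqrt(n)
set.seed(42)
n <- 30
sample_means <- replicate(10000, mean(rnorm(n, mean = 0, sd = 2)))
sd(sample_means)  # empirical standard error of the mean
2 / sqrt(n)       # theoretical sigma / sqrt(n), about 0.365
```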
R H2O Tutorial
Installing the h2o package.
# Install h2o package. Downloads the latest version.
# This might not be what you want if making a cluster:
# client and server need the same h2o version installed.
# The download contains an R package in the R folder.
install.packages("h2o")
Creating a simple gradient boosting model.
library(h2o)
h2o.init()

# create h2o frame from R data frame
iris.h2o <- as.h2o(iris)

# Split data into training and test sets
split <- h2o.splitFrame(iris.h2o, ratios = 0.75,
                        destination_frames = c("iris.train", "iris.test"))

# create training and test set workspace references
iris.train <- split[[1]]
iris.test <- split[[2]]

# Train a multinomial model on the training data;
# x and y are vectors of column names
x <- names(iris.train)[1:4]  # column 5 is Species
iris.model.gbm <- h2o.gbm(x = x, y = "Species",
                          training_frame = iris.train,
                          model_id = "iris.model.gbm",
                          distribution = "multinomial")

# peek at model training results
iris.model.gbm

# measure performance on the test set
perf <- h2o.performance(iris.model.gbm, iris.test)

# Calculate the mean square error
h2o.mse(perf)

# shutdown the h2o node
h2o.shutdown()
Node(s) store the h2o data frames; the R workspace variables only hold references to them. Removing an h2o object from the R workspace will not delete it from the cluster. In the example above, destination_frames defines the h2o data frame names on the cluster. You can log into a running cluster or node at ip address:port; in the example above, localhost:54321 should work. With the web interface, this tutorial could have been done without R at all.
I look forward to exploring more functionality within h2o, especially deep learning models. I will post how to deploy an h2o cluster in the near future.
What is probability?
Probability is the number of outcomes of interest over all possible outcomes. Mathematically this is represented by P(A) = (outcomes in A) / (all possible outcomes), where P(A) is the probability of event A. Numerical values for probability range as a continuous variable from zero to one, e.g. 0.1, 0.99996, 0.23, 1.
For example, a bag has five red marbles and six blue marbles. The probability of randomly picking a red marble, P(red marble), would be five red marbles over all the possible marbles in the bag (five red + six blue = eleven marbles): P(red) = 5/11.
And thus, starting from a full bag of eleven marbles, the probability of randomly picking a blue marble would be P(blue) = 6/11.
Probability can be of two types: independent or dependent. An independent probability is not affected by other variables or probabilities: P(A) does not depend on P(B), and vice versa, where A and B are any independent events.
Dependent probability is commonly noted as P(A | B), the probability of A given B; in other words, the probability of A given that B is true. Recycling the example above, consider a bag with five red marbles and six blue marbles where, in addition, two red marbles are magic and one blue marble is magic. Simplifying, A is picking a red marble and B is picking a blue marble.
What is the probability of randomly picking a magic marble given that B is true?
One blue magic marble is in the bag, and given that a blue marble was picked (remember, only six blue marbles are in the bag), the probability of picking a magic marble given B reduces to one magic marble out of six blue marbles: P(magic | blue) = 1/6.
I am reviewing probability as a primer for a deep dive into logistic regression. I wanted to mention one more fact that will help in understanding logistic regression: the probability of events A and B both occurring, assuming A and B are independent, is defined as P(A and B) = P(A) · P(B).
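The marble examples above can be worked numerically in a few lines of R (numbers taken from the post):

```r
# Marble-bag example: 5 red and 6 blue marbles
red  <- 5
blue <- 6
p_red  <- red / (red + blue)    # P(red)  = 5/11
p_blue <- blue / (red + blue)   # P(blue) = 6/11

# P(magic | blue): one magic blue marble out of six blue marbles
p_magic_given_blue <- 1 / blue  # 1/6

# independent events multiply: e.g. red on two draws with replacement
p_red_then_red <- p_red * p_red

c(p_red, p_blue, p_magic_given_blue, p_red_then_red)
```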
With probability as a base, I will start a series on defining and implementing a logistic regression algorithm.
install.packages("xlsx")
It produced the following error:
configure: error: Cannot compile a simple JNI program. See config.log for details.
Make sure you have Java Development Kit installed and correctly registered in R.
If in doubt, re-run "R CMD javareconf" as root.
ERROR: configuration failed for package ‘rJava’
* removing ‘/home/hiro/R/x86_64-pc-linux-gnu-library/3.1/rJava’
Warning in install.packages :
  installation of package ‘rJava’ had non-zero exit status
ERROR: dependency ‘rJava’ is not available for package ‘xlsxjars’
* removing ‘/home/hiro/R/x86_64-pc-linux-gnu-library/3.1/xlsxjars’
Warning in install.packages :
  installation of package ‘xlsxjars’ had non-zero exit status
ERROR: dependencies ‘rJava’, ‘xlsxjars’ are not available for package ‘xlsx’
* removing ‘/home/hiro/R/x86_64-pc-linux-gnu-library/3.1/xlsx’
Warning in install.packages :
  installation of package ‘xlsx’ had non-zero exit status
sudo R CMD javareconf
Didn’t tell me much.
Then I tried to install the rJava package
install.packages("rJava")
I got two errors
/usr/bin/ld: cannot find -lpcre /usr/bin/ld: cannot find -lbz2
After searching, I found the problem: I was missing two libraries, which can be installed with the following command.
sudo apt-get install libpcre3-dev libz-dev
I had to close and open R Studio after all these steps.
Hope this helps someone else!