2  R Programming for Data Analysis

library(dplyr)

2.1 Data Cleaning

iris <- iris %>%
  mutate(Petal.Ratio = Petal.Length / Petal.Width)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
1          5.1         3.5          1.4         0.2  setosa        7.00
2          4.9         3.0          1.4         0.2  setosa        7.00
3          4.7         3.2          1.3         0.2  setosa        6.50
4          4.6         3.1          1.5         0.2  setosa        7.50
5          5.0         3.6          1.4         0.2  setosa        7.00
6          5.4         3.9          1.7         0.4  setosa        4.25
iris %>%
  group_by(Species) %>%
  summarise(mean_Petal_Ratio = mean(Petal.Ratio))
# A tibble: 3 × 2
  Species    mean_Petal_Ratio
  <fct>                 <dbl>
1 setosa                 6.91
2 versicolor             3.24
3 virginica              2.78

2.2 Split Dataset

Here is how to draw a random sample of rows, first with dplyr and then with base R.

iris %>% sample_n(10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Petal.Ratio
1           6.5         2.8          4.6         1.5 versicolor    3.066667
2           6.4         3.2          5.3         2.3  virginica    2.304348
3           5.0         3.3          1.4         0.2     setosa    7.000000
4           5.3         3.7          1.5         0.2     setosa    7.500000
5           6.7         2.5          5.8         1.8  virginica    3.222222
6           5.9         3.0          5.1         1.8  virginica    2.833333
7           5.5         2.6          4.4         1.2 versicolor    3.666667
8           4.9         2.4          3.3         1.0 versicolor    3.300000
9           5.8         2.7          4.1         1.0 versicolor    4.100000
10          6.3         2.9          5.6         1.8  virginica    3.111111
iris[sample(nrow(iris), 10), ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Petal.Ratio
93           5.8         2.6          4.0         1.2 versicolor    3.333333
148          6.5         3.0          5.2         2.0  virginica    2.600000
117          6.5         3.0          5.5         1.8  virginica    3.055556
104          6.3         2.9          5.6         1.8  virginica    3.111111
53           6.9         3.1          4.9         1.5 versicolor    3.266667
118          7.7         3.8          6.7         2.2  virginica    3.045455
141          6.7         3.1          5.6         2.4  virginica    2.333333
61           5.0         2.0          3.5         1.0 versicolor    3.500000
57           6.3         3.3          4.7         1.6 versicolor    2.937500
78           6.7         3.0          5.0         1.7 versicolor    2.941176
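
Both calls above return different rows on every run. A minimal sketch of how to make a sample reproducible, assuming an arbitrary seed value of 42 and the hypothetical names sample_a and sample_b:

```r
# Fix the random number generator state so the same rows are drawn each run.
# The seed value 42 is arbitrary; any integer works.
set.seed(42)
sample_a <- iris[sample(nrow(iris), 10), ]

# Resetting the same seed reproduces the same draw
set.seed(42)
sample_b <- iris[sample(nrow(iris), 10), ]

# TRUE when both samples contain exactly the same rows in the same order
identical(sample_a, sample_b)
```

dplyr draws from the same random number generator, so calling set.seed() before sample_n() has the same effect. Note that sample_n() has been superseded since dplyr 1.0.0; slice_sample(n = 10) is the current equivalent.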

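Pass the target variable y (here, Species) to createDataPartition rather than any other variable, so that the split is stratified on the target and each class keeps roughly the same share in the training and test sets. The argument p is the proportion of rows assigned to the training set.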

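library(caret)
# Stratified split on the target: about 70% of each species goes to training
training.samples <- iris$Species %>%
  createDataPartition(p = 0.70, list = FALSE)

# Selected rows form the training set; the remaining rows form the test set
traindf <- iris[training.samples, ]
testdf <- iris[-training.samples, ]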
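
To see what stratifying on the target means, here is a rough base R sketch of the same idea. It is not caret's implementation, only an illustration; the seed value and the names rows_by_class, train_idx, train_sketch and test_sketch are assumptions made for this example.

```r
# Base R illustration of a stratified 70/30 split on Species
set.seed(123)  # arbitrary seed, used only for reproducibility
p <- 0.70

# Group the row indices by species, then sample about 70% within each group
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(p * length(idx)))
}))

train_sketch <- iris[train_idx, ]
test_sketch <- iris[-train_idx, ]

# Class counts in each set: every species has the same share in both
table(train_sketch$Species)
table(test_sketch$Species)
```

Because iris has 50 rows per species, this gives 35 training rows and 15 test rows for each species. The same check, table(traindf$Species) and table(testdf$Species), can be used to confirm that the caret split is balanced.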