2 R Programming for Data Analysis

library(dplyr)

2.1 Data Cleaning

# Add the petal length-to-width ratio as a new column
iris <- iris %>%
  mutate(Petal.Ratio = Petal.Length / Petal.Width)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
1          5.1         3.5          1.4         0.2  setosa        7.00
2          4.9         3.0          1.4         0.2  setosa        7.00
3          4.7         3.2          1.3         0.2  setosa        6.50
4          4.6         3.1          1.5         0.2  setosa        7.50
5          5.0         3.6          1.4         0.2  setosa        7.00
6          5.4         3.9          1.7         0.4  setosa        4.25
# Mean petal ratio within each species
iris %>%
  group_by(Species) %>%
  summarise(mean_Petal_Ratio = mean(Petal.Ratio))
# A tibble: 3 × 2
  Species    mean_Petal_Ratio
  <fct>                 <dbl>
1 setosa                 6.91
2 versicolor             3.24
3 virginica              2.78
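
A ratio column such as Petal.Ratio becomes Inf or NaN wherever the denominator is zero, so it can be worth checking before summarising. In iris the smallest Petal.Width is 0.1, so every ratio is finite, but the check is cheap. A minimal base R sketch, using the untouched built-in data; the names ratio and n_bad are illustrative and not part of the code above:

```r
# Recompute the ratio from the built-in iris data (datasets package)
ratio <- datasets::iris$Petal.Length / datasets::iris$Petal.Width

# Count values that are NA, NaN, or +/-Inf (e.g. from a zero denominator)
n_bad <- sum(!is.finite(ratio))
n_bad

# Smallest denominator actually present in the data
min(datasets::iris$Petal.Width)
```

If n_bad were greater than zero, those rows would need to be handled (dropped or recoded) before taking a mean.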

2.2 Split Dataset

Here are two ways to draw a random sample of 10 rows: with dplyr's sample_n(), or with base R row indexing.

iris %>% sample_n(10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Petal.Ratio
1           5.7         2.9          4.2         1.3 versicolor    3.230769
2           5.7         2.8          4.1         1.3 versicolor    3.153846
3           5.0         3.3          1.4         0.2     setosa    7.000000
4           5.7         3.0          4.2         1.2 versicolor    3.500000
5           6.7         2.5          5.8         1.8  virginica    3.222222
6           6.7         3.3          5.7         2.5  virginica    2.280000
7           6.1         3.0          4.6         1.4 versicolor    3.285714
8           4.6         3.4          1.4         0.3     setosa    4.666667
9           5.1         3.5          1.4         0.3     setosa    4.666667
10          5.8         2.6          4.0         1.2 versicolor    3.333333
iris[sample(nrow(iris), 10), ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Petal.Ratio
139          6.0         3.0          4.8         1.8  virginica    2.666667
118          7.7         3.8          6.7         2.2  virginica    3.045455
57           6.3         3.3          4.7         1.6 versicolor    2.937500
119          7.7         2.6          6.9         2.3  virginica    3.000000
70           5.6         2.5          3.9         1.1 versicolor    3.545455
33           5.2         4.1          1.5         0.1     setosa   15.000000
106          7.6         3.0          6.6         2.1  virginica    3.142857
145          6.7         3.3          5.7         2.5  virginica    2.280000
117          6.5         3.0          5.5         1.8  virginica    3.055556
36           5.0         3.2          1.2         0.2     setosa    6.000000
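
Both calls return different rows each time they run. When the sample needs to be repeatable, for instance in a report, one option is to set the random seed before sampling. A minimal sketch in base R; the seed value 123 and the names idx_a, idx_b, and sampled are arbitrary choices for illustration:

```r
# Fix the random number generator state so the draw can be repeated
set.seed(123)
idx_a <- sample(nrow(datasets::iris), 10)

# Resetting the same seed reproduces exactly the same row indices
set.seed(123)
idx_b <- sample(nrow(datasets::iris), 10)

identical(idx_a, idx_b)

# Use the indices to pull the sampled rows
sampled <- datasets::iris[idx_a, ]
nrow(sampled)
```

Any integer works as the seed; what matters is that the same value is set immediately before the sampling call.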

Pass the target variable y (here, iris$Species) to createDataPartition rather than any other variable, so that the split is stratified on the outcome. p is the proportion of rows assigned to the training set.

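library(caret)
# Row indices for the training set, stratified by Species
training.samples <- iris$Species %>%
  createDataPartition(p = 0.70, list = FALSE)

# Training rows, and the remaining rows as the test set
traindf <- iris[training.samples, ]
testdf <- iris[-training.samples, ]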
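
Because createDataPartition is given a factor outcome, it samples within each class, so the class balance carries over to both sets. The idea can be sketched by hand in base R. This is an illustration of stratified sampling, not caret's actual implementation; the helper stratified_idx, the seed, and the names train_manual and test_manual are assumptions made for this example:

```r
# Illustrative stratified split: sample a fraction p of row indices within each class
stratified_idx <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # sample.int on the length avoids sample()'s special case for length-1 input
    idx[sample.int(length(idx), size = round(length(idx) * p))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
train_idx <- stratified_idx(datasets::iris$Species, p = 0.70)

train_manual <- datasets::iris[train_idx, ]
test_manual <- datasets::iris[-train_idx, ]

# Class counts in each set: the classes stay balanced
table(train_manual$Species)
table(test_manual$Species)
```

With 50 rows per species and p = 0.70, this sketch puts 35 rows of each class in the training set and 15 in the test set. Sampling on any other variable would not guarantee that balance in the outcome.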