2 R Programming for Data Analysis
2.1 Data Cleaning
library(dplyr)
# Add the ratio of petal length to petal width as a new column
iris <- iris %>%
  mutate(Petal.Ratio = Petal.Length / Petal.Width)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
1 5.1 3.5 1.4 0.2 setosa 7.00
2 4.9 3.0 1.4 0.2 setosa 7.00
3 4.7 3.2 1.3 0.2 setosa 6.50
4 4.6 3.1 1.5 0.2 setosa 7.50
5 5.0 3.6 1.4 0.2 setosa 7.00
6 5.4 3.9 1.7 0.4 setosa 4.25
# Mean petal ratio for each species
iris %>%
  group_by(Species) %>%
  summarise(mean_Petal_Ratio = mean(Petal.Ratio))
# A tibble: 3 × 2
Species mean_Petal_Ratio
<fct> <dbl>
1 setosa 6.91
2 versicolor 3.24
3 virginica 2.78
2.2 Split Dataset
Here is how to draw a random sample of rows, first with dplyr's sample_n() and then with base R indexing.
iris %>% sample_n(10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
1 6.5 2.8 4.6 1.5 versicolor 3.066667
2 6.4 3.2 5.3 2.3 virginica 2.304348
3 5.0 3.3 1.4 0.2 setosa 7.000000
4 5.3 3.7 1.5 0.2 setosa 7.500000
5 6.7 2.5 5.8 1.8 virginica 3.222222
6 5.9 3.0 5.1 1.8 virginica 2.833333
7 5.5 2.6 4.4 1.2 versicolor 3.666667
8 4.9 2.4 3.3 1.0 versicolor 3.300000
9 5.8 2.7 4.1 1.0 versicolor 4.100000
10 6.3 2.9 5.6 1.8 virginica 3.111111
iris[sample(nrow(iris), 10), ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
93 5.8 2.6 4.0 1.2 versicolor 3.333333
148 6.5 3.0 5.2 2.0 virginica 2.600000
117 6.5 3.0 5.5 1.8 virginica 3.055556
104 6.3 2.9 5.6 1.8 virginica 3.111111
53 6.9 3.1 4.9 1.5 versicolor 3.266667
118 7.7 3.8 6.7 2.2 virginica 3.045455
141 6.7 3.1 5.6 2.4 virginica 2.333333
61 5.0 2.0 3.5 1.0 versicolor 3.500000
57 6.3 3.3 4.7 1.6 versicolor 2.937500
78 6.7 3.0 5.0 1.7 versicolor 2.941176
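Both calls above draw different rows on every run. If the sample needs to be reproducible, one option is to fix the random seed first. The sketch below uses base R only; the seed value 42 and the names idx_a and idx_b are arbitrary choices for illustration.

```r
# Fix the random seed so the same rows are drawn on every run.
set.seed(42)                      # 42 is an arbitrary, illustrative seed
idx_a <- sample(nrow(iris), 10)   # 10 row indices, without replacement

# Resetting the same seed reproduces the same draw.
set.seed(42)
idx_b <- sample(nrow(iris), 10)

identical(idx_a, idx_b)           # TRUE: same seed, same sample
iris[idx_a, ]                     # the sampled rows
```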
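Pass the target variable y (here, Species) to createDataPartition rather than any other variable: the function samples within each class of y, so the class proportions are preserved in both sets. p is the proportion of rows assigned to the training set.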
library(caret)
# Row indices of the training set, stratified by Species
training.samples <- iris$Species %>%
  createDataPartition(p = 0.70, list = FALSE)
traindf <- iris[training.samples, ]
testdf <- iris[-training.samples, ]
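A quick way to check a split like this is to confirm that every row lands in exactly one set and that the class proportions look similar in both. The following is a minimal sketch, assuming the caret package is installed; the seed 123 and the names idx, train_chk and test_chk are illustrative and not part of the original code.

```r
library(caret)

set.seed(123)  # arbitrary seed, only so the check is reproducible
idx <- createDataPartition(iris$Species, p = 0.70, list = FALSE)
train_chk <- iris[idx, ]
test_chk  <- iris[-idx, ]

# Every row should land in exactly one of the two sets.
nrow(train_chk) + nrow(test_chk) == nrow(iris)

# Class proportions in each set should be close to those of the full data.
prop.table(table(iris$Species))
prop.table(table(train_chk$Species))
prop.table(table(test_chk$Species))
```

If the proportions differ noticeably between the two sets, the split was probably made on a variable other than the target.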