2 R Programming for Data Analysis
2.1 Data Cleaning
library(dplyr)
# Add a derived column: ratio of petal length to petal width
iris <- iris %>%
  mutate(Petal.Ratio = Petal.Length / Petal.Width)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
1 5.1 3.5 1.4 0.2 setosa 7.00
2 4.9 3.0 1.4 0.2 setosa 7.00
3 4.7 3.2 1.3 0.2 setosa 6.50
4 4.6 3.1 1.5 0.2 setosa 7.50
5 5.0 3.6 1.4 0.2 setosa 7.00
6 5.4 3.9 1.7 0.4 setosa 4.25
iris %>%
  group_by(Species) %>%
  summarise(mean_Petal_Ratio = mean(Petal.Ratio))
# A tibble: 3 × 2
Species mean_Petal_Ratio
<fct> <dbl>
1 setosa 6.91
2 versicolor 3.24
3 virginica 2.78
2.2 Split Dataset
Here's how you can draw a random sample of rows, first with dplyr and then with base R.
iris %>% sample_n(10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
1 5.7 2.9 4.2 1.3 versicolor 3.230769
2 5.7 2.8 4.1 1.3 versicolor 3.153846
3 5.0 3.3 1.4 0.2 setosa 7.000000
4 5.7 3.0 4.2 1.2 versicolor 3.500000
5 6.7 2.5 5.8 1.8 virginica 3.222222
6 6.7 3.3 5.7 2.5 virginica 2.280000
7 6.1 3.0 4.6 1.4 versicolor 3.285714
8 4.6 3.4 1.4 0.3 setosa 4.666667
9 5.1 3.5 1.4 0.3 setosa 4.666667
10 5.8 2.6 4.0 1.2 versicolor 3.333333
iris[sample(nrow(iris), 10), ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
139 6.0 3.0 4.8 1.8 virginica 2.666667
118 7.7 3.8 6.7 2.2 virginica 3.045455
57 6.3 3.3 4.7 1.6 versicolor 2.937500
119 7.7 2.6 6.9 2.3 virginica 3.000000
70 5.6 2.5 3.9 1.1 versicolor 3.545455
33 5.2 4.1 1.5 0.1 setosa 15.000000
106 7.6 3.0 6.6 2.1 virginica 3.142857
145 6.7 3.3 5.7 2.5 virginica 2.280000
117 6.5 3.0 5.5 1.8 virginica 3.055556
36 5.0 3.2 1.2 0.2 setosa 6.000000
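Both calls above return different rows on every run. If a repeatable sample is wanted, one common approach is to fix the random seed first. The sketch below uses base R only; the seed value 42 and the names sample_a and sample_b are arbitrary choices for illustration, not part of the original example.

```r
# Fix the seed so the same rows are drawn on every run (42 is arbitrary)
set.seed(42)
sample_a <- iris[sample(nrow(iris), 10), ]

# Resetting the same seed reproduces the same draw
set.seed(42)
sample_b <- iris[sample(nrow(iris), 10), ]

# TRUE: both samples contain exactly the same rows
print(identical(sample_a, sample_b))
```

In recent dplyr versions, slice_sample(n = 10) is the recommended replacement for sample_n(10); it respects set.seed() in the same way.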
Pass the target variable y to createDataPartition, not any other variable, so that the split is stratified on the outcome. p is the proportion of rows assigned to the training set.
library(caret)
# Stratified 70/30 split on the target variable Species
training.samples <- iris$Species %>%
  createDataPartition(p = 0.70, list = FALSE)
traindf <- iris[training.samples, ]
testdf <- iris[-training.samples, ]
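To see what stratifying on the target buys you, it can help to compare the class counts in the two partitions. This is a minimal sketch, assuming caret is installed; the seed value 123 and the names idx, train_df and test_df are illustrative and do not appear in the original code.

```r
library(caret)

# Fix the seed so the partition is repeatable (123 is arbitrary)
set.seed(123)

# Stratified split on the target variable: sampling is done within each class
idx <- createDataPartition(iris$Species, p = 0.70, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

# Class counts in each partition; iris has equally sized classes,
# so each partition should be balanced across species as well
print(table(train_df$Species))
print(table(test_df$Species))

# Share of all rows that ended up in the training set
print(nrow(train_df) / nrow(iris))
```

Because the sampling happens inside each level of Species, the class mix of the training set mirrors that of the full data, which a plain sample() over row indices does not guarantee.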