library(tidyverse)
library(scales) # For professional axis labels
3 Datasets to Practice
3.1 Environment Setup
We use the tidyverse suite for data manipulation and visualization.
3.1.1 The Tidyverse Collection (Modern Data Science)
The tidyverse is a suite of packages designed for a seamless data workflow. Its datasets are primarily used to master Data Wrangling, Tidy Data principles, and Advanced Visualization.
| Dataset | Package | Description | Primary Analysis Use |
|---|---|---|---|
| mpg | ggplot2 | Fuel economy data for 38 car models (1999–2008). | Exploratory Data Analysis (EDA): Practicing scatterplots, smoothing lines, and correlation between engine size and efficiency. |
| diamonds | ggplot2 | Prices and attributes of ~54,000 diamonds. | Predictive Modeling: Ideal for training linear regression models or random forests due to its large sample size. |
| starwars | dplyr | Characteristics of characters (height, mass, species). | Data Cleaning: Best for practicing how to handle NA (missing values) and complex list-columns. |
| table1-5 | tidyr | WHO Tuberculosis cases in different formats. | Data Reshaping: Specifically used to learn pivot_longer() and pivot_wider() to reach a “tidy” state. |
| storms | dplyr | Wind and pressure data for tropical cyclones. | Time Series & Mapping: Visualizing storm tracks over time and geographical coordinates. |
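As a rough illustration of two of the uses listed above, the sketch below reshapes tidyr's table4a into tidy form with pivot_longer() and counts missing values in two starwars columns. The object names tidy4a and na_counts are only illustrative, and the code assumes the tidyverse is installed.

```r
library(tidyverse)

# table4a stores TB case counts with one column per year (`1999`, `2000`).
# pivot_longer() stacks those year columns into year/cases pairs.
tidy4a <- table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )
print(tidy4a)

# Count missing values (NA) in two numeric starwars columns.
na_counts <- starwars %>%
  summarise(across(c(height, mass), ~ sum(is.na(.x))))
print(na_counts)
```

After reshaping, each row of tidy4a holds one country-year observation, which is the layout most dplyr and ggplot2 functions expect.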
3.1.2 ISwR (Introductory Statistics with R)
The ISwR package (by Peter Dalgaard) focuses on Biostatistics and Inference. These datasets are smaller but are “textbook cases” for specific mathematical tests.
| Dataset | Description | Primary Analysis Use |
|---|---|---|
| thuesen | Ventricular shortening velocity and blood glucose in diabetics. | Simple Linear Regression: Testing the linear relationship between two continuous clinical variables. |
| energy | Energy expenditure in lean vs. obese women. | Hypothesis Testing: Well suited to practicing the two-sample t-test and checking for equal variance. |
| juul | IGF-I (insulin-like growth factor I) levels across different ages. | ANOVA & Polynomial Regression: Analyzing how a biological marker changes non-linearly across puberty stages. |
| melanom | Survival data of patients with malignant melanoma. | Survival Analysis: A standard dataset for practicing Kaplan-Meier survival curves and Cox Proportional Hazards models. |
| stroke | Clinical outcomes of stroke patients in Estonia. | Logistic Regression: Predicting binary outcomes (e.g., dead/alive) based on age, sex, and clinical history. |
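A minimal sketch of the first two rows of the table, assuming the ISwR package is installed (it is not loaded by the setup code in this chapter). The object names vt, tt, and fit are only illustrative.

```r
library(ISwR)

# energy: compare energy expenditure between lean and obese women.
# var.test() checks the equal-variance assumption;
# t.test() uses the Welch (unequal variance) version by default.
vt <- var.test(expend ~ stature, data = energy)
tt <- t.test(expend ~ stature, data = energy)
print(vt)
print(tt)

# thuesen: simple linear regression of shortening velocity on blood glucose.
# lm() omits rows with missing values by default.
fit <- lm(short.velocity ~ blood.glucose, data = thuesen)
print(summary(fit))
```

If the variance test gives no reason to doubt equal variances, passing var.equal = TRUE to t.test() gives the classical pooled-variance test instead.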
3.2 Loading the dataset
data("diamonds")
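Once the dataset is loaded, a quick look at its structure is usually a sensible first step. The sketch below is one possible way to do that, followed by a plot that uses scales for dollar-formatted axis labels; the choice of variables and the object name p are illustrative assumptions rather than part of the original workflow.

```r
library(tidyverse)
library(scales)

data("diamonds")

# Quick structural check: column names, types, and the first few values.
glimpse(diamonds)

# Price against carat, with dollar-formatted y-axis labels from scales.
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  scale_y_continuous(labels = label_dollar()) +
  labs(x = "Carat", y = "Price (USD)")
print(p)
```

The low alpha value reduces overplotting, which matters with a dataset of this size.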