3  Datasets to Practice

3.1 Environment Setup

We use the tidyverse suite for data manipulation and visualization.

library(tidyverse)
library(scales) # For professional axis labels
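
Before running the examples in this chapter, it can help to confirm that the packages are installed. The following is a minimal sketch of one way to do that, not part of the original setup. The names `required_pkgs` and `missing_pkgs` are hypothetical, and ISwR is included only because its datasets are discussed later in this chapter.

```r
# Packages referenced in this chapter (ISwR is used in a later subsection).
required_pkgs <- c("tidyverse", "scales", "ISwR")

# requireNamespace() returns TRUE when a package can be loaded, without attaching it.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  # Report rather than install, so the check has no side effects.
  message("Install with: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All required packages are installed.")
}
```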

3.1.1 The Tidyverse Collection (Modern Data Science)

The tidyverse is a collection of R packages that share a common design philosophy and data structures. The datasets bundled with these packages are mainly used to practice data wrangling, tidy data principles, and visualization.

| Dataset | Package | Description | Primary Analysis Use |
|---|---|---|---|
| mpg | ggplot2 | Fuel economy data for 38 car models (1999–2008). | Exploratory Data Analysis (EDA): Practicing scatterplots, smoothing lines, and correlation between engine size and efficiency. |
| diamonds | ggplot2 | Prices and attributes of ~54,000 diamonds. | Predictive Modeling: Ideal for training linear regression models or random forests due to its large sample size. |
| starwars | dplyr | Characteristics of characters (height, mass, species). | Data Cleaning: Best for practicing how to handle NA (missing values) and complex list-columns. |
| table1-5 | tidyr | WHO Tuberculosis cases in different formats. | Data Reshaping: Specifically used to learn pivot_longer() and pivot_wider() to reach a “tidy” state. |
| storms | dplyr | Wind and pressure data for tropical cyclones. | Time Series & Mapping: Visualizing storm tracks over time and geographical coordinates. |
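
As a minimal sketch of the data reshaping use case, the example below pivots tidyr's table4a, which stores one column per year, into a long format with one row per country and year. The output column names `year` and `cases` and the object name `tidy_cases` are chosen here for illustration.

```r
library(tidyr)
library(dplyr)

# table4a holds TB case counts with one column per year (`1999` and `2000`).
# pivot_longer() gathers those columns into year/cases pairs.
tidy_cases <- table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )

tidy_cases
```

Each country now appears once per year, which is the layout most dplyr and ggplot2 functions expect.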

3.1.2 ISwR (Introductory Statistics with R)

The ISwR package accompanies the book Introductory Statistics with R by Peter Dalgaard and focuses on biostatistics and inference. Its datasets are smaller, but they are “textbook cases” for specific statistical tests.

| Dataset | Description | Primary Analysis Use |
|---|---|---|
| thuesen | Ventricular shortening velocity and blood glucose in diabetics. | Simple Linear Regression: Testing the linear relationship between two continuous clinical variables. |
| energy | Energy expenditure in lean vs. obese women. | Hypothesis Testing: Well suited to practicing the two-sample t-test and checking for equal variance. |
| juul | IGF-I (insulin-like growth factor I) levels across different ages. | ANOVA & Polynomial Regression: Analyzing how a biological marker changes non-linearly across puberty stages. |
| melanom | Survival data of patients with malignant melanoma. | Survival Analysis: A standard example for practicing Kaplan-Meier survival curves and Cox proportional hazards models. |
| stroke | Clinical outcomes of stroke patients in Estonia. | Logistic Regression: Predicting binary outcomes (e.g., dead/alive) based on age, sex, and clinical history. |
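
The hypothesis testing and regression uses listed above can be sketched with base R functions, assuming the ISwR package is installed. The object names `welch`, `variance_check`, and `fit` are illustrative only.

```r
library(ISwR)

# Two-sample t-test: energy expenditure by stature (lean vs. obese).
# By default, t.test() applies the Welch correction for unequal variances.
welch <- t.test(expend ~ stature, data = energy)

# var.test() compares the two group variances with an F test.
variance_check <- var.test(expend ~ stature, data = energy)

# Simple linear regression: shortening velocity on blood glucose.
# lm() drops rows with missing values by default (na.action = na.omit).
fit <- lm(short.velocity ~ blood.glucose, data = thuesen)

welch
variance_check
summary(fit)
```

If the variance test gives no reason to doubt equal variances, `t.test(..., var.equal = TRUE)` runs the classical pooled version instead.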

3.2 Loading the dataset

data("diamonds")
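
Once the dataset is loaded, a quick inspection confirms its size and column types. This is a minimal sketch that assumes ggplot2 and dplyr are installed; the summary by `cut` is only one possible first look, and the name `price_by_cut` is hypothetical.

```r
library(ggplot2)
library(dplyr)

data("diamonds")

# Compact overview of column names, types, and the first few values.
glimpse(diamonds)

# Number of rows and columns.
dim(diamonds)

# Count and median price for each cut quality.
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(n = n(), median_price = median(price))

price_by_cut
```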