Preface

Welcome to the second edition of “Introduction to Modern Statistics”!

We hope readers will take away three ideas from this book in addition to forming a foundation of statistical thinking and methods.

Statistics is an applied field with a wide range of practical applications.
You don’t have to be a math guru to learn from interesting, real data.
Data are messy, and statistical tools are imperfect. However, when you understand the strengths and weaknesses of these tools, you can use them to learn interesting things about the world.

Textbook overview

Part 1: Introduction to data. Data structures, variables, summaries, graphics, and basic data collection and study design techniques.
Part 2: Exploratory data analysis. Data visualization and summarization, with particular emphasis on multivariable relationships.
Part 3: Regression modeling. Modeling numerical and categorical outcomes with linear and logistic regression and using model results to describe relationships and make predictions.
Part 4: Foundations for inference. Case studies are used to introduce the ideas of statistical inference with randomization tests, bootstrap intervals, and mathematical models.
Part 5: Statistical inference. Further details of statistical inference using randomization tests, bootstrap intervals, and mathematical models for numerical and categorical data.
Part 6: Inferential modeling. Extending inference techniques presented thus far to linear and logistic regression settings and evaluating model performance.

Each part contains multiple chapters and ends with a case study. The case study builds on the content of its part and presents a high-level overview of the tools and techniques covered there.

In the chapters that cover statistical inference, we have presented a parallel structure that walks the student through both computational and mathematical approaches to every inferential topic. Trying to cover every approach for every topic is likely too much material for a one-semester class. We suggest that you make deliberate choices for navigating the book with your students. A few potential paths through the book (with chapter numbers in parentheses) are as follows:

Focus on parallel structure of computational and mathematical methods: Introduction to data (1, 2), Exploratory data analysis (4, 5), Regression (7), Foundations (11, 12, 13, 14), Inference (a subset of: 16, 17, 18, 19, 20, 21, 22; potentially: 16, 17, 19, 20).
Focus on computational methods: Introduction to data (1, 2), Exploratory data analysis (4, 5), Regression (7), Foundations (11, 12, 14), Inference (computational methods only for some subset of: 16, 17, 18, 19, 20, 21, 22).
Focus on mathematical methods: Introduction to data (1, 2), Exploratory data analysis (4, 5), Regression (7), Foundations (11, 12, 13, 14), Inference (mathematical methods only for some subset of: 16, 17, 18, 19, 20, 21, 22).
Focus on modeling: Introduction to data (1, 2), Exploratory data analysis (4, 5), Regression (7, 8, 9), Foundations (11, 12, 13, 14), Inference (19), Inferential modeling (24, 25, 26).

We expect that most courses following a classical syllabus will not have time to cover the chapters in the last part, Inferential modeling (24, 25, 26).

Each chapter ends with a review that contains a chapter summary as well as a list of key terms introduced in the chapter. If you’re not sure what some of these terms mean, we recommend you go back in the text and review their definitions. We purposefully present them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate. However, you should be able to spot them easily because they appear in bold.

Changes for the second edition

While the second edition does not represent a major change from the first edition, we have worked hard to improve content, to add exercises, and to update text and code to reflect changes in best practices (e.g., the book is now written in Quarto).

A brief summary of the biggest changes follows:

Twenty-five completely new exercises were added. Most of the new exercises are concatenated onto existing exercises so as to retain similar numbering across editions. However, a few exercises have been moved in order to produce both odd exercises (with solutions) and even exercises (without solutions) on the same topic.
Multiple datasets were added or updated. For example, the pm25_2022_durham data on air quality in Durham, NC, in 2022 can be found in the openintro R package.
Chapter 3, Applications: Data, was rewritten with an updated context and data example. The chapter now also explores Simpson’s Paradox.
Throughout the text and the exercises, “statistically significant” has been changed to “statistically discernible” so as to distance ourselves from the more colloquial use of the word “significant.”

Examples and exercises

Examples are provided to establish an understanding of how to apply methods.

This is an example. When a question is asked here, where can the answer be found?

The answer can be found here, in the solution section of the example!

When we think the reader is ready to try determining a solution on their own, we frame it as Guided Practice.

The reader may check or learn the answer to any Guided Practice problem by reviewing the full solution in a footnote.¹

Exercises are also provided at the end of each chapter. Solutions are given for odd-numbered exercises in Appendix A (Exercise solutions).

Datasets and their sources

A large majority of the datasets used in the book can be found in various R packages. Each time a new dataset is introduced in the narrative, a reference to the package like the one below is provided. Many of these datasets are in the openintro R package, which contains datasets used in OpenIntro’s open-source textbooks.²

The textbooks data can be found in the openintro R package.
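As a minimal sketch of how such a reference can be followed up in R, the snippet below loads the openintro package and takes a first look at the textbooks data. It assumes the package has already been installed (for example from CRAN); head() and str() are just one common way to inspect a data frame, not a workflow prescribed by the book.

```r
# One-time setup, assuming installation from CRAN (uncomment if needed):
# install.packages("openintro")

# Attach the package so its datasets are available by name
library(openintro)

# First few rows of the dataset named in the reference above
head(textbooks)

# Compact overview of the variables and their types
str(textbooks)
```

The same pattern should apply to any other dataset referenced this way: attach the package named in the reference, then use the dataset by name.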

The datasets used throughout the book come from real sources like opinion polls and scientific articles, except for a handful of cases where we use toy data to highlight a particular feature or explain a particular concept. References for the sources of the real data are provided at the end of the book.

Computing with R

The narrative and the exercises in the book are computing-language agnostic. However, while it’s possible to learn about modern statistics without computing, it’s not possible to apply it. Therefore, we invite you to explore the concepts you have learned in each part using the interactive R tutorials and the R labs that are included at the end of each part.

Interactive R tutorials

The self-paced and interactive R tutorials were developed using the learnr R package, and only an internet browser is needed to complete them.

Each part comes with a tutorial composed of 4 to 10 lessons, listed like this.

Each of these lessons is listed like this.

You can access the full list of tutorials supporting this book here.

R labs

Once you feel comfortable with the material in the tutorials, we encourage you to apply what you’ve learned via the computational labs that are also linked at the end of each part. The labs consist of data analysis case studies, and they require access to R and RStudio. The first lab includes installation instructions. If you’d rather not install the software locally, you can also try Posit Cloud for free.

Labs for each part are listed like this.

You can access the full list of labs supporting this book here.

OpenIntro, online resources, and getting involved

OpenIntro is an organization focused on developing free and affordable education materials. We encourage anyone learning or teaching statistics to visit openintro.org and to get involved.

All OpenIntro resources are free, and anyone is welcome to use these online tools and resources with or without this textbook as a companion.

We value your feedback. If there is a part of the project you especially like or think needs improvement, we want to hear from you. For feedback on this specific book, you can open an issue on the GitHub repository of the book at github.com/openintrostat/ims. You can also provide feedback on this book or any other OpenIntro resource via our contact form at openintro.org.

Acknowledgements

The OpenIntro project would not have been possible without the dedication and volunteer hours of all those involved, and we hope you will join us in extending a huge thank you to all those who volunteer with OpenIntro.

The authors would like to thank the following individuals:

David Diez and Christopher Barr for their work on the first edition of this book,
Ben Baumer and Andrew Bray for their contribution to rethinking how and in which order we present this material, as well as their work as original authors of the interactive tutorial content,
Yanina Bellini Saibene, Florencia D’Andrea, and Roxana Noelia Villafañe for their work on creating the interactive tutorials in learnr,
Peter Baumgartner for review and revisions of the interactive learnr tutorials,
Will Gray for conceptual diagrams,
Allison Theobold, Melinda Yager, and Randy Pruim for their valuable feedback and review of the book,
Colin Rundel for feedback on content and technical help with conversion from LaTeX to R Markdown,
Christophe Dervieux for help with multi-output bookdown issues, and
Müge Çetinkaya and Meenal Patel for their design vision.

We would also like to thank the developers of the open-source tools that make the development and authoring of this book possible, e.g., Quarto, tidyverse, tidymodels, and icons8.

We are also grateful to the many teachers, students, and other readers who have helped improve OpenIntro resources through their feedback.

¹ Guided Practice problems are intended to stretch your thinking, and you can check yourself by reviewing the footnote solution for any Guided Practice.
² Mine Çetinkaya-Rundel, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno, and Christopher Barr (2024). openintro: Data Sets and Supplemental Functions from ‘OpenIntro’ Textbooks and Labs. R package version 2.4.0. https://github.com/openintrostat/openintro.