Ben & Jerry’s: Who Buys and What Drives Spending?

Beverages
Author

Ryan Horn

Published

February 11, 2026

Ben & Jerry’s Ice Cream

Introduction

This post explores purchasing behavior for Ben & Jerry’s ice cream using the “ice_cream” dataset.

We examine pricing, household characteristics, coupon usage, and regional patterns to understand which factors are associated with the unit price households pay for ice cream.

library(tidyverse)
ice_cream <- read_csv("https://bcdanl.github.io/data/ben-and-jerry-cleaned.csv")
Rows: 21974 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): flavor_descr, size1_descr, region, race
dbl (5): priceper1, household_id, household_income, household_size, couponper1
lgl (8): usecoup, married, hispanic_origin, microwave, dishwasher, sfh, inte...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Effective price: unit price minus the per-unit coupon value
ice_cream2 <- ice_cream %>% 
  mutate(
    effective_price = priceper1 - couponper1
  )

Descriptive Statistics

summary_stats <- ice_cream %>% 
  summarise(
    mean_price = mean(priceper1, na.rm = TRUE),
    median_price = median(priceper1, na.rm = TRUE),
    sd_price = sd(priceper1, na.rm = TRUE),
    avg_household_size = mean(household_size, na.rm = TRUE)
  )

show(summary_stats)
# A tibble: 1 × 4
  mean_price median_price sd_price avg_household_size
       <dbl>        <dbl>    <dbl>              <dbl>
1       3.31         3.34    0.666               2.46

Filtering

# usecoup is already logical, so it can be used directly as the filter condition
coupon_users <- ice_cream %>% 
  filter(usecoup)
show(coupon_users)
# A tibble: 2,345 × 17
   priceper1 flavor_descr              size1_descr household_id household_income
       <dbl> <chr>                     <chr>              <dbl>            <dbl>
 1      3.41 CAKE BATTER               16.0 MLOZ        2001456           130000
 2      3    PISTACHIO PISTACHIO       16.0 MLOZ        2002721           210000
 3      4    PISTACHIO PISTACHIO       16.0 MLOZ        2002721           210000
 4      3.99 CHC CHIP C-DH             16.0 MLOZ        2004690            80000
 5      3.52 CHERRY GRCA               16.0 MLOZ        2002800            70000
 6      3    NEW YORK SUPER FUDGE CHU… 16.0 MLOZ        2003661           150000
 7      3.2  CAKE BATTER               16.0 MLOZ        2003661           150000
 8      3.25 CHERRY GRCA               16.0 MLOZ        2003661           150000
 9      3.5  PHISH FOOD                16.0 MLOZ        2003661           150000
10      3.99 MINT CHC CHUNK            16.0 MLOZ        2003661           150000
# ℹ 2,335 more rows
# ℹ 12 more variables: household_size <dbl>, usecoup <lgl>, couponper1 <dbl>,
#   region <chr>, married <lgl>, race <chr>, hispanic_origin <lgl>,
#   microwave <lgl>, dishwasher <lgl>, sfh <lgl>, internet <lgl>, tvcable <lgl>

Group by: Coupon users

compare <- ice_cream %>% 
  group_by(usecoup) %>% 
  summarise(
    avg_price = mean(priceper1, na.rm = TRUE),
    count = n()
  )
show(compare)
# A tibble: 2 × 3
  usecoup avg_price count
  <lgl>       <dbl> <int>
1 FALSE        3.31 19629
2 TRUE         3.38  2345
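
The comparison above uses priceper1, which does not subtract the coupon value. A minimal sketch of the same grouping with the effective price added is shown here. The helper name compare_effective is hypothetical, and the toy rows are invented purely to show the call; they are not taken from the dataset.

```r
library(dplyr)

# Hypothetical helper (not part of the original analysis): average unit price,
# average coupon value, and average effective price by coupon usage.
compare_effective <- function(df) {
  df %>%
    mutate(effective_price = priceper1 - couponper1) %>%
    group_by(usecoup) %>%
    summarise(
      avg_price           = mean(priceper1, na.rm = TRUE),
      avg_coupon          = mean(couponper1, na.rm = TRUE),
      avg_effective_price = mean(effective_price, na.rm = TRUE),
      count               = n(),
      .groups = "drop"
    )
}

# Toy rows invented for illustration only; NOT from ben-and-jerry-cleaned.csv.
toy <- data.frame(
  priceper1  = c(3.00, 3.50, 4.00, 3.00),
  couponper1 = c(0.00, 0.00, 1.00, 0.50),
  usecoup    = c(FALSE, FALSE, TRUE, TRUE)
)

toy_compare <- compare_effective(toy)
print(toy_compare)
```

On the real data the call would be compare_effective(ice_cream), assuming couponper1 is 0 for purchases made without a coupon.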

Regression Model: What Drives Price Paid?

library(modelsummary)
model <- lm(priceper1 ~ usecoup + household_income + household_size + region,
            data = ice_cream)

modelsummary(model)
                      (1)
(Intercept)         3.453
                   (0.018)
usecoupTRUE         0.058
                   (0.014)
household_income   -0.000
                   (0.000)
household_size     -0.030
                   (0.003)
regionEast          0.138
                   (0.014)
regionSouth        -0.021
                   (0.012)
regionWest          0.042
                   (0.013)
Num.Obs.            21974
R2                  0.016
R2 Adj.             0.015
AIC               44137.4
BIC               44201.3
Log.Lik.       -22060.681
RMSE                 0.66

Regression Interpretation

The regression results suggest that region is the strongest predictor of price variation in this model: purchases in the East cost about $0.14 more per unit than in the omitted baseline region, while the West is about $0.04 higher and the South about $0.02 lower. Coupon usage is associated with a slightly higher unit price (about $0.06), potentially reflecting coupon use on premium products. Each additional household member is associated with a unit price about $0.03 lower, possibly due to bulk purchasing behavior. Household income does not appear to meaningfully influence the price paid, though its coefficient is measured per dollar of income and is hard to read at three-decimal rounding. Overall, the model explains very little of the price variation (R2 = 0.016, or about 1.6%), indicating that factors beyond coupon use, household characteristics, and region likely drive pricing.
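
To make the coefficients concrete, the sketch below adds them up for a given household using the rounded values from the modelsummary() table. The helper predict_price is hypothetical, and the income term is left out because its estimate prints as -0.000 and cannot be recovered from the table. The result is therefore best read as a difference between two households with the same income, not as an exact fitted price.

```r
# Coefficients copied from the modelsummary() table (rounded to three decimals).
# household_income is omitted: its estimate prints as -0.000, so its
# contribution cannot be recovered from the rounded output.
coefs <- c(
  intercept      = 3.453,
  usecoup        = 0.058,
  household_size = -0.030,
  East           = 0.138,
  South          = -0.021,
  West           = 0.042
)

# Hypothetical helper: sum of the non-income terms for one household.
# Any region other than East, South, or West is treated as the omitted baseline.
predict_price <- function(usecoup, household_size, region) {
  region_effect <- if (region %in% c("East", "South", "West")) coefs[[region]] else 0
  coefs[["intercept"]] +
    coefs[["usecoup"]] * as.numeric(usecoup) +
    coefs[["household_size"]] * household_size +
    region_effect
}

# Two-person household in the East using a coupon versus a two-person
# household in the baseline region without a coupon.
east_coupon <- predict_price(usecoup = TRUE,  household_size = 2, region = "East")
baseline    <- predict_price(usecoup = FALSE, household_size = 2, region = "baseline")

# The gap is the coupon coefficient plus the East coefficient.
price_gap <- east_coupon - baseline
print(price_gap)
```

Because household size and income are held equal across the two calls, the gap reduces to 0.058 + 0.138, about $0.20 per unit.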

Price Distribution by Region

ggplot(ice_cream, aes(x = region, y = priceper1)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Unit Price Distribution by Region",
    x = "Region",
    y = "Price per Unit ($)"
  ) +
  theme_minimal()

The boxplot confirms regional variation in pricing, with the East showing the highest median unit price. Coupon usage does not appear to reduce the unit price before the coupon value is subtracted (priceper1), which supports the regression findings; the remaining plots use the effective price after coupons.
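
The regional medians read off the boxplot can also be checked numerically. The sketch below is one way to do that; the helper price_by_region is hypothetical, and the toy rows are invented only to show the call, not taken from the dataset.

```r
library(dplyr)

# Hypothetical helper: median and interquartile range of unit price by region,
# sorted so the region with the highest median comes first.
price_by_region <- function(df) {
  df %>%
    group_by(region) %>%
    summarise(
      median_price = median(priceper1, na.rm = TRUE),
      iqr_price    = IQR(priceper1, na.rm = TRUE),
      count        = n(),
      .groups = "drop"
    ) %>%
    arrange(desc(median_price))
}

# Toy rows invented for illustration only; NOT from ben-and-jerry-cleaned.csv.
toy <- data.frame(
  region    = c("East", "East", "East", "West", "West", "West"),
  priceper1 = c(3.50, 4.00, 3.75, 3.00, 3.25, 3.50)
)

toy_regions <- price_by_region(toy)
print(toy_regions)
```

On the real data the call would be price_by_region(ice_cream), and the first row would show whichever region has the highest median unit price.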

ggplot(ice_cream2, aes(x = usecoup, y = effective_price)) +
  geom_boxplot(fill = "steelblue") +
  facet_wrap(~ region) +
  labs(
    title = "Effective Price by Coupon Usage Across Regions",
    x = "Used Coupon",
    y = "Effective Price ($)"
  ) +
  theme_minimal()

ggplot(ice_cream2, aes(x = effective_price)) +
  geom_histogram(bins = 30, fill = "darkgreen", alpha = 0.7) +
  labs(
    title = "Distribution of Effective Unit Prices",
    x = "Effective Price ($)",
    y = "Count"
  ) +
  theme_minimal()