Red Wines Exploration by Hadrien Lacroix

References: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

More information about this dataset can be found here.

This tidy dataset contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

We will conduct an Exploratory Data Analysis in order to develop intuition about this dataset, extract insights that may uncover relevant questions, and eventually prepare the development of predictive models.

A first question that comes to mind is:

Which chemical properties influence the quality of red wines?

We will start by conducting univariate analyses to identify variables that have little or no impact on wine quality, focusing on the variation of the variables.

Bivariate analyses will allow us to look deeper into the relationship between retained variables and quality. This should enable us to identify critical variables.

These critical variables will be further explored with multivariate analysis. We should then be able to make predictions about wine quality based on its chemical properties.

Loading Packages and Dataset

Univariate Analysis

General information

First of all, let’s get to know our dataset a little better.

Dimensions:

## [1] 1599   13

These are the names of our variables:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

We’re going to rename the sulfur dioxide columns right away:

names(w)[names(w) == 'free.sulfur.dioxide'] <- 'free.SO2'
names(w)[names(w) == 'total.sulfur.dioxide'] <- 'total.SO2'

Summary:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides          free.SO2       total.SO2     
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00   Min.   :  6.00  
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00   1st Qu.: 22.00  
##  Median : 2.200   Median :0.07900   Median :14.00   Median : 38.00  
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87   Mean   : 46.47  
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00   3rd Qu.: 62.00  
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00   Max.   :289.00  
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Let’s begin by looking at the distribution of our wines in terms of our different variables:

##  int [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
##  num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  NULL
##  NULL
##  num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
##  NULL
##  num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...

Dropping the X variable

The ‘X’ variable serves as an index, and we won’t need it here. Let’s get rid of it right now so we don’t have to subset our dataframe all along the analysis.

w <- subset(w, select = - X)

Variable distribution

To begin, we’re going to look at the individual distribution of our variables.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

In theory, grades can range from 0 to 10. In practice, they range from 3 to 8, with a median at 6 and a mean at 5.636.

Quality follows a roughly normal distribution. As such, we have little data regarding very low and very high grades, and must be cautious when drawing conclusions from these.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity transformed logarithmically follows a normal distribution. Most values range from 4.60 to about 14 g / dm^3, with a few between 14 and 16. The mean is 8.32 and the median 7.9.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity transformed logarithmically follows a normal distribution. Values range from 0.12 to 1.58 g / dm^3, with a mean at 0.5278 and a median at 0.52.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Although we can’t call it right skewed because counts remain relatively even across values, low citric acid wines are more numerous than high citric acid wines. Values range from 0 to 1 g / dm^3, but values at 1 are outliers. The mean is 0.271 and the median 0.26. The range is short: this variable might be negligible.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

For residual sugar we have a right skewed distribution with a few outliers above 10 g / dm^3.

Values range from 0.9 to 15.5 g / dm^3. The mean is 2.539 and the median 2.2.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides follow a right-skewed distribution as well. They range from 0.012 to 0.611 g / dm^3, with three clusters. The range is extremely small, so the impact of this variable might be negligible.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Most wines have low free SO2: the higher the free SO2 level, the lower the count. We have an outlier around 68 mg / dm^3. Values range from 1 to 72 mg / dm^3, with a mean at 15.87 and a median at 14.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Once again, wines with a low total SO2 level are more numerous, and the higher the level, the fewer wines there are in our sample. Values range from 6 to 289 mg / dm^3, with a mean at 46.47 and a median at 38.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density follows a normal distribution, ranging from 0.9901 to 1.0037 g / cm^3, with a mean at 0.9967 and a median at 0.9968. It is distributed over a very small range, so this variable might be negligible too.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH follows a normal distribution ranging from 2.74 to 4.01, with a few outliers around 2.75 and above 3.75. The mean is 3.311 and the median 3.31.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates follow a roughly normal distribution as well, a little skewed to the right. We have outliers around 1.6 and 1.8 g / dm^3. The values range from 0.33 to 2 g / dm^3, with a mean at 0.6581 and a median at 0.62.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol distribution is right skewed, ranging from 8.40 degrees by volume to 14.90. We have outliers below 9, and above 14. The mean is 10.42 and the median 10.20.

Structure of the dataset

  • Although the quality column should hold values between 0 and 10, in reality our values range from 3 to 8
  • Alcohol values range from 8.40 to 14.90
  • Free and total sulfur dioxide, fixed and volatile acidity, as well as citric acid, are distributed over a wide range of values
  • Density and pH seem to follow a normal distribution
  • Fixed acidity, volatile acidity, residual sugar and chlorides seem to have extreme outliers.
  • Free sulfur dioxide, total sulfur dioxide and sulphates are right skewed.

Insights and features of interest

The main feature of interest of this dataset is the quality variable, which is supposedly impacted by all the other variables.

Most of our variables were highly right skewed, so we had to use logarithmic transformations.
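One quick way to check the skewness claim without plotting is to compare each mean with its median, since a mean well above the median usually points to a right tail. The sketch below simply retypes the means and medians from the summary output shown earlier; the `rel_gap` column name is made up for illustration.

```r
# Means and medians copied from the summary output of this report.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.SO2", "total.SO2",
               "density", "pH", "sulphates", "alcohol"),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47,
             0.9967, 3.311, 0.6581, 10.42),
  median = c(7.90, 0.52, 0.26, 2.2, 0.079, 14, 38,
             0.9968, 3.31, 0.62, 10.20)
)

# Relative gap between mean and median: positive values hint at right skew.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Largest gaps first.
stats <- stats[order(-stats$rel_gap), ]
print(stats, digits = 3, row.names = FALSE)
```

With these numbers, total sulfur dioxide, residual sugar and free sulfur dioxide show the largest gaps, while density and pH are nearly symmetric. This is only a rough indicator and does not replace looking at the histograms.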

Density and chlorides are distributed over very small ranges. No matter the expertise of the oenologists who graded the wines, it would be hard to imagine them distinguishing variations over such a small range. Therefore, it is likely that these variables had a negligible impact on the final quality values.

Fixed acidity and alcohol, on the other hand, may very well have an important weight in the final grade.

Finally, because the quality histogram follows a Gaussian distribution, we should be cautious regarding our analyses and conclusions about low and high quality wines.

Bivariate analysis

First of all, we’re going to generate a correlation matrix, to gain general insights about the relationship between all of our variables.

Exploring correlations

Positive correlations are blue, negative correlations are red. Strong correlations are big and dark, weak correlations are small and light.
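The numbers behind such a plot come from a plain correlation matrix. As a minimal sketch, the snippet below computes one with base R's `cor()` on the first ten rows of a few columns, retyped from the `str()` output earlier in this report. Ten rows are far too few to reproduce the figure, so this only illustrates the mechanics; the `toy` data frame is an illustrative stand-in for `w`.

```r
# First ten rows of a few columns, retyped from the str() output.
toy <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations; symmetric, with 1s on the diagonal.
m <- cor(toy)
print(round(m, 2))

# Correlations with quality, strongest (in absolute value) first.
q <- m["quality", colnames(m) != "quality"]
print(round(q[order(-abs(q))], 2))
```

On the full dataset, the same call on `w` would give the values that the plot encodes as size and color.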

Here, we see that quality seems to be strongly tied to alcohol and volatile acidity, and to a lesser extent to sulphates and citric acid.

For the sake of our general wine chemical properties erudition, let’s also note that:

  • density is strongly correlated to fixed acidity and alcohol
  • pH is strongly correlated to fixed acidity and citric acid

It makes sense that citric acid, volatile acidity and fixed acidity are correlated, as well as free and total sulfur dioxide:

## 
##  Pearson's product-moment correlation
## 
## data:  w$free.SO2 and w$total.SO2
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

Total Sulfur Dioxide is made up of the amount of free and bound forms of SO2, so it makes sense to witness a strong regular correlation between both.
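The `t` value reported by `cor.test` follows directly from the correlation and the degrees of freedom: t = r * sqrt(df) / sqrt(1 - r^2). As a sanity check, the sketch below recomputes it from the numbers printed above; the helper name `t_from_r` is just an illustrative choice.

```r
# t statistic of a Pearson correlation test, from r and the degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

# Values copied from the free.SO2 / total.SO2 output above.
t_so2 <- t_from_r(r = 0.6676665, df = 1597)
print(round(t_so2, 2))  # 35.84, matching the reported t
```

With 1,597 degrees of freedom, even modest correlations produce large `t` values, which is why every p-value in this section is reported as below 2.2e-16.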

## 
##  Pearson's product-moment correlation
## 
## data:  w$citric.acid and w$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

Citric acid is a fixed acid, and adds freshness and flavor to the wine. On the other hand, volatile acidity corresponds to the amount of acetic acid, which can lead to an unpleasant vinegar taste at high levels. It makes sense that the more of a fixed acid we find in a wine, the less volatile acid there is.

Echoing what we just wrote and witnessed, it also makes sense that the more citric acid there is, the higher the level of fixed acidity. Of course, the correlation is strong but not perfect, since there are other fixed acids in wine (malic acid, tartaric acid…)

## 
##  Pearson's product-moment correlation
## 
## data:  w$citric.acid and w$fixed.acidity
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

## 
##  Pearson's product-moment correlation
## 
## data:  w$volatile.acidity and w$fixed.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3013681 -0.2097433
## sample estimates:
##        cor 
## -0.2561309

And to finish stating the obvious, the higher the level of volatile acidity, the lower the level of fixed acidity.

Validating insights with further bivariate analysis

## 
##  Pearson's product-moment correlation
## 
## data:  w$quality and w$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## w$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## w$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## w$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## w$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## w$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## w$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

There’s a moderately strong correlation between quality and alcohol: 0.476

Here, we can see that our quality grade definitely goes up with the alcohol rate medians. It would indicate that alcohol has an important impact on quality.

We can see here that quality seems to increase with the rate of alcohol:

  • For grade 5, the first and third quartiles are 9.40 and 10.20 degrees of alcohol, and the median is 9.70
  • For grade 6, they are 9.80 and 11.30, and the median is 10.50
  • For grade 7, they are 10.80 and 12.10, and the median is 11.50
  • For grade 8, they are 11.32 and 12.88, and the median is 12.15
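Per-grade summaries like the ones above can be produced with base R's `by()` or `tapply()`. As a minimal sketch, the snippet below uses only the first ten rows shown in the `str()` output, so its numbers are illustrative and do not match the full-data summaries; `toy` is a stand-in for `w`.

```r
# First ten rows of quality and alcohol, retyped from the str() output.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Median alcohol for each quality grade present in the toy data.
med <- tapply(toy$alcohol, toy$quality, median)
print(med)

# Min, quartiles, mean and max per grade, in the style of the output above.
print(by(toy$alcohol, toy$quality, summary))
```

On the full dataset, the equivalent call would be `by(w$alcohol, w$quality, summary)`.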

## 
##  Pearson's product-moment correlation
## 
## data:  w$quality and w$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## w$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## w$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## w$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## w$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## w$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## w$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

There’s a moderately strong inverse correlation between quality and volatile acidity: -0.391

We can see here that quality seems to decrease as volatile acidity increases:

  • For grade 3, the first and third quartiles are 0.6475 and 1.01 g / dm^3, and the median is 0.845
  • For grade 4, they are 0.53 and 0.87, and the median is 0.67
  • For grade 5, they are 0.46 and 0.67, and the median is 0.58
  • For grade 6, they are 0.38 and 0.60, and the median is 0.49
  • For grade 7, they are 0.30 and 0.485, and the median is 0.37
  • For grade 8, they are 0.335 and 0.4725, and the median is 0.37

## 
##  Pearson's product-moment correlation
## 
## data:  w$quality and w$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## w$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## w$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## w$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## w$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## w$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## w$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

There’s a weak correlation between quality and sulphates level: 0.251

Once again we can see a correlation, although it is definitely weaker than what we witnessed before:

  • For grade 3, the first and third quartiles are 0.5125 and 0.6150 g / dm^3, and the median is 0.5450
  • For grade 8, the first and third quartiles are 0.6900 and 0.8200 g / dm^3, and the median is 0.7400

Insights and features of interest

We’ve found that quality is correlated with alcohol and sulphates, and inversely correlated with volatile acidity.

We’ve confirmed that in general, the higher the rate of alcohol, the higher the level of sulphates and the lower the level of volatile acidity, the better the grade.

The relationship is especially strong between quality and alcohol, and extremely regular with volatile acidity.
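To make the "regular" part concrete, the sketch below retypes the per-grade medians from the `by()` outputs in this section and checks whether each one moves in a single direction from grade to grade. The `medians`, `steps` and `monotone` names are illustrative.

```r
# Medians per quality grade, copied from the by() outputs in this section.
medians <- data.frame(
  quality          = 3:8,
  alcohol          = c(9.925, 10.00, 9.70, 10.50, 11.50, 12.15),
  volatile.acidity = c(0.845, 0.67, 0.58, 0.49, 0.37, 0.37),
  sulphates        = c(0.545, 0.56, 0.58, 0.64, 0.74, 0.74)
)

# Change in each median from one grade to the next.
steps <- sapply(medians[-1], diff)
rownames(steps) <- paste(3:7, 4:8, sep = "->")
print(steps)

# TRUE when a median never moves against the expected direction.
monotone <- c(
  alcohol          = all(diff(medians$alcohol) >= 0),
  volatile.acidity = all(diff(medians$volatile.acidity) <= 0),
  sulphates        = all(diff(medians$sulphates) >= 0)
)
print(monotone)
```

On these reported medians, volatile acidity and sulphates move in one direction throughout, whereas the alcohol median dips at grade 5 before rising. Since we have little data for the lowest grades, that dip should be read with caution.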

We’ve also seen that:

  • free sulfur dioxide is positively correlated with total sulfur dioxide
  • citric acid is positively correlated with fixed acidity
  • citric acid is inversely correlated with volatile acidity
  • fixed acidity is inversely correlated with volatile acidity

Multivariate analysis

Scatter plot of Quality by Alcohol and Volatile Acidity

Let’s first focus our multivariate analysis about quality on the alcohol and volatile acidity variables, since our bivariate analysis showed they seem to be the most impactful.

We will omit the top 1% of the volatile acidity values to eliminate outliers, as depicted below:

library(ggplot2)

# Jittered points show every wine; the red boxplot marks the quartiles and outliers.
ggplot(w, aes(x = 1, y = volatile.acidity)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(alpha = 0.2, color = 'red')

Most of our data has a volatile acidity between 0.4 and a little above 0.6.
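The report does not show how the top 1% was removed, so the following is only one possible way to do it: compute the 99th percentile with `quantile()` and keep the rows below it. The snippet runs on the first ten volatile acidity values from the `str()` output, with `toy` standing in for `w`.

```r
# First ten volatile acidity values, retyped from the str() output.
toy <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# 99th percentile of volatile acidity (default quantile type 7).
cutoff <- quantile(toy$volatile.acidity, probs = 0.99)

# Keep only rows strictly below the cutoff, dropping the top 1%.
trimmed <- subset(toy, volatile.acidity < cutoff)
print(nrow(trimmed))
```

In a ggplot2 chart, a similar effect can be had by passing the same cutoff as the upper axis limit, at the cost of a warning about removed rows.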

We can identify a cluster here, loosely in the 11 to 13 range for alcohol degree, and 0.2 to 0.4 for volatile acidity, where dots tend to be high quality green. The higher the volatile acidity, the hotter the color. The same holds true for low alcohol levels, where yellow dominates.

Scatter plot of Quality by Alcohol and Sulphates

Sulphates seemed to be another impactful variable. Let’s plot it against alcohol and quality.

Once again, we can identify a cluster where the highest quality wines have an alcohol rate loosely in the 11 to 13 range, and a sulphate level between 0.6 and 0.9.

Insights

Quality seems to be impacted at the same time by alcohol rate, volatile acidity and sulphates levels.

Laying down the foundations for a model

summary(loess(I(quality) ~ I(volatile.acidity+ alcohol), data = w))
## Call:
## loess(formula = I(quality) ~ I(volatile.acidity + alcohol), data = w)
## 
## Number of Observations: 1599 
## Equivalent Number of Parameters: 5.13 
## Residual Standard Error: 0.7306 
## Trace of smoother matrix: 5.61  (exact)
## 
## Control settings:
##   span     :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE
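One detail worth flagging: inside `I()`, the `+` is ordinary arithmetic, so the call above smooths quality against a single predictor, the sum of volatile acidity and alcohol. If the intent was two separate predictors, a formula without `I()` would be needed. The sketch below contrasts the two formulas and fits the two-predictor version on simulated data. The `sim` data frame and its coefficients are invented purely for illustration and are not taken from the wine dataset.

```r
# Count the model terms each formula produces.
f_sum <- quality ~ I(volatile.acidity + alcohol)
f_two <- quality ~ volatile.acidity + alcohol
print(attr(terms(f_sum), "term.labels"))  # one term: the sum
print(attr(terms(f_two), "term.labels"))  # two separate terms

# Simulated stand-in data; NOT the wine dataset.
set.seed(1)
n <- 200
sim <- data.frame(
  volatile.acidity = runif(n, 0.12, 1.58),
  alcohol          = runif(n, 8.4, 14.9)
)
sim$quality <- 3 + 0.3 * sim$alcohol - 1.5 * sim$volatile.acidity +
  rnorm(n, sd = 0.5)

# Local regression surface over both predictors.
fit_two <- loess(f_two, data = sim)
print(fit_two)
```

Whether the two-predictor fit improves on the residual standard error reported above can only be judged by running it on `w` itself.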

Final Plots and Summary

Wrap up and final plots

This dataset contains 1,599 red wines with 11 variables on the chemical properties of the wine.

Univariate analysis enabled us to understand the distribution of each variable, and to set aside chlorides and density as likely negligible variables: their short ranges make it difficult to distinguish a real impact.

The correlation matrix then identified alcohol, volatile acidity and sulphates as potentially being the most impactful variables on quality. Of these, volatile acidity is close to normal after a logarithmic transformation, while alcohol is right skewed.

Further bivariate analysis confirmed these insights. When quality goes up, the volatile acidity median goes down, and the alcohol median goes up. The sulphates median goes up as well, although to a lesser extent.

Finally, our multivariate analysis confirmed our conjecture. We hypothesized that quality was linked to a high enough degree of alcohol, a low degree of volatile acidity and possibly, a high enough level of sulphates.

Our analysis clearly showed that there is a cluster of good quality wines, loosely in the 11 to 13 range for alcohol degree, 0.2 to 0.4 g / dm^3 for volatile acidity, and a sulphate level between 0.6 and 0.9 g / dm^3.
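As a minimal sketch, these loose bounds can be turned into a filter to count or inspect the wines inside the cluster. The `in_cluster` helper and the three example wines below are hypothetical; only the bounds come from the text.

```r
# Flag wines that fall inside the loosely defined "good quality" cluster.
# The bounds come from the text; in_cluster is a hypothetical helper name.
in_cluster <- function(alcohol, volatile.acidity, sulphates) {
  alcohol >= 11 & alcohol <= 13 &
    volatile.acidity >= 0.2 & volatile.acidity <= 0.4 &
    sulphates >= 0.6 & sulphates <= 0.9
}

# Three made-up wines: only the first sits inside all three ranges.
wines <- data.frame(
  alcohol          = c(12.0, 9.4, 12.5),
  volatile.acidity = c(0.30, 0.70, 0.35),
  sulphates        = c(0.75, 0.56, 1.20)
)
wines$in_cluster <- with(wines, in_cluster(alcohol, volatile.acidity, sulphates))
print(wines)
```

Applied to `w`, comparing the quality distribution inside and outside the flag would give a rough numeric check of what the scatter plots suggest.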

Final reflections

Insights

Alcohol rate, volatile acidity and sulphate levels all impact the final grade a wine receives from a panel of experts.

Limitations

We have to keep in mind the limitations of this model.

The dataset only contains 1,599 red wines. It is in no way representative of all the red wines across the world. Other variables may impact the quality of the wine: its preservation, its origin, its age, its cépage…

We also stated that we have very few values regarding low and high quality wine, so our conclusions must be taken cautiously.

Future work

I’d be interested in working with a dataset with additional properties, as written above: age, cépage, origin…

It would also be great to conduct a similar analysis about white wine, to see if the impactful variables are the same. Then, an analysis comparing red and white wine would be interesting.

And being from Reims, I’m definitely going to look for a Champagne dataset!