References: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
More information about this dataset can be found here.
This tidy dataset contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
We will conduct an Exploratory Data Analysis in order to develop intuition about this dataset, extract insights that may uncover relevant questions, and eventually prepare the development of predictive models.
A first question that comes to mind is:
Which chemical properties influence the quality of red wines?
We will start by conducting univariate analyses to identify variables that have little or no impact on wine quality, focusing on the variation of the variables.
Bivariate analyses will allow us to look deeper into the relationship between retained variables and quality. This should enable us to identify critical variables.
These critical variables will be further explored with multivariate analysis. We should then be able to make predictions about wine quality based on its chemical properties.
First of all, let’s get to know our dataset a little better.
Dimensions:
## [1] 1599 13
These are the names of our variables:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
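The loading step is not shown in this report. As a minimal sketch, a data frame like `w` could be read and inspected as follows. The inline CSV text is a stand-in for the real file, whose name and path are not given here; it reuses a few columns from the first three rows printed above, and `w_demo` is a hypothetical name.

```r
# Stand-in for the real CSV file, whose name is not given in this report.
# Values are the first three rows shown in the str() output above.
csv_text <- "X,fixed.acidity,volatile.acidity,alcohol,quality
1,7.4,0.7,9.4,5
2,7.8,0.88,9.8,5
3,7.8,0.76,9.8,5"

# With the real file, this would be a read.csv() call on its path.
w_demo <- read.csv(text = csv_text)

dim(w_demo)  # number of rows and columns
str(w_demo)  # name, type and first values of each column
```

On the full dataset, the same two calls would produce the dimensions and structure listing shown in this section.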
We’re going to rename the sulfur dioxide columns right away:
names(w)[names(w) == 'free.sulfur.dioxide'] <- 'free.SO2'
names(w)[names(w) == 'total.sulfur.dioxide'] <- 'total.SO2'
Summary:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.SO2 total.SO2
## Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00
## Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Let’s begin by looking at the distribution of our wines in terms of our different variables
## int [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
## num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## NULL
## NULL
## num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
## NULL
## num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
The ‘X’ variable serves as an index, and we won’t need it here. Let’s get rid of it right now so we don’t have to subset our dataframe all along the analysis.
w <- subset(w, select = - X)
To begin, we’re going to look at the individual distribution of our variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
In theory, grades can range from 0 to 10. In practice, they range from 3 to 8, with a median of 6 and a mean of 5.636.
Quality is roughly normally distributed. As a result, we have little data on very low and very high grades, and must be cautious when drawing conclusions about them.
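One way to see how thin the extremes are is to count wines per grade. The sketch below uses only the first ten quality values printed in the structure listing, so the counts are illustrative and do not reflect the full dataset.

```r
# First ten quality grades, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Force all eleven possible grades so that empty ones show up as zeros.
grade_counts <- table(factor(quality, levels = 0:10))
grade_counts

# Share of wines per grade.
round(prop.table(grade_counts), 2)
```

Run on the full `quality` column, the same table would show which grades are never awarded and how few wines sit in the tails.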
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity transformed logarithmically follows a normal distribution. Most values range from 4.60 to about 14 g / dm^3, with a few between 14 and 16. The mean is 8.32 and the median 7.9.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity transformed logarithmically follows a normal distribution. Values range from 0.12 to 1.58 g / dm^3, with a mean at 0.5278 and a median at 0.52.
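To illustrate why a logarithmic transformation helps with right-skewed variables, here is a small sketch on simulated data. The values are generated, not taken from the wine dataset, and the `skewness` helper is a hypothetical name defined only for this example.

```r
set.seed(42)
# Simulated right-skewed values (log-normal), NOT taken from the wine data.
x <- rlnorm(1000, meanlog = log(0.5), sdlog = 0.5)

# Sample skewness: the third standardized moment.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)         # clearly positive for right-skewed data
skew_log <- skewness(log10(x))  # much closer to zero after the transform
c(raw = skew_raw, log10 = skew_log)

# hist(log10(x)) would draw the histogram of the transformed values.
```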
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Although we can’t call it right skewed because counts remain relatively even across values, low-citric wines are more numerous than high-citric wines. Values range from 0 to 1 g / dm^3, but values at 1 are outliers. The mean is 0.271 and the median 0.26. The range is short: this variable might be negligible.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
For residual sugar we have a right skewed distribution with a few outliers above 10 g / dm^3.
Values range from 0.9 to 15.5 g / dm^3. The mean is 2.539 and the median 2.2.
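Outliers like these can also be flagged numerically. The sketch below applies the usual box plot rule (points beyond 1.5 times the hinge spread) to a handful of illustrative values; it is one common convention, not necessarily the criterion used for the statement above.

```r
# Illustrative values only: eight typical readings plus the maximum (15.5)
# reported in the summary above.
sugar <- c(1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 15.5)

# boxplot.stats() flags points lying more than 1.5 times the hinge spread
# (roughly the interquartile range) beyond the hinges.
stats <- boxplot.stats(sugar)
stats$out
```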
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides follow a right-skewed distribution as well. They range from 0.012 to 0.611 g / dm^3, with three clusters. The range is extremely small, so the impact of this variable might be negligible.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Most wines have low free SO2: the higher the free SO2 level, the lower the count. We have an outlier around 68 mg / dm^3. Values range from 1 to 72 mg / dm^3, with a mean at 15.87 and a median at 14.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Once again, wines with a low total SO2 level are more numerous, and the higher the level, the fewer wines there are in our sample. Values range from 6 to 289 mg / dm^3, with a mean at 46.47 and a median at 38.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density follows a normal distribution, ranging from 0.9901 to 1.0037 g / cm^3, with a mean at 0.9967 and a median at 0.9968. It is distributed over a very small range, so this variable might be negligible too.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH follows a normal distribution ranging from 2.74 to 4.01, with a few outliers around 2.75 and above 3.75. The mean is 3.311 and the median 3.31.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates follow a roughly normal distribution as well, a little skewed to the right. We have outliers around 1.6 and 1.8 g / dm^3. The values range from 0.33 to 2 g / dm^3, with a mean at 0.6581 and a median at 0.62.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol distribution is right skewed, ranging from 8.40 degrees by volume to 14.90. We have outliers below 9, and above 14. The mean is 10.42 and the median 10.20.
The main feature of interest of this dataset is the quality variable, which is supposedly impacted by all the other variables.
Most of our variables were highly right skewed, so we had to use logarithmic transformations.
Density and chlorides are distributed over very small ranges. No matter the expertise of the oenologists (at least three per wine) who graded the wines, it is hard to imagine them distinguishing variations over such a small range. Therefore, it is likely that these variables had a negligible impact on the final quality values.
Fixed acidity and alcohol, on the other hand, may very well have an important weight in the final grade.
Finally, because the quality histogram follows a Gaussian distribution, we should be cautious regarding our analyses and conclusions about low and high quality wines.
First of all, we’re going to generate a correlation matrix, to gain general insights about the relationship between all of our variables.
Positive correlations are blue, negative correlations are red. Strong correlations are big and dark, weak correlations are small and light.
Here, we see that quality seems to be strongly tied to alcohol and volatile acidity, and to a lesser extent to sulphates and citric acid.
For the sake of our general wine chemical properties erudition, let’s also note that:
- density is strongly correlated to fixed acidity and alcohol
- pH is strongly correlated to fixed acidity and citric acid
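The numbers behind a correlation plot like the one described above come from a plain correlation matrix. Here is a minimal sketch on the first ten rows of four columns, copied from the structure listing; with so few rows the coefficients will differ from those of the full dataset. The package used to draw the figure is not named in the text, so only the base R computation is shown, and `w10` is a hypothetical name.

```r
# First ten rows of four columns, copied from the str() output above.
w10 <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations; the matrix is symmetric with 1s on the diagonal.
cor_mat <- round(cor(w10), 2)
cor_mat
```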
It makes sense that citric acid, volatile acidity and fixed acidity are correlated, as well as free and total sulfur dioxide:
##
## Pearson's product-moment correlation
##
## data: w$free.SO2 and w$total.SO2
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
Total Sulfur Dioxide is made up of the amount of free and bound forms of SO2, so it makes sense to witness a strong regular correlation between both.
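Output of this shape is what `cor.test()` prints for two numeric vectors. As a sketch, the call can be reproduced on the first ten free and total sulfur dioxide values from the structure listing; the estimate on ten rows will not match the full-data value of 0.668.

```r
# First ten values of each column, copied from the str() output above.
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

res <- cor.test(free_so2, total_so2)  # Pearson's correlation by default
res$estimate   # correlation coefficient
res$conf.int   # 95 percent confidence interval
res$parameter  # degrees of freedom, n - 2
```

The degrees of freedom are always the number of observations minus two, which is why the full dataset reports `df = 1597`.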
##
## Pearson's product-moment correlation
##
## data: w$citric.acid and w$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
Citric acid is a fixed acid, and adds freshness and flavor to the wine. On the other hand, volatile acidity corresponds to the amount of acetic acid, which can lead to an unpleasant vinegar taste at high levels. It makes sense that the more of a fixed acid we find in a wine, the less volatile acid there is.
Echoing what we just wrote and witnessed, it also makes sense that the more citric acid there is, the higher the level of fixed acidity. Of course, the correlation is strong but not perfect, since there are other fixed acids in wine (malic acid, tartaric acid…)
##
## Pearson's product-moment correlation
##
## data: w$citric.acid and w$fixed.acidity
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
##
## Pearson's product-moment correlation
##
## data: w$volatile.acidity and w$fixed.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3013681 -0.2097433
## sample estimates:
## cor
## -0.2561309
And to finish stating the obvious, the higher the level of volatile acidity, the lower the level of fixed acidity.
##
## Pearson's product-moment correlation
##
## data: w$quality and w$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
## w$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## w$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## w$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## w$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## w$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## w$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
There’s a moderately strong correlation between quality and alcohol: 0.476
Here, we can see that the median alcohol level rises with the quality grade from grade 5 upwards. This suggests that alcohol has an important impact on quality.
For each of these grades, the middle 50% of wines (between the first and third quartiles) fall in the following ranges:
- For grade 5, between 9.40 and 10.20 degrees of alcohol, with a median of 9.70
- For grade 6, between 9.80 and 11.30, with a median of 10.50
- For grade 7, between 10.80 and 12.10, with a median of 11.50
- For grade 8, between 11.32 and 12.88, with a median of 12.15
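The per-grade summaries above have the layout that `by()` prints when given a numeric vector, a grouping vector and `summary`. A minimal sketch on the first ten rows from the structure listing follows; only grades 5, 6 and 7 appear in those rows, so the figures are illustrative.

```r
# First ten alcohol and quality values, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# One summary() per quality grade, printed group by group.
by(alcohol, quality, summary)

# Just the medians, as a named vector indexed by grade.
alcohol_medians <- tapply(alcohol, quality, median)
alcohol_medians
```

The `tapply()` form is handy when only one statistic per grade is needed, for example to compare medians across grades.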
##
## Pearson's product-moment correlation
##
## data: w$quality and w$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
## w$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## w$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## w$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## w$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## w$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## w$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
There’s a moderately strong inverse correlation between quality and volatile acidity: -0.391
We can see here that quality seems to decrease as volatile acidity rises. For each grade, the middle 50% of wines (between the first and third quartiles) fall in the following ranges:
- For grade 3, between 0.6475 and 1.01 g / dm^3, with a median of 0.845
- For grade 4, between 0.53 and 0.87, with a median of 0.67
- For grade 5, between 0.46 and 0.67, with a median of 0.58
- For grade 6, between 0.38 and 0.60, with a median of 0.49
- For grade 7, between 0.30 and 0.485, with a median of 0.37
- For grade 8, between 0.335 and 0.4725, with a median of 0.37
##
## Pearson's product-moment correlation
##
## data: w$quality and w$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
## w$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## w$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## w$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## w$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## w$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## w$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
There’s a weak correlation between quality and sulphates level: 0.251
Once again we can see a correlation, although it is definitely weaker than what we witnessed before:
- For grade 3, the middle 50% of wines have between 0.5125 and 0.6150 g / dm^3 of sulphates, with a median of 0.5450
- For grade 8, the middle 50% of wines have between 0.6900 and 0.8200 g / dm^3 of sulphates, with a median of 0.7400
We’ve found that quality is correlated with alcohol and sulphates, and inversely correlated with volatile acidity.
We’ve confirmed that in general, the higher the alcohol level, the higher the sulphates level and the lower the volatile acidity, the better the grade.
The relationship is especially strong between quality and alcohol, and extremely regular with volatile acidity.
We’ve also seen that:
- free sulfur dioxide is positively correlated with total sulfur dioxide
- citric acid is positively correlated with fixed acidity
- citric acid is inversely correlated with volatile acidity
- fixed acidity is inversely correlated with volatile acidity
Let’s first focus our multivariate analysis of quality on the alcohol and volatile acidity variables, since our bivariate analysis showed they seem to be the most impactful.
We will omit the top 1% of the volatile acidity values to eliminate outliers, which show up in the box plot below:
library(ggplot2)
# Jittered points plus a box plot to reveal outliers in volatile acidity
ggplot(w, aes(x = 1, y = volatile.acidity)) +
  geom_jitter(alpha = 0.1) +
  geom_boxplot(alpha = 0.2, color = 'red')
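The trimming step itself is not shown in the report. One way to drop the top 1% of a column is to filter on its 99th percentile, sketched here on an illustrative data frame of 100 evenly spaced values (`demo`, `cutoff` and `trimmed` are hypothetical names).

```r
# Illustrative data frame: 100 evenly spaced volatile acidity values.
demo <- data.frame(volatile.acidity = (1:100) / 100)

# 99th percentile of volatile acidity.
cutoff <- quantile(demo$volatile.acidity, 0.99)

# Keep only the rows strictly below the cutoff.
trimmed <- subset(demo, volatile.acidity < cutoff)
nrow(trimmed)
```

The same filter can be written inline in a plotting call, so that the original data frame stays untouched.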
Most of our data has a volatile acidity between 0.4 and a little above 0.6.
We can identify a cluster here, loosely in the 11 to 13 range for alcohol degree, and 0.2 and 0.4 for volatile acidity, where dots tend to be high quality green. The higher the volatile acidity, the hotter the color. The same holds true for low alcohol levels, where yellow dominates.
Sulphates seemed to be another impactful variable. Let’s plot it against alcohol and quality.
Once again, we can identify a cluster where the highest quality wines have an alcohol rate loosely in the 11 to 13 range, and a sulphate level between 0.6 and 0.9.
Quality seems to be impacted at the same time by alcohol rate, volatile acidity and sulphates levels.
summary(loess(I(quality) ~ I(volatile.acidity+ alcohol), data = w))
## Call:
## loess(formula = I(quality) ~ I(volatile.acidity + alcohol), data = w)
##
## Number of Observations: 1599
## Equivalent Number of Parameters: 5.13
## Residual Standard Error: 0.7306
## Trace of smoother matrix: 5.61 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
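Note that wrapping the predictors in `I(volatile.acidity + alcohol)` makes `loess()` smooth quality against the arithmetic sum of the two columns, that is, a single predictor. If a surface over both variables is wanted instead, the two terms can be listed separately. The sketch below contrasts the two forms on simulated data; the coefficients used to generate `sim` are arbitrary assumptions and the data are not the wine dataset.

```r
set.seed(1)
# Simulated stand-in data, NOT the wine dataset; coefficients are arbitrary.
n <- 200
sim <- data.frame(
  volatile.acidity = runif(n, 0.1, 1.2),
  alcohol          = runif(n, 8.5, 14)
)
sim$quality <- 3 + 0.3 * sim$alcohol - 1.5 * sim$volatile.acidity +
  rnorm(n, sd = 0.5)

# One predictor: the sum of the two columns, as in the call above.
fit_sum <- loess(quality ~ I(volatile.acidity + alcohol), data = sim)

# Two predictors: a local regression surface over both variables.
fit_two <- loess(quality ~ volatile.acidity + alcohol, data = sim)

ncol(fit_sum$x)  # number of predictor columns in the first fit
ncol(fit_two$x)  # number of predictor columns in the second fit
```

Comparing the residual standard error reported by `summary()` for each fit would indicate whether treating the variables separately describes quality better.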
This dataset contains 1,599 red wines with 11 variables on the chemical properties of the wine.
Univariate analysis enabled us to understand the distribution of each variable, and to set aside chlorides and density as likely negligible: their short range makes a real impact hard to distinguish. Alcohol, volatile acidity and sulphates, on the other hand, were identified as potentially being the most impactful variables on quality. The first two follow a normal distribution like the quality variable.
Our bivariate analysis confirmed the insights brought out by our univariate analysis. When quality goes up, the volatile acidity median goes down, and the alcohol median goes up. The sulphates median goes up as well, although to a lesser extent.
Finally, our multivariate analysis confirmed our conjecture. We hypothesized that quality was linked to a high enough degree of alcohol, a low degree of volatile acidity and possibly, a high enough level of sulphates.
Our analysis clearly showed that there is a cluster of good quality wines, loosely in the 11 to 13 range for alcohol degree, 0.2 to 0.4 g / dm^3 for volatile acidity, and a sulphate level between 0.6 and 0.9 g / dm^3.
Alcohol rate, volatile acidity and sulphate levels all impact the final grade a wine receives from a panel of experts.
We have to keep in mind the limitations of this model.
The dataset only contains 1,599 red wines. It is in no way representative of all the red wines across the world. Other variables may impact the quality of the wine: its preservation, its origin, its age, its cépage…
We also stated that we have very few values regarding low and high quality wine, so our conclusions must be taken cautiously.
I’d be interested in working with a dataset with additional properties, as written above: age, cépage, origin…
It would also be great to conduct a similar analysis about white wine, to see if the impactful variables are the same. Then, an analysis comparing red and white wine would be interesting.
And being from Reims, I’m definitely going to look for a Champagne dataset!