The ‘diamonds’ dataset is one of the datasets provided with the ggplot2 R package. We’re going to see if we can predict the price of a diamond based on its characteristics.
We will first conduct an EDA to get to know the data and analyze the impact of the different variables. We will then push the analysis further in order to build a linear model and use it to predict prices.
NOTE: This notebook contains two analyses condensed in one. While they definitely are related, there are a few points that need to be revised to help the reader follow the logic easily. This will be done in the near future, but we found that this notebook was interesting enough to be published in its raw form, waiting for revision.
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
Loading required package: lattice
Loading required package: MASS
Attaching package: ‘memisc’
The following object is masked from ‘package:scales’:
percent
The following objects are masked from ‘package:stats’:
contr.sum, contr.treatment, contrasts
The following object is masked from ‘package:base’:
as.array
library(RColorBrewer)
data("diamonds")
Dimensions of the dataset:
dim(diamonds)
[1] 53940 10
Name of the variables:
str(diamonds)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 53940 obs. of 10 variables:
$ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num 55 61 65 58 58 57 57 55 61 61 ...
$ price : int 326 326 327 334 335 336 336 337 337 338 ...
$ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Summary:
summary(diamonds)
carat cut color clarity depth table price
Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00 Min. :43.00 Min. : 326
1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950
Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80 Median :57.00 Median : 2401
Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75 Mean :57.46 Mean : 3933
3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324
Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00 Max. :95.00 Max. :18823
J: 2808 (Other): 2531
x y z
Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
Median : 5.700 Median : 5.710 Median : 3.530
Mean : 5.731 Mean : 5.735 Mean : 3.539
3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
Max. :10.740 Max. :58.900 Max. :31.800
Levels of our cateogricalvariables:
levels(diamonds$cut)
[1] "Fair" "Good" "Very Good" "Premium" "Ideal"
levels(diamonds$color)
[1] "D" "E" "F" "G" "H" "I" "J"
levels(diamonds$clarity)
[1] "I1" "SI2" "SI1" "VS2" "VS1" "VVS2" "VVS1" "IF"
Let’s jump right into it and focus our analysis on price.
qplot(data = diamonds, x = price)
summary(diamonds$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
326 950 2401 3933 5324 18823
The distribution of diamonds prices is clearly right-skewed.
Let’s get some numbers:
sum(diamonds$price < 500)
[1] 1729
sum(diamonds$price < 250)
[1] 0
sum(diamonds$price >= 15000)
[1] 1656
Our dataset contains: - 1729 diamonds with a price below $500 - 0 diamonds with a price below $250 - 15,000 diamonds with a price equal to or above $15,000
Let’s get a look at the cheapest diamonds:
qplot(data = diamonds, x = price,
binwidth = 20) +
scale_x_continuous(limits = c(0, 1500), breaks = seq(0, 1500, 100))
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Mode(diamonds$price)
[1] 605
The mode of the cheapest diamonds (with a price between $0 and $1,500) is 605.
Let’s facet our prices by the quality of the cut.
qplot(data = diamonds, x = price) +
facet_wrap(~cut, ncol= 2, scales = 'free_y')
by(diamonds$price, diamonds$cut, summary)
diamonds$cut: Fair
Min. 1st Qu. Median Mean 3rd Qu. Max.
337 2050 3282 4359 5206 18574
---------------------------------------------------------------------------------------------
diamonds$cut: Good
Min. 1st Qu. Median Mean 3rd Qu. Max.
327 1145 3050 3929 5028 18788
---------------------------------------------------------------------------------------------
diamonds$cut: Very Good
Min. 1st Qu. Median Mean 3rd Qu. Max.
336 912 2648 3982 5373 18818
---------------------------------------------------------------------------------------------
diamonds$cut: Premium
Min. 1st Qu. Median Mean 3rd Qu. Max.
326 1046 3185 4584 6296 18823
---------------------------------------------------------------------------------------------
diamonds$cut: Ideal
Min. 1st Qu. Median Mean 3rd Qu. Max.
326 878 1810 3458 4678 18806
qplot(data = diamonds, x = price / carat, binwidth = 0.1) +
facet_wrap(~cut) +
scale_x_log10()
The price per carat definitely seems to increase with the quality of the cut.
qplot(data = diamonds, x = clarity, y = price, geom = 'boxplot')
by(diamonds$price, diamonds$clarity, summary)
diamonds$clarity: I1
Min. 1st Qu. Median Mean 3rd Qu. Max.
345 2080 3344 3924 5161 18531
---------------------------------------------------------------------------------------------
diamonds$clarity: SI2
Min. 1st Qu. Median Mean 3rd Qu. Max.
326 2264 4072 5063 5777 18804
---------------------------------------------------------------------------------------------
diamonds$clarity: SI1
Min. 1st Qu. Median Mean 3rd Qu. Max.
326 1089 2822 3996 5250 18818
---------------------------------------------------------------------------------------------
diamonds$clarity: VS2
Min. 1st Qu. Median Mean 3rd Qu. Max.
334 900 2054 3925 6024 18823
---------------------------------------------------------------------------------------------
diamonds$clarity: VS1
Min. 1st Qu. Median Mean 3rd Qu. Max.
327 876 2005 3839 6023 18795
---------------------------------------------------------------------------------------------
diamonds$clarity: VVS2
Min. 1st Qu. Median Mean 3rd Qu. Max.
336.0 794.2 1311.0 3283.7 3638.2 18768.0
---------------------------------------------------------------------------------------------
diamonds$clarity: VVS1
Min. 1st Qu. Median Mean 3rd Qu. Max.
336 816 1093 2523 2379 18777
---------------------------------------------------------------------------------------------
diamonds$clarity: IF
Min. 1st Qu. Median Mean 3rd Qu. Max.
369 895 1080 2865 2388 18806
qplot(data = diamonds, x = color, y = price, geom = 'boxplot')
by(diamonds$price, diamonds$color, summary)
diamonds$color: D
Min. 1st Qu. Median Mean 3rd Qu. Max.
357 911 1838 3170 4214 18693
---------------------------------------------------------------------------------------------
diamonds$color: E
Min. 1st Qu. Median Mean 3rd Qu. Max.
326 882 1739 3077 4003 18731
---------------------------------------------------------------------------------------------
diamonds$color: F
Min. 1st Qu. Median Mean 3rd Qu. Max.
342 982 2344 3725 4868 18791
---------------------------------------------------------------------------------------------
diamonds$color: G
Min. 1st Qu. Median Mean 3rd Qu. Max.
354 931 2242 3999 6048 18818
---------------------------------------------------------------------------------------------
diamonds$color: H
Min. 1st Qu. Median Mean 3rd Qu. Max.
337 984 3460 4487 5980 18803
---------------------------------------------------------------------------------------------
diamonds$color: I
Min. 1st Qu. Median Mean 3rd Qu. Max.
334 1120 3730 5092 7202 18823
---------------------------------------------------------------------------------------------
diamonds$color: J
Min. 1st Qu. Median Mean 3rd Qu. Max.
335 1860 4234 5324 7695 18710
qplot(data = diamonds, x = color, y = price / carat, geom = 'boxplot') +
coord_cartesian(ylim = c(250, 6000))
qplot(data = diamonds, x = carat, binwidth = 0.01, geom = 'freqpoly') +
scale_x_continuous(limits = c(0, 1.5), breaks = seq(0, 1.5, 0.1))
ggplot(aes(x = x , y = price), data = diamonds) +
geom_jitter(alpha= 1/20) +
xlim(3, 9)
cor.test(diamonds$price, diamonds$x)
Pearson's product-moment correlation
data: diamonds$price and diamonds$x
t = 440.16, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8825835 0.8862594
sample estimates:
cor
0.8844352
cor.test(diamonds$price, diamonds$y)
Pearson's product-moment correlation
data: diamonds$price and diamonds$y
t = 401.14, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8632867 0.8675241
sample estimates:
cor
0.8654209
cor.test(diamonds$price, diamonds$z)
Pearson's product-moment correlation
data: diamonds$price and diamonds$z
t = 393.6, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8590541 0.8634131
sample estimates:
cor
0.8612494
ggplot(aes(x = depth, y = price), data = diamonds) +
geom_point(alpha = 0.05,
position = position_jitter(h = 0),
color = 'orange')
cor.test(diamonds$price, diamonds$depth)
Pearson's product-moment correlation
data: diamonds$price and diamonds$depth
t = -2.473, df = 53938, p-value = 0.0134
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.019084756 -0.002208537
sample estimates:
cor
-0.0106474
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point() +
xlim(0, quantile(diamonds$carat, 0.99)) +
ylim(0, quantile(diamonds$price, 0.99))
diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
ggplot(aes(x = volume, y = price), data = diamonds) +
geom_point()
with(subset(diamonds, volume > 0 & volume < 800), cor(price, volume))
[1] 0.9235455
ggplot(aes(x = volume, y = price), data = subset(diamonds, volume > 0 & volume < 800)) +
geom_point(alpha = 1/20) +
geom_smooth(method = 'lm', color = 'red')
library(dplyr)
package ‘dplyr’ was built under R version 3.4.1
Attaching package: ‘dplyr’
The following objects are masked from ‘package:memisc’:
collect, recode, rename
The following object is masked from ‘package:MASS’:
select
The following object is masked from ‘package:GGally’:
nasa
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
diamondsByClarity <- diamonds %>%
group_by(clarity) %>%
summarize(mean_price = mean(price),
median_price = median(price),
min_price = min(price),
max_price = max(price),
n = n()) %>%
arrange(clarity)
head(diamondsByClarity)
diamonds_by_clarity <- group_by(diamonds, clarity)
diamonds_mp_by_clarity <- summarise(diamonds_by_clarity, mean_price = mean(price))
diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))
p1 <- ggplot(aes(x = clarity, y = mean_price), data = diamonds_mp_by_clarity) +
geom_bar(stat = 'identity')
p2 <- ggplot(aes(x = color, y = mean_price), data = diamonds_mp_by_color) +
geom_bar(stat = 'identity')
library(gridExtra)
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
grid.arrange(p1, p2, ncol = 1)
There’s something very odd here. It goes against the intuition that price goes down when color and clarity are better. Maybe there’s another variable influencing this behavior. Let’s look at the influence of the cut.
diamonds_by_cut <- group_by(diamonds, cut)
diamonds_mp_by_cut <- summarise(diamonds_by_cut, mean_price = mean(price))
p3 <- ggplot(aes(x = cut, y = mean_price), data = diamonds_mp_by_cut) +
geom_bar(stat = 'identity')
grid.arrange(p1, p2, p3, ncol = 1)
library(ggplot2)
data(diamonds)
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(fill = I('#F79420'), color = I('black'), shape = 21) +
xlim(0, quantile(diamonds$carat, .99)) +
ylim(0, quantile(diamonds$price, .99))
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(color = '#F79420', alpha = 1/4) +
stat_smooth(method = 'lm') +
scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) +
scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99)))
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp,
axisLabels = 'internal',
lower = list(continuous = wrap('points', shape = I('.'))),
upper = list(combo = wrap('box', outlier.shape = I('.'))))
plot: [1,1] [=----------------------------------------------------------------------------------------------] 1% est: 0s
plot: [1,2] [==---------------------------------------------------------------------------------------------] 2% est:12s
plot: [1,3] [===--------------------------------------------------------------------------------------------] 3% est:25s
plot: [1,4] [====-------------------------------------------------------------------------------------------] 4% est:28s
plot: [1,5] [=====------------------------------------------------------------------------------------------] 5% est:31s
plot: [1,6] [======-----------------------------------------------------------------------------------------] 6% est:27s
plot: [1,7] [=======----------------------------------------------------------------------------------------] 7% est:25s
plot: [1,8] [========---------------------------------------------------------------------------------------] 8% est:22s
plot: [1,9] [=========--------------------------------------------------------------------------------------] 9% est:20s
plot: [1,10] [=========-------------------------------------------------------------------------------------] 10% est:19s
plot: [2,1] [==========-------------------------------------------------------------------------------------] 11% est:18s
plot: [2,2] [===========------------------------------------------------------------------------------------] 12% est:19s
plot: [2,3] [============-----------------------------------------------------------------------------------] 13% est:18s
plot: [2,4] [=============----------------------------------------------------------------------------------] 14% est:19s
plot: [2,5] [==============---------------------------------------------------------------------------------] 15% est:20s
plot: [2,6] [===============--------------------------------------------------------------------------------] 16% est:20s
plot: [2,7] [================-------------------------------------------------------------------------------] 17% est:20s
plot: [2,8] [=================------------------------------------------------------------------------------] 18% est:21s
plot: [2,9] [==================-----------------------------------------------------------------------------] 19% est:21s
plot: [2,10] [===================---------------------------------------------------------------------------] 20% est:21s
plot: [3,1] [====================---------------------------------------------------------------------------] 21% est:21s
plot: [3,2] [=====================--------------------------------------------------------------------------] 22% est:22s
plot: [3,3] [======================-------------------------------------------------------------------------] 23% est:22s
plot: [3,4] [=======================------------------------------------------------------------------------] 24% est:21s
plot: [3,5] [========================-----------------------------------------------------------------------] 25% est:22s
plot: [3,6] [=========================----------------------------------------------------------------------] 26% est:22s
plot: [3,7] [==========================---------------------------------------------------------------------] 27% est:22s
plot: [3,8] [===========================--------------------------------------------------------------------] 28% est:23s
plot: [3,9] [============================-------------------------------------------------------------------] 29% est:23s
plot: [3,10] [============================------------------------------------------------------------------] 30% est:23s
plot: [4,1] [=============================------------------------------------------------------------------] 31% est:23s
plot: [4,2] [==============================-----------------------------------------------------------------] 32% est:24s
plot: [4,3] [===============================----------------------------------------------------------------] 33% est:24s
plot: [4,4] [================================---------------------------------------------------------------] 34% est:25s
plot: [4,5] [=================================--------------------------------------------------------------] 35% est:25s
plot: [4,6] [==================================-------------------------------------------------------------] 36% est:25s
plot: [4,7] [===================================------------------------------------------------------------] 37% est:26s
plot: [4,8] [====================================-----------------------------------------------------------] 38% est:26s
plot: [4,9] [=====================================----------------------------------------------------------] 39% est:26s
plot: [4,10] [======================================--------------------------------------------------------] 40% est:26s
plot: [5,1] [=======================================--------------------------------------------------------] 41% est:26s
plot: [5,2] [========================================-------------------------------------------------------] 42% est:25s
plot: [5,3] [=========================================------------------------------------------------------] 43% est:24s
plot: [5,4] [==========================================-----------------------------------------------------] 44% est:24s
plot: [5,5] [===========================================----------------------------------------------------] 45% est:24s
plot: [5,6] [============================================---------------------------------------------------] 46% est:23s
plot: [5,7] [=============================================--------------------------------------------------] 47% est:22s
plot: [5,8] [==============================================-------------------------------------------------] 48% est:21s
plot: [5,9] [===============================================------------------------------------------------] 49% est:21s
plot: [5,10] [===============================================-----------------------------------------------] 50% est:20s
plot: [6,1] [================================================-----------------------------------------------] 51% est:19s
plot: [6,2] [=================================================----------------------------------------------] 52% est:19s
plot: [6,3] [==================================================---------------------------------------------] 53% est:18s
plot: [6,4] [===================================================--------------------------------------------] 54% est:18s
plot: [6,5] [====================================================-------------------------------------------] 55% est:18s
plot: [6,6] [=====================================================------------------------------------------] 56% est:17s
plot: [6,7] [======================================================-----------------------------------------] 57% est:16s
plot: [6,8] [=======================================================----------------------------------------] 58% est:16s
plot: [6,9] [========================================================---------------------------------------] 59% est:15s
plot: [6,10] [========================================================--------------------------------------] 60% est:15s
plot: [7,1] [==========================================================-------------------------------------] 61% est:14s
plot: [7,2] [===========================================================------------------------------------] 62% est:14s
plot: [7,3] [============================================================-----------------------------------] 63% est:13s
plot: [7,4] [=============================================================----------------------------------] 64% est:13s
plot: [7,5] [==============================================================---------------------------------] 65% est:13s
plot: [7,6] [===============================================================--------------------------------] 66% est:12s
plot: [7,7] [================================================================-------------------------------] 67% est:12s
plot: [7,8] [=================================================================------------------------------] 68% est:11s
plot: [7,9] [==================================================================-----------------------------] 69% est:11s
plot: [7,10] [==================================================================----------------------------] 70% est:10s
plot: [8,1] [===================================================================----------------------------] 71% est:10s
plot: [8,2] [====================================================================---------------------------] 72% est: 9s
plot: [8,3] [=====================================================================--------------------------] 73% est: 9s
plot: [8,4] [======================================================================-------------------------] 74% est: 9s
plot: [8,5] [=======================================================================------------------------] 75% est: 9s
plot: [8,6] [========================================================================-----------------------] 76% est: 8s
plot: [8,7] [=========================================================================----------------------] 77% est: 8s
plot: [8,8] [==========================================================================---------------------] 78% est: 7s
plot: [8,9] [===========================================================================--------------------] 79% est: 7s
plot: [8,10] [===========================================================================-------------------] 80% est: 7s
plot: [9,1] [=============================================================================------------------] 81% est: 6s
plot: [9,2] [==============================================================================-----------------] 82% est: 6s
plot: [9,3] [===============================================================================----------------] 83% est: 5s
plot: [9,4] [================================================================================---------------] 84% est: 5s
plot: [9,5] [=================================================================================--------------] 85% est: 5s
plot: [9,6] [==================================================================================-------------] 86% est: 5s
plot: [9,7] [===================================================================================------------] 87% est: 4s
plot: [9,8] [====================================================================================-----------] 88% est: 4s
plot: [9,9] [=====================================================================================----------] 89% est: 3s
plot: [9,10] [=====================================================================================---------] 90% est: 3s
plot: [10,1] [======================================================================================--------] 91% est: 3s
plot: [10,2] [======================================================================================--------] 92% est: 2s
plot: [10,3] [=======================================================================================-------] 93% est: 2s
plot: [10,4] [========================================================================================------] 94% est: 2s
plot: [10,5] [=========================================================================================-----] 95% est: 2s
plot: [10,6] [==========================================================================================----] 96% est: 1s
plot: [10,7] [===========================================================================================---] 97% est: 1s
plot: [10,8] [============================================================================================--] 98% est: 1s
plot: [10,9] [=============================================================================================-] 99% est: 0s
plot: [10,10] [=============================================================================================]100% est: 0s
Here is what we can see from the matrice. In the lower triangle, Ggplot uses: - grouped histograms for qualitative / qualitative pairs - scatter plots for quantitative / quantitative pairs
In the upper triangle, Ggplot uses: - grouped histograms for qualitative / qualitative pairs, this time using the x instead of the y variable as the grouping factor - box plots for qualitative / quantitative pairs - correlation coefficients for quantitative / quantitative pairs
We can see what might be relationship between price and clarity and price and color, which we will keep in mind for later when we’ll start modeling our data.
The critical factor driving price is the size, or the carat weight of the diamond. As we saw before, the relationship between price and diamond size is nonlinear. What might explain this pattern? On the supply side, larger continuous chunks of diamonds without significant flaws are probably harder to find than smaller ones. This might explain this sort of exponential looking curve.
This is related to the fact that the weight of a diamond is a function of volume, and volume is a function of the length times the width times the height of a diamond. This suggests that we might be especially interested in the cube root of carat wieght.
It’s often the case that leveraging substantive knowledge about your data like this can lead to especially fruitful transformations.
On the demand side, customers in the market for a less expensive, smaller diamond are probably more sensitive to price than more well-to-do buyers. Many less than one carat customers would surely never buy a diamond, were it not for the social norm of presenting one when proposing.
There are fewer customers who can afford a bigger diamond that is larger than one carat, hence we shouldn’t expect the market for bigger diamonds to be as competitive as the one for smaller diamonds. So it makes sense that the variance as well as the price would increase with carat size.
Often, the distribution of any monetary variable like dollars will be highly skewed and vary over orders of magnitude. This can result from path dependence (the rich getting richer), or multiplicative processes like year on year inflation, or some combination of both.
Hence it’s a good idea to look into compressing any such variable by putting it on a log scale.
library(gridExtra)
plot1 <- qplot(data = diamonds, x = price, binwidth = 100, fill = I('#099DD9')) +
ggtitle('Price')
plot2 <- qplot(data = diamonds, x = price, binwidth = .01, fill = I('#F79420')) +
ggtitle('Price (log10)') +
scale_x_log10()
grid.arrange(plot1, plot2)
We can see that the prices for diamonds are pretty heavily skewed.
But when we put thos prices on a log10 scale, they seem much better behaved: they are much closer to the bell curve of a normal distribution. We can even see a little bit of evidence of bimodality on this log10 scale, which is consistent with our two class rich buyer and poor buyer speculation about the nature of customers for diamonds.
qplot(carat, price, data = diamonds) +
scale_y_continuous(trans = log10_trans()) +
ggtitle('Price (log10) by Carat')
This plot looks better than before. On the log 10 scale, the prices look less dispersed at the high end of Carat size and price, but we can do better.
We’re trying to use the cube root of Carat in light of our speculation about flaws being exponentially more likely in diamonds with more volume (volume is on a cubic scale).
First, we need a function to transform the Carat variable.
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
We can see that with these transformations we used to get our data on this scale, things look almost linear; We can now move forward and see about modelling our data using just a linear model.
So far, we haven’t done anything about overplotting (when multiple points take on the same value).
head(sort(table(diamonds$carat), decreasing = T))
0.3 0.31 1.01 0.7 0.32 1
2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = T))
605 802 625 828 776 698
132 127 126 125 124 121
As we can see, we have a vast amount of points at the same price, which will result in some serious overplotting. This can obscure the density and the sparsity of our data at really key points. We can deal with this by making our points smaller, jittering them, and adding transparency.
ggplot(aes(carat, price), data = diamonds) +
geom_point(alpha = 0.5, size = 0.75, position = 'jitter') +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
We can see what looks like an almost linear relationship between carat weight and price after doing some transformations. But surely there are other factors that must influence the price of a diamond.
Clarity seem to factor into price. However, many consumers are looking for a diamond of a minimum size, so we shouldn’t expect clarity to be as strong a factor as carat weight.
According to Blue Nile, the cut of a diamond has a much more consequential impact on that fiery quality that jewelers describe when they talk about diamonds.
Many of the imperfections on clarity are microscopic and do not affect the diamonds beauty in any discernible way.
Let’s see if clarity, cut or color can explain some of the variants in price when we visualize it on our plot using color.
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
Clarity does seem to explain an awful lot of the remaining variance in price, after adding color to our plot. Holding carat weight constant, looking at one part of the plot, we the diamonds with lower clarity are almost always cheaper than diamonds with better clarity.
Alter the code below.
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
Despite what Blue Nile says, we don’t see much variation on the cut. Most of the diamonds in the data are ideal cut anyway, so we’ve lost the color pattern that we saw before.
Alter the code below.
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color', reverse = FALSE,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
Color does seem to explain some of the variance in price, just like the clarity variable. Blue Nile however states that the difference between all color grades from D to J are basically not noticeable to the naked eye. Yet, we do see the color difference in the price tag.
In R, we can create models using the lm function. lm(y~x) where y is the outcome variable and x the explanatory variable.
Here, we would use: log10(price) ~ carat^(1/3) Price is the outcome and carat the predictor variable. Using our domain specific knowledge of diamonds and carat weight, we know we must take the cube root of carat weight (volume).
We apply a log transformation to our long tailed dollar variable, and we speculate that the flawless diamond should become exponentially rarer as the volume increases.
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
Calls:
m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
data = diamonds)
m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
clarity, data = diamonds)
=============================================================================
m1 m2 m3 m4 m5
-----------------------------------------------------------------------------
(Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
(0.006) (0.019) (0.019) (0.017) (0.010)
I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
(0.007) (0.032) (0.031) (0.028) (0.016)
carat -1.137*** -1.163*** -0.992*** -1.093***
(0.012) (0.011) (0.010) (0.006)
cut: .L 0.224*** 0.224*** 0.120***
(0.004) (0.004) (0.002)
cut: .Q -0.062*** -0.062*** -0.031***
(0.004) (0.003) (0.002)
cut: .C 0.051*** 0.052*** 0.014***
(0.003) (0.003) (0.002)
cut: ^4 0.018*** 0.018*** -0.002
(0.003) (0.002) (0.001)
color: .L -0.373*** -0.441***
(0.003) (0.002)
color: .Q -0.129*** -0.093***
(0.003) (0.002)
color: .C 0.001 -0.013***
(0.003) (0.002)
color: ^4 0.029*** 0.012***
(0.003) (0.002)
color: ^5 -0.016*** -0.003*
(0.003) (0.001)
color: ^6 -0.023*** 0.001
(0.002) (0.001)
clarity: .L 0.907***
(0.003)
clarity: .Q -0.240***
(0.003)
clarity: .C 0.131***
(0.003)
clarity: ^4 -0.063***
(0.002)
clarity: ^5 0.026***
(0.002)
clarity: ^6 -0.002
(0.002)
clarity: ^7 0.032***
(0.001)
-----------------------------------------------------------------------------
R-squared 0.924 0.935 0.939 0.951 0.984
adj. R-squared 0.924 0.935 0.939 0.951 0.984
sigma 0.280 0.259 0.250 0.224 0.129
F 652012.063 387489.366 138654.523 87959.467 173791.084
p 0.000 0.000 0.000 0.000 0.000
Log-likelihood -7962.499 -3631.319 -1837.416 4235.240 34091.272
Deviance 4242.831 3613.360 3380.837 2699.212 892.214
AIC 15930.999 7270.637 3690.832 -8442.481 -68140.544
BIC 15957.685 7306.220 3761.997 -8317.942 -67953.736
N 53940 53940 53940 53940 53940
=============================================================================
Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.
Our model is: lm(price) = 0.415 + 9.144 x carat^(1/3) - 1.093 x carat + … x cut + … x color + … x clarity) + E
E being the error term
Our data is from 2008. We need to account for inflation, and the diamond market is quite different now than it was. Prices plummeted in 2008 due to the global financial crisis. Since then prices, at least for wholesale polished diamonds, have grown at about 6% per year, compound annual rate.
The rapidly growing number of couples in China buying diamond engagement rings might also explain this increase.
Finally, diamonds prices grew unevenly across different carat sizes since 2008, meaning the model we estimated couldn’t simply be adjusted by inflation.