Linear regression adopts a linear approach to modeling the relationship between a dependent variable (scalar response) and one or more independent variables (explanatory variables).
If you have one explanatory variable, you call it a simple linear regression. If you have more than one independent variable, you refer to the process as multiple linear regression.
Y = f(X)
Y – Dependent variable
X – Independent variable
It is also called “causal analysis”.
Simple Linear Regression
It is the simplest form, with one dependent and one independent variable, and is defined by the formula:
Y = mX + c + e
where m is the slope, c is the intercept and e is the error term.
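For illustration, a simple linear regression can be fitted in R with lm(); the sketch below runs on simulated data, so every name and number in it is purely illustrative.
# R code: fitting a simple linear regression on simulated data
set.seed(42)
x <- rnorm(100)                        # independent variable
y <- 2 * x + 1 + rnorm(100, sd = 0.5)  # Y = mX + c + e with m = 2, c = 1
fit <- lm(y ~ x)                       # least-squares estimates of m and c
coef(fit)                              # recovered intercept (c) and slope (m)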
Multiple Linear Regression
Multiple regression is valuable for quantifying the impact of several simultaneous influences on a single dependent variable. The equation is given as:
Y = a + b1X1 + b2X2 + b3X3 + ... + bkXk + u
where a is the intercept, b1 ... bk are the coefficients and u is the error term.
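The same lm() call extends naturally to several predictors; again, the data below are simulated and purely illustrative.
# R code: fitting a multiple linear regression on simulated data
set.seed(42)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 0.5 * x1 - 2 * x2 + 3 * x3 + rnorm(n)  # Y = a + b1X1 + b2X2 + b3X3 + u
fit <- lm(y ~ x1 + x2 + x3)  # one coefficient per independent variable
summary(fit)                 # coefficients, t-statistics, R square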
Some Assumptions:
· Linearity of the variables.
· Constant variance of the error terms.
· Independence of the error terms.
· Normality of the error term distribution.
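R's built-in diagnostic plots give a quick visual check of these assumptions; a minimal sketch, assuming a fitted lm model such as fit from the sketch above.
# R code: visual checks of the regression assumptions for a fitted model 'fit'
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in one grid
plot(fit)             # residuals vs fitted (linearity, constant variance),
                      # normal Q-Q (normality), scale-location, leverage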
DISADVANTAGES OF LINEAR REGRESSION:
- Sensitive to outliers
- Prone to over/underfitting the model
Important Terms:
Correlation: Correlation signifies the strength of the linear relationship. (It only captures linear relationships.)
Covariance: It is a measure which helps to find out the direction of the relationship between two variables, i.e. what happens to Y when X increases or decreases.
Covariances are very hard to compare. E.g. when you compare height and weight in different units (metre-kg vs. inch-kg), the covariance will differ.
The solution is to normalize the covariance, removing its units (by dividing by the product of the standard deviations) to get values between -1 and 1, which is the correlation.
· It is unit free.
· Ranges between -1 and 1.
o The closer to 1, the stronger the positive linear relationship.
o The closer to -1, the stronger the negative linear relationship.
o The closer to 0, the weaker the linear relationship.
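A small sketch of this unit effect, using simulated height and weight values (all numbers are illustrative):
# R code: covariance depends on units, correlation does not
set.seed(1)
height_m <- rnorm(100, mean = 1.7, sd = 0.1)     # height in metres
weight_kg <- 60 * height_m + rnorm(100, sd = 5)  # weight in kg
height_in <- height_m * 39.37                    # the same heights in inches
cov(height_m, weight_kg); cov(height_in, weight_kg)  # different values
cor(height_m, weight_kg); cor(height_in, weight_kg)  # identical values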
Issues with multicollinearity:
A significant problem in regression analysis arises when the independent variables, or linear combinations of the independent variables, are correlated with each other.
This correlation among the independent variables is called multicollinearity, and it creates problems in conducting t-tests for statistical significance.
The most common method of correcting multicollinearity is to systematically remove independent variables until multicollinearity is minimized.
The presence of multicollinearity is checked with the VIF (Variance Inflation Factor).
(Normally, if VIF < 5 there is no multicollinearity; if VIF > 5, multicollinearity is present.)
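In R, the vif() function from the car package computes this; a minimal sketch, assuming a fitted multiple-regression model such as fit from the sketch above.
# R code: checking multicollinearity with VIF
library(car)
vif(fit)  # values above 5 suggest problematic multicollinearity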
Heteroskedasticity: It means that the error terms are not equally distributed, i.e. they do not have a constant variance. A log transformation of the dependent variable is commonly used to rectify this phenomenon.
When the requirement of a constant variance is violated, we have a condition of heteroskedasticity.
We can diagnose heteroskedasticity by plotting the residuals against the predicted y or by the Breusch-Pagan chi-square test.
R code to detect heteroskedasticity:
# plot residuals against the predicted (fitted) values
plot(model.real$fitted.values, model.real$residuals)
# Breusch-Pagan test
library(lmtest)
bptest(model.real)
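As noted above, a log transformation of the dependent variable often stabilizes the variance. A sketch, assuming the model was fitted on a data frame named real with dependent variable Price (both names are assumptions carried over from the example above):
# R code: refit with a log-transformed response and re-test
model.log <- lm(log(Price) ~ ., data = real)
bptest(model.log)  # re-run the Breusch-Pagan test on the new model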
R square and Adjusted R square:
R square accounts for the variation of all independent variables on the dependent variable. In other words, it credits every independent variable with explaining the variation.
Adjusted R square accounts for the significant variables alone when indicating the percentage of variation explained by the model. By significant, we refer to P values less than 0.05.
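Both statistics are reported by summary() on a fitted model; a minimal sketch, assuming a model named fit.
# R code: extracting both statistics from a fitted lm model 'fit'
summary(fit)$r.squared      # R square
summary(fit)$adj.r.squared  # Adjusted R square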
Outlier: An outlier is an observation point distant from other observations. It might be due to variance in the measurement, or it can indicate an experimental error. Under such circumstances, you need to exclude it from the data set. If you do not detect and treat outliers, they can cause problems in statistical analysis.
There is no strict mathematical calculation for determining an outlier; deciding whether an observation is an outlier is itself a subjective exercise. However, you can detect outliers through various methods. Some are graphical, such as normal probability plots, whereas others are model-based. There are also hybrid techniques such as boxplots. Once you have detected outliers, you should either remove or correct them to ensure accurate analysis. Two common methods for eliminating outliers are the Z-Score and the IQR-Score methods.
# R code for boxplot
boxplot(abc$x, main = 'X',
        sub = paste('Outliers:', toString(boxplot.stats(abc$x)$out)))
## Here abc is the dataset and x is one of its variables.
## You need to repeat this step for each variable and treat those showing outliers.
## Discarding outliers using the winsorizing method
## Here we set a benchmark for the upper outliers: Q3 + 1.5*IQR.
bench <- quantile(data, 0.75) + 1.5 * IQR(data)
data[data > bench] <- bench
## Values going beyond the benchmark are set to the benchmark.
## (The lower tail can be treated symmetrically with Q1 - 1.5*IQR.)
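For the Z-Score method mentioned above, a minimal sketch; the cut-off of 3 is a common convention, not something fixed by the method itself.
# R code: flagging outliers with the Z-Score method
z <- scale(abc$x)  # standardize: (x - mean) / sd
abc$x[abs(z) > 3]  # observations with |z| > 3 are flagged as outliers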
Q-Q Plot
A Q-Q plot is a graphical plotting of the quantiles of two distributions with respect to each other. In other words, you plot quantiles against quantiles. Whenever you interpret a Q-Q plot, you should concentrate on the ‘y = x’ line, also called the 45-degree line in statistics. It entails that each of your distributions has the same quantiles. If you witness a deviation from this line, one of the distributions could be skewed when compared to the other.
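In R, a Q-Q plot of the residuals against the normal distribution takes two lines; a minimal sketch, assuming a fitted model named fit.
# R code: Q-Q plot of model residuals
qqnorm(residuals(fit))  # sample quantiles vs theoretical normal quantiles
qqline(residuals(fit))  # reference line; points should fall close to it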
Detecting the Best Model
We sometimes need to check whether there is any significant difference between the full model (the original model) and a reduced model (the model after excluding variables from the original model). We use the partial F-test to check the significance.
# R code (both models must be fitted lm objects; the names are placeholders)
anova(model.reduced, model.full)
## We can see the RSS, F-statistic and p-value.
## If p is extremely small, then reject H0.
## H0: the models do not significantly differ.
## H1: the full model is better.
## Here we do not reject H0.
## Both models should be nested.
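Putting it together on the illustrative variables from the multiple regression sketch above (all names are assumptions):
# R code: partial F-test between two nested models
model.full <- lm(y ~ x1 + x2 + x3)  # full model
model.reduced <- lm(y ~ x1 + x2)    # reduced (nested) model with x3 dropped
anova(model.reduced, model.full)    # a small p-value favours the full model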
Happy Learning!!!!