Linear Regression: Theory

LINEAR REGRESSION



Linear regression adopts a linear approach to modeling the relationship between a dependent variable (the scalar response) and one or more independent variables (the explanatory variables).

When there is one explanatory variable, it is called simple linear regression.
When there is more than one independent variable, the process is referred to as multiple linear regression.

Y = f(X)
Y – Dependent variable
X – Independent variable

It is also called “causal analysis”.


Simple Linear Regression

It is the simplest form, with one dependent and one independent variable, and is defined by the formula:
Y = mX + c + e
where m is the slope, c is the intercept and e is the error term.
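
As a quick sketch, a simple linear regression can be fitted in R with lm(); the data frame and column names below (df, y, x) are hypothetical placeholders, not from the notes above.

# R code for a simple linear regression (sketch)
model_simple <- lm(y ~ x, data = df)   # df: hypothetical data frame with columns y and x
summary(model_simple)                  # estimated slope (m), intercept (c) and residual error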


Multiple Linear Regression
Multiple regression is valuable for quantifying the impact of several simultaneous influences upon a single dependent variable. The equation is given as:

Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

where a is the intercept, b1 ... bt are the coefficients and u is the error term.
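
As a sketch, the same lm() call extends to several predictors; df, x1, x2 and x3 below are hypothetical names.

# R code for a multiple linear regression (sketch)
model_multi <- lm(y ~ x1 + x2 + x3, data = df)   # coefficients correspond to b1, b2, b3 above
summary(model_multi)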



Some Assumptions:
· Linearity of the variables.
· Constant variance of the error terms.
· Independence of the error terms.
· Normality of the error term distribution.
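
These assumptions can be inspected visually in R: calling plot() on a fitted lm() object produces the standard diagnostic plots. The sketch below assumes a hypothetical fitted model called model_multi.

# R code to inspect the regression assumptions (sketch)
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(model_multi)      # residuals vs fitted (linearity, constant variance), normal Q-Q (normality), etc.
par(mfrow = c(1, 1))   # reset the plotting layout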


DISADVANTAGES OF LINEAR REGRESSION:
  • Sensitive to outliers
  • Prone to over-fitting or under-fitting the model


Important Terms:

Correlation: Correlation signifies the strength of the linear relationship between two variables. (It only captures linear relationships.)

Covariance: It is a measure that helps to find the direction of the relationship between two variables, i.e. what happens to Y when X increases or decreases.
Covariance values are hard to compare across units. E.g. when you compare height and weight measured in different units (metre-kg vs inch-kg), the covariance will differ even though the data are the same.

The solution is to normalize the covariance by removing its units, giving a value between -1 and 1: this normalized measure is the correlation.
· It is unit free.
· Ranges between -1 and 1.
o The closer to 1, the stronger the positive linear relationship.
o The closer to -1, the stronger the negative linear relationship.
o The closer to 0, the weaker the linear relationship.
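
A short sketch in R makes the contrast concrete: covariance changes with the measurement units, while correlation does not. The height and weight vectors below are made-up illustrative values.

# R code comparing covariance and correlation (sketch)
height_m  <- c(1.60, 1.70, 1.80, 1.90)   # heights in metres (illustrative values)
weight_kg <- c(55, 65, 72, 84)           # weights in kg (illustrative values)

cov(height_m, weight_kg)                 # depends on the units used
cov(height_m * 39.37, weight_kg)         # same data with height in inches: covariance changes
cor(height_m, weight_kg)                 # unit free, always between -1 and 1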




Multi-Collinearity: The independent variables in the dataset should not exhibit multi-collinearity; if they do, it should be kept to the barest minimum. How much is acceptable depends on the domain requirement.

Issues with multi-collinearity-

A significant problem in regression analysis arises when the independent variables, or linear combinations of the independent variables, are correlated with each other.
This correlation among the independent variables is called multicollinearity, and it creates problems when using the t-statistic to test for statistical significance.
The most common method of correcting multicollinearity is to systematically remove independent variables until the multicollinearity is minimized.
The presence of multicollinearity is checked with the VIF (Variance Inflation Factor).

(Normally, if VIF < 5 there is no serious multicollinearity; if VIF > 5, multicollinearity is present.)
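
In R, VIF values can be obtained with the vif() function from the car package; model_multi below stands for a hypothetical multiple regression fitted with lm().

# R code to check VIF (sketch)
library(car)
vif(model_multi)   # values above roughly 5 suggest problematic multicollinearity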


Homoscedasticity: It means that the error terms have equal (constant) variance across all levels of the independent variables.

Heteroscedasticity: It means that the error terms do not have equal variance. A log transformation of the dependent variable is a common way to rectify this.

When the requirement of constant variance is violated, we have a condition of heteroscedasticity.
We can diagnose heteroscedasticity by plotting the residuals against the predicted y, or with the Breusch-Pagan chi-square test.

R code to detect heteroscedasticity:

#plot the residuals against the fitted (predicted) values
#model.real is the fitted lm() model
plot(fitted(model.real),
     resid(model.real))

#Breusch-Pagan test
library(lmtest)
bptest(model.real)


R-square and Adjusted R-square:


R-square accounts for the variation in the dependent variable explained by all of the independent variables together; in other words, it considers every independent variable when explaining the variation.
Adjusted R-square accounts for the significant variables alone when indicating the percentage of variation explained by the model, penalising predictors that add little explanatory power. By significant, we refer to p-values less than 0.05.
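
Both quantities can be read off a fitted model in R; model_multi below is a hypothetical lm() object.

# R code to extract R-square and adjusted R-square (sketch)
summary(model_multi)$r.squared       # variation explained by all predictors together
summary(model_multi)$adj.r.squared   # penalised for predictors that add little explanatory power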



Outlier: An outlier is an observation point that lies distant from the other observations. It might be due to variability in the measurement, or it can indicate an experimental error. Under such circumstances, you need to exclude it from the data set. If you do not detect and treat outliers, they can cause problems in statistical analysis.

Example: in the original illustration, the value 3 lies far from the other observations and is the outlier.

There is no strict mathematical calculation for determining an outlier; deciding whether an observation is an outlier is itself a subjective exercise. However, you can detect outliers through various methods: some are graphical, such as normal probability plots, some are model-based, and there are hybrid techniques such as boxplots. Once you have detected outliers, you should either remove or correct them to ensure an accurate analysis. Common methods for eliminating outliers are the Z-Score and IQR-Score methods; both are sketched below along with the boxplot.

# R code for Boxplot 
boxplot(abc$x, main = 'X',
        sub = paste('Outliers:',
                    paste(boxplot.stats(abc$x)$out, collapse = ', ')))

##Here abc is the dataset and x is one of its variables.
##Repeat this step for each variable and treat those showing outliers.

##Capping outliers using the Winsorizing method
##('data' here is the numeric variable being treated, e.g. abc$x)
bench <- quantile(data, 0.75) + 1.5 * IQR(data)   # upper fence: Q3 + 1.5*IQR
data[data > bench] <- bench

##Here we set a benchmark for the outliers:
##values going beyond that benchmark are capped at the benchmark.
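
The Z-Score method mentioned above works similarly; this sketch reuses the hypothetical abc$x variable and a commonly used cutoff of 3.

##Z-Score method (sketch)
z <- (abc$x - mean(abc$x)) / sd(abc$x)   # standardize the variable
abc$x[abs(z) > 3]                        # observations flagged as outliers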


Q-Q plot
A Q-Q plot is a graphical plot of the quantiles of two distributions against each other.
In other words, you plot quantiles against quantiles. Whenever you interpret a Q-Q plot, you should concentrate on the ‘y = x’ line, also called the 45-degree line in statistics. It entails that each of your distributions has the same quantiles. If you witness a deviation from this line, one of the distributions could be skewed when compared to the other.
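
In R, a normal Q-Q plot of the residuals is usually drawn with qqnorm() and qqline(); model_multi below is a hypothetical lm() object.

# R code for a Q-Q plot of the residuals (sketch)
qqnorm(resid(model_multi))   # residual quantiles vs theoretical normal quantiles
qqline(resid(model_multi))   # reference line; points close to it suggest normality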




Detecting the best model

We sometimes need to check whether there is any significant difference between the full model (the original model) and the reduced model (the model obtained after excluding variables from the original model).

We use a partial F-test to check this significance.









#R code

##'reduced_model' and 'full_model' are hypothetical nested lm() objects
anova(reduced_model, full_model)


##The output shows the RSS, F-statistic and p-value
##if p is extremely small, then reject H0
##H0 : the models do not significantly differ (the reduced model is adequate)
##H1 : the full model is better
##if p is large, we do not reject H0 and keep the simpler reduced model

##both models must be nested



Happy Learning!!!!
