# Data Preprocessing and Linear Regression

Data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues: it includes cleaning the data and filling in missing values, and it must handle both numerical and categorical data.
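As a minimal sketch (with hypothetical values), scikit-learn's `SimpleImputer` can fill numeric gaps with the column mean and categorical gaps with the most frequent value:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric column with a missing value: fill with the column mean
num = np.array([[1.0], [np.nan], [3.0]])
num_filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(num)
# The NaN becomes (1.0 + 3.0) / 2 = 2.0

# Categorical column with a missing value: fill with the most frequent value
cat = np.array([["red"], ["blue"], [None], ["blue"]], dtype=object)
cat_filled = SimpleImputer(missing_values=None, strategy="most_frequent").fit_transform(cat)
# The None becomes "blue"
```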

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called a simple linear regression.
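For instance, simple linear regression fits a straight line y = b0 + b1·x by least squares. A minimal sketch with made-up points that lie exactly on a line:

```python
import numpy as np

# Toy data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Least-squares estimates of the slope and intercept
slope, intercept = np.polyfit(x, y, deg=1)
# slope is (approximately) 2.0 and intercept 1.0
```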

A regression uses the historical relationship between an independent and a dependent variable to predict the future values of the dependent variable. Businesses use regression to predict such things as future sales, stock prices, currency exchange rates, and productivity gains resulting from a training program.

An example is the following:

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

dataset = pd.read_csv("Salary_Data.csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Fill missing values with the column mean
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
x = imp.fit_transform(x)
y = imp.fit_transform(y.reshape(-1, 1)).reshape(-1)

# Splitting the dataset into the Training set and Test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# Regression
reg = LinearRegression()
reg.fit(x_train, y_train)

# Predict on the test set
y_predict = reg.predict(x_test)

# Visualize the Training Data
plt.scatter(x_train, y_train, color="red")
plt.plot(x_train, reg.predict(x_train), color="blue")
plt.title("Linear Regression Salary vs Experience")
plt.xlabel("Experience in Years")
plt.ylabel("Salary")
plt.show()

# Visualize the Testing Data
plt.scatter(x_test, y_test, color="red")
plt.plot(x_train, reg.predict(x_train), color="blue")
plt.title("Testing Linear Regression Salary vs Experience")
plt.xlabel("Experience in Years")
plt.ylabel("Salary")
plt.show()
```

It produces the following plots for the training and test sets:

# Data Preprocessing of Categorical Dataset

Categorical variables represent types of data that may be divided into groups. Examples of categorical variables are race, sex, age group, country, and education level. We will use the OneHotEncoder technique to encode the categorical data in numeric form.
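As a minimal sketch of what one-hot encoding does (with a hypothetical color column), each category becomes its own 0/1 indicator column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

enc = OneHotEncoder()
onehot = enc.fit_transform(colors).toarray()  # densify the sparse result
# Each row becomes an indicator vector with one column per category;
# the categories are ordered alphabetically: blue, green, red
```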

The code for preprocessing such a dataset is the following:

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Data preprocessing for categorical data
cat_dataset = pd.read_csv("Data.csv")
xx = cat_dataset.iloc[:, :-1].values
yy = cat_dataset.iloc[:, -1].values

# Dealing with missing values in the numeric columns (columns 1 and 2)
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
xx[:, 1:3] = imp.fit_transform(xx[:, 1:3])

# Dealing with categorical data
ct = ColumnTransformer(
    # The column numbers to be transformed (here [0], but can be e.g. [0, 1, 3])
    [("one_hot_encoder", OneHotEncoder(), [0])],
    remainder="passthrough",  # leave the rest of the columns untouched
)
xx = np.array(ct.fit_transform(xx), dtype=float)

# Encode the target labels as integers
labelencoder_yy = LabelEncoder()
yy = labelencoder_yy.fit_transform(yy)
```

The encoded output dataset then looks like this: