# Data Preprocessing and Linear Regression


**Data preprocessing** is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, or lacking in certain behaviors or trends, and it is likely to contain many errors. **Data preprocessing** is a proven method of resolving such issues: it includes cleaning the data and filling in missing values, and it has to handle both numerical and categorical data.
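As a minimal sketch of the missing-value step (the toy columns below are invented for illustration and are not the article's dataset), scikit-learn's SimpleImputer can replace NaN entries with the column mean:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing entries (illustrative values only)
df = pd.DataFrame({
    "experience": [1.0, 2.0, np.nan, 4.0],
    "salary": [40000.0, 45000.0, 50000.0, np.nan],
})

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imputer.fit_transform(df))
```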

In statistics, **linear regression** is a **linear** approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called a simple **linear regression**.
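In its simplest form, the model is a straight line plus an error term: y = β₀ + β₁·x + ε, where β₀ is the intercept, β₁ is the slope estimated from the data, and ε captures the noise that the line cannot explain.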

A **regression** uses the historical relationship between an independent and a dependent variable to predict the future values of the dependent variable. Businesses use **regression** to predict such things as future sales, stock prices, currency exchange rates, and productivity gains resulting from a training program.
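As a rough sketch of that idea (the numbers below are invented purely for illustration), fitting a line to historical values lets you extrapolate to a future one; here the fit is done with NumPy's polyfit rather than the scikit-learn model used in the full example below:

```python
import numpy as np

# Invented historical data: years of experience vs. salary (illustration only)
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40000.0, 45000.0, 52000.0, 58000.0, 63000.0])

# Fit a straight line: salary ≈ slope * years + intercept
slope, intercept = np.polyfit(years, salary, deg=1)

# Use the fitted line to predict the salary for an unseen number of years
print(slope * 7 + intercept)
```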

An example is the following:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the dataset
dataset = pd.read_csv("Salary_Data.csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Fill missing values with the column mean
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
x = imp.fit_transform(x)
y = imp.fit_transform(y.reshape(-1, 1)).reshape(-1)

# Splitting the dataset into the Training set and Test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Regression
reg = LinearRegression()
reg.fit(x_train, y_train)

# Predict the test set values
y_predict = reg.predict(x_test)

# Visualize the Training Data
plt.scatter(x_train, y_train, color="red")
plt.plot(x_train, reg.predict(x_train), color="blue")
plt.title("Linear Regression Salary vs Experience")
plt.xlabel("Experience in Years")
plt.ylabel("Salary")
plt.show()

# Visualize the Testing Data
plt.scatter(x_test, y_test, color="red")
plt.plot(x_train, reg.predict(x_train), color="blue")
plt.title("Testing Linear Regression Salary vs Experience")
plt.xlabel("Experience in Years")
plt.ylabel("Salary")
plt.show()
```

It gives the following output for training and testing:

# Data Preprocessing of Categorical Dataset

**Categorical** variables represent types of data that can be divided into groups. Examples of categorical variables are race, sex, age group, country, and educational level. We will use the OneHotEncoder technique to encode the categorical data in numeric form.
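Before the full example, here is a minimal sketch of what one-hot encoding produces (the country values below are made up for illustration and are not the article's Data.csv): each category becomes its own 0/1 column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative categorical column
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

# Each row becomes a 0/1 vector with a single 1 in its category's column
encoder = OneHotEncoder()
print(encoder.fit_transform(countries).toarray())
```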

The dataset looks like this:

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Data preprocessing for categorical data
cat_dataset = pd.read_csv("Data.csv")
xx = cat_dataset.iloc[:, :-1].values
yy = cat_dataset.iloc[:, -1].values

# Dealing with missing values in the numeric columns
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
xx[:, 1:3] = imp.fit_transform(xx[:, 1:3])

# Dealing with categorical data
ct = ColumnTransformer(
    [("one_hot_encoder", OneHotEncoder(), [0])],  # the column numbers to be transformed (here [0], but can be [0, 1, 3])
    remainder="passthrough",  # leave the rest of the columns untouched
)
xx = np.array(ct.fit_transform(xx), dtype=float)

# Encode the dependent variable
labelencoder_yy = LabelEncoder()
yy = labelencoder_yy.fit_transform(yy)
```

The output dataset looks like this:

Please follow my GitHub account for more useful information about machine learning.

Happy Learning :)