Data Preprocessing and Linear Regression

Ayesha Kaleem
Aug 2, 2019


Data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a well-established way of resolving such issues. It includes data cleaning and filling in missing values, and it has to handle both numerical and categorical data.
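
As a rough illustration of filling missing values in both kinds of columns, here is a minimal sketch using pandas. The column names and values are made up for this example and do not come from the datasets used later. It assumes the mean is a sensible fill for the numerical column and the most frequent value for the categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column, each with a gap.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 30.0],
    "city": ["Lahore", "Karachi", "Lahore", None],
})

# Numerical column: replace missing entries with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: replace missing entries with the most frequent value.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

Other fill strategies (median, a constant, dropping the row) may suit a given dataset better. The mean and the mode are just common defaults.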

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called a simple linear regression.
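
To make the idea concrete, here is a small sketch of simple linear regression computed by hand with NumPy, using the standard least-squares formulas for the slope and intercept. The data points and the `predict` helper are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Least-squares estimates for one explanatory variable:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
# intercept = mean_y - slope * mean_x
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean


def predict(x_new):
    """Predict the response for a new value of the explanatory variable."""
    return intercept + slope * x_new


print(slope, intercept, predict(5.0))
```

Because the toy points sit exactly on a line, the fit recovers that line. With noisy real data the same formulas give the line that minimizes the squared error.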

A regression uses the historical relationship between an independent and a dependent variable to predict the future values of the dependent variable. Businesses use regression to predict such things as future sales, stock prices, currency exchange rates, and productivity gains resulting from a training program.

The following example fills missing values in a salary dataset, fits a linear regression, and plots the results:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
dataset = pd.read_csv("Salary_Data.csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
x = imp.fit_transform(x)
y = y.reshape(-1, 1)
y = imp.fit_transform(y)
y = y.reshape(-1)
# Splitting the dataset into the Training set and Test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
# Regression
reg = LinearRegression().fit(x_train, y_train)
# Predict the test set values
y_predict = reg.predict(x_test)
# Visualize the Training Data
plt.scatter(x_train, y_train, color="red")
plt.plot(x_train, reg.predict(x_train), color="blue")
plt.title("Linear Regression Salary vs Experience")
plt.xlabel("Experience in Years")
plt.ylabel("Salary")
plt.show()
# Visualize the Testing Data
plt.scatter(x_test, y_test, color="red")
plt.plot(x_train, reg.predict(x_train), color="blue")
plt.title("Testing Linear Regression Salary vs Experience")
plt.xlabel("Experience in Years")
plt.ylabel("Salary")
plt.show()

It gives the following output for training and testing:

Figure: Training dataset
Figure: Testing dataset

Data Preprocessing of Categorical Dataset

Categorical variables represent data that can be divided into groups. Examples of categorical variables are race, sex, age group, country, and educational level. We will use the OneHotEncoder technique to encode categorical data in numeric form.
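
Before using the library encoder, it may help to see what one-hot encoding produces. The sketch below does it in plain Python on a made-up list of country names. It only illustrates the idea and is not how scikit-learn implements OneHotEncoder.

```python
# Hypothetical categorical column.
countries = ["France", "Spain", "Germany", "Spain"]

# Sort the distinct categories so each one gets a fixed column position.
categories = sorted(set(countries))

# One row per sample: 1.0 in the column of its category, 0.0 elsewhere.
one_hot = [
    [1.0 if value == category else 0.0 for category in categories]
    for value in countries
]

for value, row in zip(countries, one_hot):
    print(value, row)
```

Each category becomes its own 0/1 column, so no artificial ordering is imposed on the groups, which is what would happen if the names were simply replaced by integers.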

The dataset looks like this:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Data preprocessing for categorical data
cat_dataset = pd.read_csv("Data.csv")
xx = pd.DataFrame(cat_dataset.iloc[:, :-1].values)
yy = pd.DataFrame(cat_dataset.iloc[:, -1].values)
# Dealing with missing values in the numeric columns (positions 1 and 2)
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp = imp.fit(xx.iloc[:, 1:3])
xx.iloc[:, 1:3] = imp.transform(xx.iloc[:, 1:3])
# Dealing with categorical data
ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(), [0])],  # The column numbers to be transformed (here [0], but could be [0, 1, 3])
    remainder='passthrough'  # Leave the rest of the columns untouched
)
# np.float was removed from recent NumPy releases; the built-in float is equivalent
xx = np.array(ct.fit_transform(xx), dtype=float)
labelencoder_yy = LabelEncoder()
yy = labelencoder_yy.fit_transform(yy)

The output dataset looks like this:

Please follow my GitHub account for other useful information about machine learning.

Happy Learning :')