Creation of Machine Learning Model (Logistic Regression)

Course by Isaac González


What is this notebook?

This notebook was written in parallel with Isaac González's Machine Learning Challenge course, which you can find at this link: ML Challenge.

This course is highly recommended both for people who are just starting out and for people who already know machine learning but have not touched R and want to see a complete workflow developed in R with the RStudio IDE. It helped me understand the differences between R and Python, and above all it added to my knowledge a brushstroke of the methodology a professional follows when facing an ML problem.

I have been following the course using Python and exploring its parallel capabilities, since more or less all of the scripts have a 'translation' from R to Python.

So, for anyone who comes across this link, I leave here a summary of Isaac González's three classes, developed in Python instead of R, for completeness. I fervently encourage you to take his course and follow his classes (approx. 3 h) of pure learning.

Course objective

The objective is to analyze a data set from a machine that can fail suddenly. The machine has a whole series of instruments attached to it that monitor temperature, humidity and other measurements. We will analyze the raw data and finally model it, which will let us take input data and predict whether the machine may fail, or what the probability of failure is.

For those of you following Isaac González's course: you will see that I try to avoid 'spoilers' as much as possible, although I have to explain a few things so that the notebook makes sense on its own. You will also see some additions of my own, because I found them interesting.

Importing Libraries

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import numpy as np
%matplotlib inline

Reading the data and first visualization

df = pd.read_csv("DataSetFallosMaquina.csv",sep=';')
df.head()
   Temperature  Humidity   Operator  Measure1  Measure2  ...  Failure
0           67        82  Operator1       291         1  ...       No
1           68        77  Operator1      1180         1  ...      Yes

Data Analysis

Basic Statistics

df.describe()
Temperature Humidity Measure1
count 8784.000000 8784.000000 8784.000000
mean 64.026412 83.337090 1090.900387
std 2.868833 4.836256 537.097769
min 5.000000 65.000000 155.000000
25% 62.000000 80.000000 629.000000
50% 64.000000 83.000000 1096.000000
75% 66.000000 87.000000 1555.000000
max 78.000000 122.000000 2011.000000
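
The display above is truncated to the first numeric columns, and describe() skips the two object columns by default. They can be summarized separately (a small addition; include='object' is a standard pandas option):

# Summary of the non-numeric columns: count, unique values, most frequent value
df.describe(include='object')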

Check whether there are any null values

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8784 entries, 0 to 8783
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Temperature                   8784 non-null   int64 
 1   Humidity                      8784 non-null   int64 
 2   Operator                      8784 non-null   object
 3   Measure1                      8784 non-null   int64 
 4   Measure2                      8784 non-null   int64 
 5   Measure3                      8784 non-null   int64 
 6   Measure4                      8784 non-null   int64 
 7   Measure5                      8784 non-null   int64 
 8   Measure6                      8784 non-null   int64 
 9   Measure7                      8784 non-null   int64 
 10  Measure8                      8784 non-null   int64 
 11  Measure9                      8784 non-null   int64 
 12  Measure10                     8784 non-null   int64 
 13  Measure11                     8784 non-null   int64 
 14  Measure12                     8784 non-null   int64 
 15  Measure13                     8784 non-null   int64 
 16  Measure14                     8784 non-null   int64 
 17  Measure15                     8784 non-null   int64 
 18  Hours Since Previous Failure  8784 non-null   int64 
 19  Failure                       8784 non-null   object
dtypes: int64(18), object(2)
memory usage: 1.3+ MB
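
info() already shows 8784 non-null entries in every column; a direct per-column count of missing values (a small addition, not from the course) makes the check explicit:

# Count missing values per column; every entry is zero in this data set
print(df.isnull().sum())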

Let's look at the distribution of the features

df.hist(figsize=(15,15))
plt.show()

png

We will check whether there is any correlation between variables, in order to drop one of each correlated pair if necessary. In our case it is not necessary, as you can see:

plt.figure(figsize=(10,10))
sns.heatmap(df.corr(method='pearson', numeric_only=True))  # numeric_only skips the two object columns (required in recent pandas)
plt.title('Correlation Matrix')
plt.show()

png
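
As a numeric complement to the heatmap (a minimal sketch; the 0.8 threshold is my own choice, not from the course), we can list any pairs of columns whose absolute correlation is high, which, matching the heatmap, should come out empty here:

# Extract pairs of numeric columns with absolute Pearson correlation above 0.8
corr = df.corr(method='pearson', numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
pairs = upper.stack()
print(pairs[pairs > 0.8])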

More visual elements that can help us understand the behavior of the variables: their distribution functions, outliers, whether they are continuous or categorical, etc.

I will not add many comments here; if you are interested, follow Isaac González's course.

for col in df:
    if df[col].dtype == 'int64':
        df[col].plot(kind='kde',figsize=(10,5),title=col,grid=True)
        plt.show()

png (×18: one density plot per int64 column, from Temperature through Hours Since Previous Failure)

for col in df:
    if df[col].dtype == 'int64':
        df.boxplot(column=col)
        plt.show()

png (×18: one box plot per int64 column)

Data Transformation

Outliers

df.Temperature.plot(kind='kde')
plt.show()

png

sns.boxplot(y='Temperature',data=df)
plt.title('Temperature Distribution')
plt.show()

png

df = df[df.Temperature > 50]  # drop the low-temperature outliers seen above
df.Temperature.plot(kind='kde')
plt.show()

png

sns.boxplot(y='Temperature',data=df)
plt.title('Temperature Distribution')
plt.show()

png
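
The fix above uses a fixed cutoff (Temperature > 50). An alternative worth knowing (my addition, purely illustrative) is the 1.5×IQR rule:

# Alternative outlier rule: keep only values within 1.5*IQR of the quartiles
q1, q3 = df.Temperature.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print('Keeping Temperature values in [{}, {}]'.format(lower, upper))
df_iqr = df[df.Temperature.between(lower, upper)]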

Converting numeric categorical values to strings

df['Measure2'] = df.Measure2.astype(str)
df['Measure3'] = df.Measure3.astype(str)
df.Measure2.hist()
df.Measure3.hist()
plt.title('Categorical values distribution')
plt.show()

png
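
To complement the histograms with exact counts (a small addition):

# Frequency of each category in the two converted columns
print(df.Measure2.value_counts())
print(df.Measure3.value_counts())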

Balancing Data

print('Yes = ','{:.2%}'.format(df['Failure'][df.Failure=='Yes'].count() / df.Failure.count()))
print('No = ','{:.2%}'.format(df['Failure'][df.Failure=='No'].count() / df.Failure.count()))
Yes =  0.92%
No =  99.08%
df_no = df[df.Failure=='No'].sample(frac=0.05,random_state=1234)
df_si = df[df.Failure=='Yes']
df_res = pd.concat([df_no, df_si])  # DataFrame.append was removed in pandas 2.0
print('Yes = ','{:.2%}'.format(df_res['Failure'][df_res.Failure=='Yes'].count() / df_res.Failure.count()))
print('No = ','{:.2%}'.format(df_res['Failure'][df_res.Failure=='No'].count() / df_res.Failure.count()))
Yes =  15.70%
No =  84.30%
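
Undersampling throws data away. An alternative worth mentioning (not part of the course) is to keep every row and let scikit-learn compensate for the imbalance with per-class weights:

# Alternative to undersampling: reweight classes inside the model itself
from sklearn.linear_model import LogisticRegression
weighted_model = LogisticRegression(class_weight='balanced', max_iter=1000)
# weighted_model.fit(...) would then be trained on the full, unbalanced data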

Converting Yes/No values to 1/0 values

df_res.Failure = df_res.Failure.map({'Yes':1,'No':0})
#BACKUP AND CONTINUE
df_backup = df.copy()
df = df_res.copy()

One-hot encoding for categorical values

#Operators
df_Operators = pd.get_dummies(df.Operator)
df.drop('Operator',axis=1,inplace=True)
#Measure 2
df_ms2 = pd.get_dummies(df.Measure2)
for col in df_ms2:
    new_name = 'Measure2_'+ str(col)
    df_ms2.rename(index=str,columns={col: new_name},inplace=True)
df.drop('Measure2',axis=1,inplace=True)

#Measure 3
df_ms3 = pd.get_dummies(df.Measure3)
for col in df_ms3:
    new_name = 'Measure3_'+ str(col)
    df_ms3.rename(index=str,columns={col: new_name},inplace=True)
df.drop('Measure3',axis=1,inplace=True)
#Concatenate
df.reset_index(drop=True, inplace=True)  # reset indexes for a successful concat
df_Operators.reset_index(drop=True, inplace=True)
df_ms2.reset_index(drop=True, inplace=True)
df_ms3.reset_index(drop=True, inplace=True)
df_con = pd.concat([df,df_Operators,df_ms2,df_ms3],axis=1)
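
For reference, pandas can do all of the above in a single call when applied to the balanced frame before the manual drops (a sketch; note that get_dummies prefixes each dummy with its source column name by default, so the resulting names differ slightly from the loop version):

# One-call alternative: encode and prefix the three categorical columns at once
df_con_alt = pd.get_dummies(df_res, columns=['Operator', 'Measure2', 'Measure3'])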

Modeling Logistic Regression

Splitting into training and test data

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
X = df_con.drop('Failure',axis=1)  # use the one-hot encoded frame built above, not the plain df
Y = df_con['Failure']
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3)
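
Given the heavy class imbalance, a stratified split keeps the Yes/No proportions identical in train and test. A minimal variant (stratify and random_state are standard train_test_split options; they were not used for the outputs shown below):

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=1234)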

Defining and fitting the model

logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
print("Logistic Regression",logmodel.get_params())
Logistic Regression {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 
'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 
'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 
'tol': 0.0001, 'verbose': 0, 'warm_start': False}
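
The lbfgs solver can struggle to converge within the default 100 iterations on unscaled features. A common remedy (a sketch, not part of the course) is to standardize inside a Pipeline:

# Optional: standardize the features so the solver converges more reliably
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)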

Accuracy

predictions = logmodel.predict(X_test)
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       135
           1       0.75      0.75      0.75        20

    accuracy                           0.94       155
   macro avg       0.86      0.86      0.86       155
weighted avg       0.94      0.94      0.94       155

conf_matr = confusion_matrix(y_test,predictions)
print(conf_matr)
sns.heatmap(conf_matr,annot=True, cmap="Oranges" ,fmt='g')
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
[[130   5]
 [  5  15]]

png

print('Accuracy = ','{:.2%}'.format(accuracy_score(y_test,predictions)))
Accuracy =  93.55%
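
With an imbalanced target, accuracy alone can be flattering. A threshold-independent metric such as ROC AUC (a small addition, not in the course) complements it:

# ROC AUC scores the predicted probabilities instead of the hard 0/1 labels
from sklearn.metrics import roc_auc_score
probs = logmodel.predict_proba(X_test)[:, 1]
print('ROC AUC = ', '{:.2%}'.format(roc_auc_score(y_test, probs)))
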
#python #RtoPython #LogisticRegression

Requirements

pandas
matplotlib
seaborn
sklearn

Data Origin

csv file (DataSetFallosMaquina.csv)

DS Algorithm/Model

Logistic Regression

Outputs

Model parameters (get_params) and accuracy metrics