This notebook was written in parallel with Isaac González's Machine Learning Challenge course, which you can find at this link: ML Challenge.
The course is highly recommended both for people who are just starting out and for those who already know ML but have not touched R and want to see a complete workflow developed in R with the RStudio IDE. It has helped me understand the differences between R and Python and, above all, added to my knowledge a brushstroke of the methodology a professional follows when facing an ML problem.
I have been following the course in Python, exploring its parallel capabilities, since more or less all of the scripts have a 'translation' from R to Python.
So, in case someone comes across this link, I leave here a summary of Isaac González's three classes developed in Python instead of R, for completeness. I fervently encourage you to take his course and follow his classes (approx. 3 h) of pure learning.
The objective is to analyze a data set from a machine that can fail suddenly. It has a whole series of instruments attached to monitor temperature, humidity, and various other measurements. We will analyze the raw data and finally model it, which will let us predict, from the input data, whether the machine is likely to fail and with what probability.
For those of you following Isaac González's course: you will see that I try to avoid 'spoilers' as much as possible, but I have to explain some things so that the notebook makes sense. You will also see some additions of my own where I found something interesting.
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import numpy as np
%matplotlib inline
df = pd.read_csv("DataSetFallosMaquina.csv",sep=';')
df.head()
|   | Temperature | Humidity | Operator | Measure1 | Measure2 | Failure |
|---|---|---|---|---|---|---|
| 0 | 67 | 82 | Operator1 | 291 | 1 | No |
| 1 | 68 | 77 | Operator1 | 1180 | 1 | Yes |
df.describe()
|   | Temperature | Humidity | Measure1 |
|---|---|---|---|
| count | 8784.000000 | 8784.000000 | 8784.000000 |
| mean | 64.026412 | 83.337090 | 1090.900387 |
| std | 2.868833 | 4.836256 | 537.097769 |
| min | 5.000000 | 65.000000 | 155.000000 |
| 25% | 62.000000 | 80.000000 | 629.000000 |
| 50% | 64.000000 | 83.000000 | 1096.000000 |
| 75% | 66.000000 | 87.000000 | 1555.000000 |
| max | 78.000000 | 122.000000 | 2011.000000 |
Check whether there are any null values:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8784 entries, 0 to 8783
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Temperature 8784 non-null int64
1 Humidity 8784 non-null int64
2 Operator 8784 non-null object
3 Measure1 8784 non-null int64
4 Measure2 8784 non-null int64
5 Measure3 8784 non-null int64
6 Measure4 8784 non-null int64
7 Measure5 8784 non-null int64
8 Measure6 8784 non-null int64
9 Measure7 8784 non-null int64
10 Measure8 8784 non-null int64
11 Measure9 8784 non-null int64
12 Measure10 8784 non-null int64
13 Measure11 8784 non-null int64
14 Measure12 8784 non-null int64
15 Measure13 8784 non-null int64
16 Measure14 8784 non-null int64
17 Measure15 8784 non-null int64
18 Hours Since Previous Failure 8784 non-null int64
19 Failure 8784 non-null object
dtypes: int64(18), object(2)
memory usage: 1.3+ MB
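df.info() already confirms there are no nulls; as a quick complementary check (my addition, not from the course), a per-column count works too:
df.isnull().sum()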
Let's look at the distribution of each feature:
df.hist(figsize=(15,15))
plt.show()
We will check whether there is any correlation between variables, so we can drop one of each correlated pair if necessary. In our case it is not necessary, as you can see:
plt.figure(figsize=(10,10))
sns.heatmap(df.select_dtypes('number').corr(method='pearson'))  # numeric columns only; recent pandas raises on object columns
plt.title('Correlation Matrix')
plt.show()
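If the heatmap had revealed strongly correlated features, a small sketch like this one (my addition; the 0.9 threshold is an arbitrary assumption) would list the candidate pairs to drop:
corr = df.select_dtypes('number').corr(method='pearson').abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair only once
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.9])  # empty for this data set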
More visual elements that can help us understand the behavior of the variables: their distribution functions, outliers, whether they are continuous or categorical, etc.
I will not add many comments here; if you are interested, follow Isaac González's course.
for col in df:
    if df[col].dtype == 'int64':
        df[col].plot(kind='kde', figsize=(10,5), title=col, grid=True)
        plt.show()

for col in df:
    if df[col].dtype == 'int64':
        df.boxplot(column=col)
        plt.show()
df.Temperature.plot(kind='kde')
plt.show()
sns.boxplot(y='Temperature',data=df)
plt.title('Temperature Distribution')
plt.show()
df = df[df.Temperature > 50]  # drop low-temperature outliers
df.Temperature.plot(kind='kde')
plt.show()
sns.boxplot(y='Temperature',data=df)
plt.title('Temperature Distribution')
plt.show()
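The cut at 50 comes from eyeballing the plot; a common alternative (my addition, just a sketch) is the 1.5×IQR rule, which flags outliers without a hand-picked threshold. Run here, after the fixed cut, it would flag whatever extreme values remain:
q1, q3 = df.Temperature.quantile([0.25, 0.75])
iqr = q3 - q1  # interquartile range
mask = df.Temperature.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print((~mask).sum(), 'rows flagged as outliers by the IQR rule')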
df['Measure2'] = df.Measure2.astype(str)  # treat as categorical: only a few discrete levels
df['Measure3'] = df.Measure3.astype(str)
df.Measure2.hist()
df.Measure3.hist()
plt.title('Categorical values distribution')
plt.show()
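A quick value_counts (my addition) confirms that Measure2 and Measure3 take only a handful of discrete levels, which is why treating them as categorical makes sense:
print(df.Measure2.value_counts().sort_index())
print(df.Measure3.value_counts().sort_index())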
print('Yes = ','{:.2%}'.format(df['Failure'][df.Failure=='Yes'].count() / df.Failure.count()))
print('No = ','{:.2%}'.format(df['Failure'][df.Failure=='No'].count() / df.Failure.count()))
Yes = 0.92%
No = 99.08%
df_no = df[df.Failure=='No'].sample(frac=0.05, random_state=1234)  # undersample the majority class
df_si = df[df.Failure=='Yes']
df_res = pd.concat([df_no, df_si])  # DataFrame.append was removed in pandas 2.0
print('Yes = ','{:.2%}'.format(df_res['Failure'][df_res.Failure=='Yes'].count() / df_res.Failure.count()))
print('No = ','{:.2%}'.format(df_res['Failure'][df_res.Failure=='No'].count() / df_res.Failure.count()))
Yes = 15.70%
No = 84.30%
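Undersampling the majority class is the approach followed here; as an alternative sketch (my addition, not part of the course flow), scikit-learn's LogisticRegression can instead reweight the classes while keeping the full data set:
from sklearn.linear_model import LogisticRegression
# class_weight='balanced' weights each class inversely to its frequency,
# so the minority 'Yes' class is not drowned out by the majority 'No' class
weighted_model = LogisticRegression(class_weight='balanced', max_iter=1000)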
df_res.Failure = df_res.Failure.map({'Yes':1,'No':0})
#BACKUP AND CONTINUE
df_backup = df.copy()
df = df_res.copy()
#Operators
df_Operators = pd.get_dummies(df.Operator)
df.drop('Operator',axis=1,inplace=True)
#Measure 2
df_ms2 = pd.get_dummies(df.Measure2)
for col in df_ms2:
    new_name = 'Measure2_' + str(col)
    df_ms2.rename(columns={col: new_name}, inplace=True)
df.drop('Measure2',axis=1,inplace=True)
#Measure 3
df_ms3 = pd.get_dummies(df.Measure3)
for col in df_ms3:
    new_name = 'Measure3_' + str(col)
    df_ms3.rename(columns={col: new_name}, inplace=True)
df.drop('Measure3',axis=1,inplace=True)
#Concatenate
df.reset_index(drop=True, inplace=True)  # reset indexes for a successful concat
df_Operators.reset_index(drop=True, inplace=True)
df_ms2.reset_index(drop=True, inplace=True)
df_ms3.reset_index(drop=True, inplace=True)
df_con = pd.concat([df,df_Operators,df_ms2,df_ms3],axis=1)
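For reference, the same encoding can be done in a single call (my addition): pd.get_dummies accepts a columns argument and prefixes each dummy with its source column, so the result is equivalent to df_con except that the Operator dummies get names like Operator_Operator1:
df_con_alt = pd.get_dummies(df_res, columns=['Operator', 'Measure2', 'Measure3']).reset_index(drop=True)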
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
X = df_con.drop('Failure', axis=1)  # use df_con, which includes the one-hot columns built above
Y = df_con['Failure']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
print("Logistic Regression",logmodel.get_params())
Logistic Regression {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True,
'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto',
'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs',
'tol': 0.0001, 'verbose': 0, 'warm_start': False}
predictions = logmodel.predict(X_test)
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       135
           1       0.75      0.75      0.75        20

    accuracy                           0.94       155
   macro avg       0.86      0.86      0.86       155
weighted avg       0.94      0.94      0.94       155
conf_matr = confusion_matrix(y_test,predictions)
print(conf_matr)
sns.heatmap(conf_matr,annot=True, cmap="Oranges" ,fmt='g')
plt.tight_layout()
plt.title('Confusion matrix')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
[[130   5]
 [  5  15]]
print('Accuracy = ','{:.2%}'.format(accuracy_score(y_test,predictions)))
Accuracy = 93.55%
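Accuracy alone can flatter a model when classes are imbalanced; as a final sketch (my addition), ROC AUC scores the predicted failure probabilities rather than the hard labels:
from sklearn.metrics import roc_auc_score
probs = logmodel.predict_proba(X_test)[:, 1]  # probability of class 1 (failure)
print('ROC AUC = ', '{:.3f}'.format(roc_auc_score(y_test, probs)))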
| Library |
|---|
| pandas |
| matplotlib |
| seaborn |
| sklearn |

| Data Origin |
|---|
| csv |

| Model |
|---|
| Logistic Regression |

| Function parameters |
|---|
| Accuracy Values |