PROJECT DESIGN

In this case we will be working for a photovoltaic solar power generation company.

Anomalous behaviors have been detected in 2 of the plants and the maintenance subcontractor is unable to identify the reason.

Before deploying a team of engineers, they ask the data science team to analyze the data from the sensors and meters to see if we can detect the problem.

In this case, among other things, we are going to learn:

  • how these types of solar plants work
  • analysis to be carried out on datasets where the time variable is very important
  • how to approach analysis in projects where data is collected by sensors or meters

Therefore, much of what we learn here is generally applicable to industry and IoT projects:

  • factory production analysis
  • other types of energy
  • smart cities
  • IoT in agriculture
  • etc.

OBJECTIVE

Analyze the available data to try to intuit where the problems may be and whether or not it is necessary to send a team of engineers to the plants.

LEVERS

In this type of project in which there is a clear process, the most IMPORTANT part is to know and understand that process.

Let's see, for example, how in this case, which seems easy a priori due to the apparent simplicity of the data, if we do not design the project guided by the process, we could get into an infinite loop of analysis without getting anywhere.

Once we understand how the business and the process work, the levers will come out on their own.

How does a photovoltaic solar power plant work?

png

Therefore, the levers that influence the business objective (in this case, generate AC current) are:

  1. Irradiation: the greater the irradiation, the greater the DC generated. But it is not monotonic, from certain values higher temperatures can reduce the generation capacity
  2. Condition of the panels: they must be clean and in good working order to generate the most DC energy possible
  3. Inverter efficiency: there is always a loss in the transformation from DC to AC, but it must be as little as possible. They must also be in proper condition and working order.
  4. Meters and sensors: if they break down and do not measure well, we lose traceability and the possibility of detecting failures

KPI's

  • Irradiation: measures the solar energy that arrives
  • Ambient and module temperature: measured by plant sensors in degrees Celsius
  • DC power: measure the kW of direct current
  • AC power: measure the kW of alternating current
  • Inverter efficiency (we will create it): measures the capacity of transformation from DC to AC. It is calculated as AC/DC * 100

ENTITIES AND DATA

To determine the entities it is necessary to know what a solar plant is composed of.

The minimum unit is the cell, it is there where the generation of energy by reaction with the photons of the sun takes place.

The cells are encapsulated in "rectangles" called modules.

Several modules form a panel.

The panels are arranged in rows called arrays.

An inverter receives direct current from several arrays.

A plant can have several inverters.

There are also the meters and sensors, which may be one or more.

jpeg

In our case, the entities that we have in the granularity of the data are:

  • 15-minute windows over a 34-day period
  • Plants: there are 2
  • Inverters: several per plant
  • Only one irradiation sensor per plant
  • Only one room temperature sensor per floor
  • Only one module temperature sensor per floor

This conditions that we will be able to know, for example, if an inverter in a plant has lower performance than expected, but we will not know which array, panel or module may be causing it.

SEED QUESTIONS

Having understood the levers, kpis and entities, we can now ask the seed questions:

About irradiation:

  • Is there enough irradiation every day?
  • Is it similar on both floors?
  • How is your hourly distribution?
  • How is it related to ambient temperature and module temperature?

About the plants:

  • Does the same amount of irradiation reach them?
  • Do they have a similar number of inverters?
  • Do they generate similar amounts of DC?
  • Do they generate a similar amount of AC?

About the DC generation:

  • What is the relationship between irradiation and DC generation?
  • Is it ever affected by ambient or module temperature?
  • Is it similar on both floors?
  • How is it distributed throughout the day?
  • Is it constant throughout the days?
  • Is it constant in all inverters?
  • Have there been moments of failure?

About AC generation:

  • What is the relationship between DC and AC generation?
  • Is it similar on both floors?
  • How is it distributed throughout the day?
  • Is it constant throughout the days?
  • Is it constant in all inverters?
  • Have there been moments of failure?

About meters and sensors:

  • Are the irradiation data reliable?
  • Is the temperature data reliable?
  • Is the DC data reliable?
  • Are the CA data reliable?
  • Are the data similar between the two plants?

DATA QUALITY AND ANALYTICAL DATAMART CREATION

SET UP

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#Automcompletar rápido
%config IPCompleter.greedy=True

import fiser_tools as fs
fs.misc.dark_theme()

DATA UPLOAD

This case consists of 4 files:

  • Plant 1, generation data
  • plant 1, environmental sensor data
  • Plant 2, generation data
  • Plant 2, environmental sensor data

Load data plant 1 - generation data

p1g = pd.read_csv('Plant_1_Generation_Data.csv')
p1g
DATE_TIME PLANT_ID SOURCE_KEY DC_POWER AC_POWER DAILY_YIELD TOTAL_YIELD
0 15-05-2020 00:00 4135001 1BY6WEcLGh8j5v7 0.0 0.0 0.000 6259559.0
1 15-05-2020 00:00 4135001 1IF53ai7Xc0U56Y 0.0 0.0 0.000 6183645.0
2 15-05-2020 00:00 4135001 3PZuoBAID5Wc2HD 0.0 0.0 0.000 6987759.0
3 15-05-2020 00:00 4135001 7JYdWkrLSPkdwr4 0.0 0.0 0.000 7602960.0
4 15-05-2020 00:00 4135001 McdE0feGgRqW7Ca 0.0 0.0 0.000 7158964.0
... ... ... ... ... ... ... ...
68773 17-06-2020 23:45 4135001 uHbuxQJl8lW7ozc 0.0 0.0 5967.000 7287002.0
68774 17-06-2020 23:45 4135001 wCURE6d3bPkepu2 0.0 0.0 5147.625 7028601.0
68775 17-06-2020 23:45 4135001 z9Y9gH1T5YWrNuG 0.0 0.0 5819.000 7251204.0
68776 17-06-2020 23:45 4135001 zBIq5rxdHJRwDNY 0.0 0.0 5817.000 6583369.0
68777 17-06-2020 23:45 4135001 zVJPv84UY57bAof 0.0 0.0 5910.000 7363272.0

68778 rows × 7 columns

Load data plant 1 - environmental sensor data

p1w = pd.read_csv('Plant_1_Weather_Sensor_Data.csv')
p1w
DATE_TIME PLANT_ID SOURCE_KEY AMBIENT_TEMPERATURE MODULE_TEMPERATURE IRRADIATION
0 2020-05-15 00:00:00 4135001 HmiyD2TTLFNqkNe 25.184316 22.857507 0.0
1 2020-05-15 00:15:00 4135001 HmiyD2TTLFNqkNe 25.084589 22.761668 0.0
2 2020-05-15 00:30:00 4135001 HmiyD2TTLFNqkNe 24.935753 22.592306 0.0
3 2020-05-15 00:45:00 4135001 HmiyD2TTLFNqkNe 24.846130 22.360852 0.0
4 2020-05-15 01:00:00 4135001 HmiyD2TTLFNqkNe 24.621525 22.165423 0.0
... ... ... ... ... ... ...
3177 2020-06-17 22:45:00 4135001 HmiyD2TTLFNqkNe 22.150570 21.480377 0.0
3178 2020-06-17 23:00:00 4135001 HmiyD2TTLFNqkNe 22.129816 21.389024 0.0
3179 2020-06-17 23:15:00 4135001 HmiyD2TTLFNqkNe 22.008275 20.709211 0.0
3180 2020-06-17 23:30:00 4135001 HmiyD2TTLFNqkNe 21.969495 20.734963 0.0
3181 2020-06-17 23:45:00 4135001 HmiyD2TTLFNqkNe 21.909288 20.427972 0.0

3182 rows × 6 columns

Load data plant 2 - generation data

p2g = pd.read_csv('Plant_2_Generation_Data.csv')
p2g
DATE_TIME PLANT_ID SOURCE_KEY DC_POWER AC_POWER DAILY_YIELD TOTAL_YIELD
0 2020-05-15 00:00:00 4136001 4UPUqMRk7TRMgml 0.0 0.0 9425.000000 2.429011e+06
1 2020-05-15 00:00:00 4136001 81aHJ1q11NBPMrL 0.0 0.0 0.000000 1.215279e+09
2 2020-05-15 00:00:00 4136001 9kRcWv60rDACzjR 0.0 0.0 3075.333333 2.247720e+09
3 2020-05-15 00:00:00 4136001 Et9kgGMDl729KT4 0.0 0.0 269.933333 1.704250e+06
4 2020-05-15 00:00:00 4136001 IQ2d7wF4YD8zU1Q 0.0 0.0 3177.000000 1.994153e+07
... ... ... ... ... ... ... ...
67693 2020-06-17 23:45:00 4136001 q49J1IKaHRwDQnt 0.0 0.0 4157.000000 5.207580e+05
67694 2020-06-17 23:45:00 4136001 rrq4fwE8jgrTyWY 0.0 0.0 3931.000000 1.211314e+08
67695 2020-06-17 23:45:00 4136001 vOuJvMaM2sgwLmb 0.0 0.0 4322.000000 2.427691e+06
67696 2020-06-17 23:45:00 4136001 xMbIugepa2P7lBB 0.0 0.0 4218.000000 1.068964e+08
67697 2020-06-17 23:45:00 4136001 xoJJ8DcxJEcupym 0.0 0.0 4316.000000 2.093357e+08

67698 rows × 7 columns

Data upload plant 2 - environmental sensor data

p2w = pd.read_csv('Plant_2_Weather_Sensor_Data.csv')
p2w
DATE_TIME PLANT_ID SOURCE_KEY AMBIENT_TEMPERATURE MODULE_TEMPERATURE IRRADIATION
0 2020-05-15 00:00:00 4136001 iq8k7ZNt4Mwm3w0 27.004764 25.060789 0.0
1 2020-05-15 00:15:00 4136001 iq8k7ZNt4Mwm3w0 26.880811 24.421869 0.0
2 2020-05-15 00:30:00 4136001 iq8k7ZNt4Mwm3w0 26.682055 24.427290 0.0
3 2020-05-15 00:45:00 4136001 iq8k7ZNt4Mwm3w0 26.500589 24.420678 0.0
4 2020-05-15 01:00:00 4136001 iq8k7ZNt4Mwm3w0 26.596148 25.088210 0.0
... ... ... ... ... ... ...
3254 2020-06-17 22:45:00 4136001 iq8k7ZNt4Mwm3w0 23.511703 22.856201 0.0
3255 2020-06-17 23:00:00 4136001 iq8k7ZNt4Mwm3w0 23.482282 22.744190 0.0
3256 2020-06-17 23:15:00 4136001 iq8k7ZNt4Mwm3w0 23.354743 22.492245 0.0
3257 2020-06-17 23:30:00 4136001 iq8k7ZNt4Mwm3w0 23.291048 22.373909 0.0
3258 2020-06-17 23:45:00 4136001 iq8k7ZNt4Mwm3w0 23.202871 22.535908 0.0

3259 rows × 6 columns

DATA QUALITY

Plant quality 1 - generation data

We start with the overview.

p1g.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68778 entries, 0 to 68777
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE_TIME    68778 non-null  object 
 1   PLANT_ID     68778 non-null  int64  
 2   SOURCE_KEY   68778 non-null  object 
 3   DC_POWER     68778 non-null  float64
 4   AC_POWER     68778 non-null  float64
 5   DAILY_YIELD  68778 non-null  float64
 6   TOTAL_YIELD  68778 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.7+ MB

We see that there are no nulls.

We see that DATE_TIME is as object.

We convert DATE_TIME to type datetime.

p1g['DATE_TIME'] = pd.to_datetime(p1g.DATE_TIME,dayfirst=True)

We reviewed a sample of data.

p1g.head()
DATE_TIME PLANT_ID SOURCE_KEY DC_POWER AC_POWER DAILY_YIELD TOTAL_YIELD
0 2020-05-15 4135001 1BY6WEcLGh8j5v7 0.0 0.0 0.0 6259559.0
1 2020-05-15 4135001 1IF53ai7Xc0U56Y 0.0 0.0 0.0 6183645.0
2 2020-05-15 4135001 3PZuoBAID5Wc2HD 0.0 0.0 0.0 6987759.0
3 2020-05-15 4135001 7JYdWkrLSPkdwr4 0.0 0.0 0.0 7602960.0
4 2020-05-15 4135001 McdE0feGgRqW7Ca 0.0 0.0 0.0 7158964.0

We check that the plant identifier is unique.

p1g.PLANT_ID.unique()
array([4135001], dtype=int64)

Let's replace it with a more readable literal.

p1g['PLANT_ID'] = p1g.PLANT_ID.replace(4135001, 'p1')

We review the descriptions.

p1g.describe().T
count mean std min 25% 50% 75% max
DC_POWER 68778.0 3.147426e+03 4036.457169 0.0 0.000000e+00 4.290000e+02 6.366964e+03 14471.125
AC_POWER 68778.0 3.078028e+02 394.396439 0.0 0.000000e+00 4.149375e+01 6.236187e+02 1410.950
DAILY_YIELD 68778.0 3.295969e+03 3145.178309 0.0 0.000000e+00 2.658714e+03 6.274000e+03 9163.000
TOTAL_YIELD 68778.0 6.978712e+06 416271.982856 6183645.0 6.512003e+06 7.146685e+06 7.268706e+06 7846821.000

Let's remove the scientific notation display.

pd.options.display.float_format = '{:15.2f}'.format
p1g.describe().T
count mean std min 25% 50% 75% max
DC_POWER 68778.00 3147.43 4036.46 0.00 0.00 429.00 6366.96 14471.12
AC_POWER 68778.00 307.80 394.40 0.00 0.00 41.49 623.62 1410.95
DAILY_YIELD 68778.00 3295.97 3145.18 0.00 0.00 2658.71 6274.00 9163.00
TOTAL_YIELD 68778.00 6978711.76 416271.98 6183645.00 6512002.54 7146685.00 7268705.91 7846821.00

The difference in means between DC and AC is strange.

Let's visualize it.

p1g[['DC_POWER','AC_POWER']].plot(figsize = (16,12));

png

The diference is huge.

First we are going to check if they go in the same direction even if it is on a different scale (with a correlation), and then we are going to check what is the average ratio between both measures.

p1g.DC_POWER.corr(p1g.AC_POWER)
0.9999962553331414
(p1g.DC_POWER / p1g.AC_POWER).describe()
count          36827.00
mean              10.23
std                0.05
min                9.38
25%               10.20
50%               10.22
75%               10.25
max               10.47
dtype: float64

It seems that the Inverters are transforming only 10% from DC to AC, which a priori is very low.

In any case, from quality we got here and we will continue exploring this in the analysis part and comparing it with Plant 2 to see if the same thing happens.

We analyze the categorical variable, which is the identifier of the inverters.

p1g.SOURCE_KEY.nunique()
22
p1g.SOURCE_KEY.value_counts()
bvBOhCH3iADSZry    3155
1BY6WEcLGh8j5v7    3154
7JYdWkrLSPkdwr4    3133
VHMLBKoKgIrUVDU    3133
ZnxXDlPa8U1GXgE    3130
ih0vzX44oOqAx2f    3130
z9Y9gH1T5YWrNuG    3126
wCURE6d3bPkepu2    3126
uHbuxQJl8lW7ozc    3125
pkci93gMrogZuBj    3125
iCRJl6heRkivqQ3    3125
rGa61gmuvPhdLxV    3124
sjndEbLyjtCKgGv    3124
McdE0feGgRqW7Ca    3124
zVJPv84UY57bAof    3124
ZoEaEvLYb1n2sOq    3123
1IF53ai7Xc0U56Y    3119
adLQvlD726eNBSB    3119
zBIq5rxdHJRwDNY    3119
WRmjgnKYAwPKWDb    3118
3PZuoBAID5Wc2HD    3118
YxYtjZvoooNbGkE    3104
Name: SOURCE_KEY, dtype: int64

Conclusions:

 * plant 1 has 22 inverters
 * All have a similar number of measures although not exactly the same
 * They could be stops due to maintenance, or simple data loss, but we write it down for the analysis phase

We are going to analyze the DAILY_YIELD variables, since the metadata tells us that the TOTAL_YIELD variable is the accumulated total per inverter, but in DAILY_YIELD it does not specify it, so we do not know if it is accumulated per inverter or per plant.

The hypothesis is the following: if it is per plant, there should be no differences between the data from the different inverters at the same specific moment.

And therefore if we see that there are differences then it is that the data is due to the inverter.

To verify it, it is useful to take a sample of inverters.

seleccion = list(p1g.SOURCE_KEY.unique()[:5])
temp = p1g[p1g.SOURCE_KEY.isin(seleccion)].set_index('DATE_TIME')
temp
PLANT_ID SOURCE_KEY DC_POWER AC_POWER DAILY_YIELD TOTAL_YIELD
DATE_TIME
2020-05-15 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 0.00 6259559.00
2020-05-15 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6183645.00
2020-05-15 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 6987759.00
2020-05-15 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7602960.00
2020-05-15 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7158964.00
... ... ... ... ... ... ...
2020-06-17 23:45:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 5521.00 6485319.00
2020-06-17 23:45:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 6034.00 6433566.00
2020-06-17 23:45:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 6052.00 7237425.00
2020-06-17 23:45:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 5856.00 7846821.00
2020-06-17 23:45:00 p1 McdE0feGgRqW7Ca 0.00 0.00 5992.00 7408587.00

15648 rows × 6 columns

In the data we already see that it is different, but we are going to check on more data so that it is not an effect of those specific records.

We are going to see it graphically, and for simplicity we are going to take only a sample of days.

Since we have the date as index we remember that we can use partial and slice indexing.

temp = temp.loc['2020-06-01':'2020-06-05']
temp
PLANT_ID SOURCE_KEY DC_POWER AC_POWER DAILY_YIELD TOTAL_YIELD
DATE_TIME
2020-06-01 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 829.00 6377931.00
2020-06-01 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6311432.00
2020-06-01 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 7115304.00
2020-06-01 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 730.38 7727821.00
2020-06-01 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7286760.00
... ... ... ... ... ... ...
2020-06-05 23:45:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 7162.00 6412542.00
2020-06-05 23:45:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6348557.00
2020-06-05 23:45:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 7152486.00
2020-06-05 23:45:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7764140.00
2020-06-05 23:45:00 p1 McdE0feGgRqW7Ca 0.00 0.00 4950.00 7324681.00

2370 rows × 6 columns

plt.figure(figsize = (16,12))
sns.lineplot(data = temp.reset_index(), x = temp.reset_index().DATE_TIME, y = 'DAILY_YIELD', hue = 'SOURCE_KEY');

png

Definitely different inverters have different data at the same time, so we conclude that this variable is per inverter

Finally we are going to analyze the period in which we have data and if the number of daily measurements is constant.

p1g.DATE_TIME.dt.date.value_counts().sort_index().plot.bar(figsize = (12,8));

png

Conclusions:

 * The data period is between May 15, 2020 and June 17, 2020
 * We have data for every day, there are no intermediate days missing
 * But some days like 05/21 or 05/29 have fewer measurements
 * So it doesn't look 100% regular

Data quality plant 1 - environmental sensor data

p1w.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3182 entries, 0 to 3181
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   DATE_TIME            3182 non-null   object 
 1   PLANT_ID             3182 non-null   int64  
 2   SOURCE_KEY           3182 non-null   object 
 3   AMBIENT_TEMPERATURE  3182 non-null   float64
 4   MODULE_TEMPERATURE   3182 non-null   float64
 5   IRRADIATION          3182 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 149.3+ KB

We correct the type of DATE_TIME

p1w.DATE_TIME = pd.to_datetime(p1w.DATE_TIME)
p1w.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3182 entries, 0 to 3181
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   DATE_TIME            3182 non-null   datetime64[ns]
 1   PLANT_ID             3182 non-null   int64         
 2   SOURCE_KEY           3182 non-null   object        
 3   AMBIENT_TEMPERATURE  3182 non-null   float64       
 4   MODULE_TEMPERATURE   3182 non-null   float64       
 5   IRRADIATION          3182 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 149.3+ KB
p1w.head()
DATE_TIME PLANT_ID SOURCE_KEY AMBIENT_TEMPERATURE MODULE_TEMPERATURE IRRADIATION
0 2020-05-15 00:00:00 4135001 HmiyD2TTLFNqkNe 25.18 22.86 0.00
1 2020-05-15 00:15:00 4135001 HmiyD2TTLFNqkNe 25.08 22.76 0.00
2 2020-05-15 00:30:00 4135001 HmiyD2TTLFNqkNe 24.94 22.59 0.00
3 2020-05-15 00:45:00 4135001 HmiyD2TTLFNqkNe 24.85 22.36 0.00
4 2020-05-15 01:00:00 4135001 HmiyD2TTLFNqkNe 24.62 22.17 0.00

We replaced the name of the plant.

p1w['PLANT_ID'] = p1w.PLANT_ID.replace(4135001,'p1')
p1w
DATE_TIME PLANT_ID SOURCE_KEY AMBIENT_TEMPERATURE MODULE_TEMPERATURE IRRADIATION
0 2020-05-15 00:00:00 p1 HmiyD2TTLFNqkNe 25.18 22.86 0.00
1 2020-05-15 00:15:00 p1 HmiyD2TTLFNqkNe 25.08 22.76 0.00
2 2020-05-15 00:30:00 p1 HmiyD2TTLFNqkNe 24.94 22.59 0.00
3 2020-05-15 00:45:00 p1 HmiyD2TTLFNqkNe 24.85 22.36 0.00
4 2020-05-15 01:00:00 p1 HmiyD2TTLFNqkNe 24.62 22.17 0.00
... ... ... ... ... ... ...
3177 2020-06-17 22:45:00 p1 HmiyD2TTLFNqkNe 22.15 21.48 0.00
3178 2020-06-17 23:00:00 p1 HmiyD2TTLFNqkNe 22.13 21.39 0.00
3179 2020-06-17 23:15:00 p1 HmiyD2TTLFNqkNe 22.01 20.71 0.00
3180 2020-06-17 23:30:00 p1 HmiyD2TTLFNqkNe 21.97 20.73 0.00
3181 2020-06-17 23:45:00 p1 HmiyD2TTLFNqkNe 21.91 20.43 0.00

3182 rows × 6 columns

We review the statistics.

p1w.describe().T
count mean std min 25% 50% 75% max
AMBIENT_TEMPERATURE 3182.00 25.53 3.35 20.40 22.71 24.61 27.92 35.25
MODULE_TEMPERATURE 3182.00 31.09 12.26 18.14 21.09 24.62 41.31 65.55
IRRADIATION 3182.00 0.23 0.30 0.00 0.00 0.02 0.45 1.22

We check the categorical variable, which is the sensor identifier.

p1w.SOURCE_KEY.nunique()
1

There is only one sensor for environmental variables in the plant.

We revise the date.

p1w.DATE_TIME.dt.date.value_counts().sort_index().plot.bar(figsize = (12,8));

png

Conclusions:

 * The data period is between May 15, 2020 and June 17, 2020
 * We have data for every day, there are no intermediate days missing
 * But some days like 05/21 or 05/29 have fewer measurements
 * So it doesn't look 100% regular

Plant quality 2 - generation data

p2g.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67698 entries, 0 to 67697
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE_TIME    67698 non-null  object 
 1   PLANT_ID     67698 non-null  int64  
 2   SOURCE_KEY   67698 non-null  object 
 3   DC_POWER     67698 non-null  float64
 4   AC_POWER     67698 non-null  float64
 5   DAILY_YIELD  67698 non-null  float64
 6   TOTAL_YIELD  67698 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.6+ MB
p2g['DATE_TIME'] = pd.to_datetime(p2g.DATE_TIME)
p2g['PLANT_ID'] = p2g.PLANT_ID.replace(4136001, 'p2')
p2g.head()
DATE_TIME PLANT_ID SOURCE_KEY DC_POWER AC_POWER DAILY_YIELD TOTAL_YIELD
0 2020-05-15 p2 4UPUqMRk7TRMgml 0.00 0.00 9425.00 2429011.00
1 2020-05-15 p2 81aHJ1q11NBPMrL 0.00 0.00 0.00 1215278736.00
2 2020-05-15 p2 9kRcWv60rDACzjR 0.00 0.00 3075.33 2247719577.00
3 2020-05-15 p2 Et9kgGMDl729KT4 0.00 0.00 269.93 1704250.00
4 2020-05-15 p2 IQ2d7wF4YD8zU1Q 0.00 0.00 3177.00 19941526.00
p2g.describe().T
count mean std min 25% 50% 75% max
DC_POWER 67698.00 246.70 370.57 0.00 0.00 0.00 446.59 1420.93
AC_POWER 67698.00 241.28 362.11 0.00 0.00 0.00 438.22 1385.42
DAILY_YIELD 67698.00 3294.89 2919.45 0.00 272.75 2911.00 5534.00 9873.00
TOTAL_YIELD 67698.00 658944788.42 729667771.07 0.00 19964944.87 282627587.00 1348495113.00 2247916295.00

In this case the values of DC and AC are much closer to each other.

Let's calculate the ratio.

(p2g.DC_POWER / p2g.AC_POWER).describe()
count          32036.00
mean               1.02
std                0.01
min                0.99
25%                1.02
50%                1.02
75%                1.03
max                1.10
dtype: float64

Now the values of the ratio are very close to one.

We analyze the categorical variable, which is the identifier of the inverters.

p2g.SOURCE_KEY.nunique()
22
p2g.SOURCE_KEY.value_counts()
xoJJ8DcxJEcupym    3259
WcxssY2VbP4hApt    3259
9kRcWv60rDACzjR    3259
vOuJvMaM2sgwLmb    3259
rrq4fwE8jgrTyWY    3259
LYwnQax7tkwH5Cb    3259
LlT2YUhhzqhg5Sw    3259
q49J1IKaHRwDQnt    3259
oZZkBaNadn6DNKz    3259
PeE6FRyGXUgsRhN    3259
81aHJ1q11NBPMrL    3259
V94E5Ben1TlhnDV    3259
oZ35aAeoifZaQzV    3195
4UPUqMRk7TRMgml    3195
Qf4GUc1pJu5T6c6    3195
Mx2yZCDsyf6DPfv    3195
Et9kgGMDl729KT4    3195
Quc1TzYxW2pYoWX    3195
mqwcsP2rE7J0TFp    2355
NgDl19wMapZy17u    2355
IQ2d7wF4YD8zU1Q    2355
xMbIugepa2P7lBB    2355
Name: SOURCE_KEY, dtype: int64

Conclusiones:

* La planta 2 tiene 22 inverters
* Todos tienen un número similar de medidas aunque no exactamente igual
* A excepción de 4 que tienen unas 800 medidas menos
* Lo apuntamos para la fase de análisis

Por último vamos a analizar la fecha.

p2g.DATE_TIME.dt.date.value_counts().sort_index().plot.bar(figsize = (12,8));

png

Conclusiones:

* El período de datos es entre el 15 de Mayo del 2020 y el 17 de Junio de 2020
* Tenemos datos para todos los días, no falta ninguno intermedio
* Pero algunos días como el 20/05 y varios más tienen menos mediciones
* Por lo que no parece 100% regular

Calidad de datos planta 2 - datos de sensor ambiental

p2w.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3259 entries, 0 to 3258
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   DATE_TIME            3259 non-null   object 
 1   PLANT_ID             3259 non-null   int64  
 2   SOURCE_KEY           3259 non-null   object 
 3   AMBIENT_TEMPERATURE  3259 non-null   float64
 4   MODULE_TEMPERATURE   3259 non-null   float64
 5   IRRADIATION          3259 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 152.9+ KB

We correct the type of DATE_TIME

p2w.DATE_TIME = pd.to_datetime(p2w.DATE_TIME)
p2w.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3259 entries, 0 to 3258
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   DATE_TIME            3259 non-null   datetime64[ns]
 1   PLANT_ID             3259 non-null   int64         
 2   SOURCE_KEY           3259 non-null   object        
 3   AMBIENT_TEMPERATURE  3259 non-null   float64       
 4   MODULE_TEMPERATURE   3259 non-null   float64       
 5   IRRADIATION          3259 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 152.9+ KB
p2w.head()
DATE_TIME PLANT_ID SOURCE_KEY AMBIENT_TEMPERATURE MODULE_TEMPERATURE IRRADIATION
0 2020-05-15 00:00:00 4136001 iq8k7ZNt4Mwm3w0 27.00 25.06 0.00
1 2020-05-15 00:15:00 4136001 iq8k7ZNt4Mwm3w0 26.88 24.42 0.00
2 2020-05-15 00:30:00 4136001 iq8k7ZNt4Mwm3w0 26.68 24.43 0.00
3 2020-05-15 00:45:00 4136001 iq8k7ZNt4Mwm3w0 26.50 24.42 0.00
4 2020-05-15 01:00:00 4136001 iq8k7ZNt4Mwm3w0 26.60 25.09 0.00

We replaced the name of the plant.

p2w['PLANT_ID'] = p2w.PLANT_ID.replace(4136001,'p2')
p2w
DATE_TIME PLANT_ID SOURCE_KEY AMBIENT_TEMPERATURE MODULE_TEMPERATURE IRRADIATION
0 2020-05-15 00:00:00 p2 iq8k7ZNt4Mwm3w0 27.00 25.06 0.00
1 2020-05-15 00:15:00 p2 iq8k7ZNt4Mwm3w0 26.88 24.42 0.00
2 2020-05-15 00:30:00 p2 iq8k7ZNt4Mwm3w0 26.68 24.43 0.00
3 2020-05-15 00:45:00 p2 iq8k7ZNt4Mwm3w0 26.50 24.42 0.00
4 2020-05-15 01:00:00 p2 iq8k7ZNt4Mwm3w0 26.60 25.09 0.00
... ... ... ... ... ... ...
3254 2020-06-17 22:45:00 p2 iq8k7ZNt4Mwm3w0 23.51 22.86 0.00
3255 2020-06-17 23:00:00 p2 iq8k7ZNt4Mwm3w0 23.48 22.74 0.00
3256 2020-06-17 23:15:00 p2 iq8k7ZNt4Mwm3w0 23.35 22.49 0.00
3257 2020-06-17 23:30:00 p2 iq8k7ZNt4Mwm3w0 23.29 22.37 0.00
3258 2020-06-17 23:45:00 p2 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00

3259 rows × 6 columns

We review the statistics.

p2w.describe().T
count mean std min 25% 50% 75% max
AMBIENT_TEMPERATURE 3259.00 28.07 4.06 20.94 24.60 26.98 31.06 39.18
MODULE_TEMPERATURE 3259.00 32.77 11.34 20.27 23.72 27.53 40.48 66.64
IRRADIATION 3259.00 0.23 0.31 0.00 0.00 0.02 0.44 1.10

We analyze the categorical variable, which is the sensor identifier.

p2w.SOURCE_KEY.nunique()
1

There is only one sensor for environmental variables in the plant.

We revise the date.

p2w.DATE_TIME.dt.date.value_counts().sort_index().plot.bar(figsize = (12,8));

png

Conclusions:

 * The data period is between May 15, 2020 and June 17, 2020
 * We have data for every day, there are no intermediate days missing
 * But some days like 05/15 or others have fewer measurements, although they are much less missing than in the other datasets
 * But it doesn't look 100% regular

Pending data quality issues for further analysis

  • On plant 1 it seems that the inverters are transforming only 10% from DC to AC, which a priori is very low.
  • On plant 2 the ratio is much closer to 1.
  • The measurement intervals are not 100% regular. There are days with fewer measurements, and there are also differences due to inverters.

CREATION OF THE ANALYTICAL DATAMART

We are going to do a union by parts.

First the two generation datasets. Which will be a stack of records since the fields are the same.

After the two of environmental measures. Which will be a stack of records since the fields are the same.

And finally we will cross both partials through integration by key fields.

Union of generation datasets

gener = pd.concat([p1g,p2g],axis = 'index')
gener
DATE_TIME PLANT_ID SOURCE_KEY DC_POWER AC_POWER DAILY_YIELD TOTAL_YIELD
0 2020-05-15 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 0.00 6259559.00
1 2020-05-15 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6183645.00
2 2020-05-15 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 6987759.00
3 2020-05-15 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7602960.00
4 2020-05-15 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7158964.00
... ... ... ... ... ... ... ...
67693 2020-06-17 23:45:00 p2 q49J1IKaHRwDQnt 0.00 0.00 4157.00 520758.00
67694 2020-06-17 23:45:00 p2 rrq4fwE8jgrTyWY 0.00 0.00 3931.00 121131356.00
67695 2020-06-17 23:45:00 p2 vOuJvMaM2sgwLmb 0.00 0.00 4322.00 2427691.00
67696 2020-06-17 23:45:00 p2 xMbIugepa2P7lBB 0.00 0.00 4218.00 106896394.00
67697 2020-06-17 23:45:00 p2 xoJJ8DcxJEcupym 0.00 0.00 4316.00 209335741.00

136476 rows × 7 columns

Let's rename the variables now to make them more descriptive and usable.

gener.columns = ['fecha','planta','inverter_id','kw_dc','kw_ac','kw_dia','kw_total']
gener
fecha planta inverter_id kw_dc kw_ac kw_dia kw_total
0 2020-05-15 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 0.00 6259559.00
1 2020-05-15 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6183645.00
2 2020-05-15 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 6987759.00
3 2020-05-15 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7602960.00
4 2020-05-15 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7158964.00
... ... ... ... ... ... ... ...
67693 2020-06-17 23:45:00 p2 q49J1IKaHRwDQnt 0.00 0.00 4157.00 520758.00
67694 2020-06-17 23:45:00 p2 rrq4fwE8jgrTyWY 0.00 0.00 3931.00 121131356.00
67695 2020-06-17 23:45:00 p2 vOuJvMaM2sgwLmb 0.00 0.00 4322.00 2427691.00
67696 2020-06-17 23:45:00 p2 xMbIugepa2P7lBB 0.00 0.00 4218.00 106896394.00
67697 2020-06-17 23:45:00 p2 xoJJ8DcxJEcupym 0.00 0.00 4316.00 209335741.00

136476 rows × 7 columns

Now that we have the 2 plants together we are going to do what is called a consistency analysis, since according to the documentation kw_dia and kw_total are directly related to kw_dc and kw_ac.

We are going to try to replicate the data of kw_dia and kw_total.

gener2 = gener.copy()

Creamos una variable date para poder agregar por ella.

gener2['date'] = gener2.fecha.dt.date
gener2
fecha planta inverter_id kw_dc kw_ac kw_dia kw_total date
0 2020-05-15 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 0.00 6259559.00 2020-05-15
1 2020-05-15 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6183645.00 2020-05-15
2 2020-05-15 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 6987759.00 2020-05-15
3 2020-05-15 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7602960.00 2020-05-15
4 2020-05-15 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7158964.00 2020-05-15
... ... ... ... ... ... ... ... ...
67693 2020-06-17 23:45:00 p2 q49J1IKaHRwDQnt 0.00 0.00 4157.00 520758.00 2020-06-17
67694 2020-06-17 23:45:00 p2 rrq4fwE8jgrTyWY 0.00 0.00 3931.00 121131356.00 2020-06-17
67695 2020-06-17 23:45:00 p2 vOuJvMaM2sgwLmb 0.00 0.00 4322.00 2427691.00 2020-06-17
67696 2020-06-17 23:45:00 p2 xMbIugepa2P7lBB 0.00 0.00 4218.00 106896394.00 2020-06-17
67697 2020-06-17 23:45:00 p2 xoJJ8DcxJEcupym 0.00 0.00 4316.00 209335741.00 2020-06-17

136476 rows × 8 columns

The sum per plant, date and inverter of kw_dc or kw_ac should match the maximum of kw_dia.

gener2 = gener2.groupby(['planta','date','inverter_id']).agg({'kw_dc':sum,
                                                              'kw_ac':sum,
                                                              'kw_dia':max,
                                                              'kw_total':max}).reset_index()
gener2
planta date inverter_id kw_dc kw_ac kw_dia kw_total
0 p1 2020-05-15 1BY6WEcLGh8j5v7 235340.70 23046.55 5754.00 6265313.00
1 p1 2020-05-15 1IF53ai7Xc0U56Y 258911.11 25343.29 6357.00 6190002.00
2 p1 2020-05-15 3PZuoBAID5Wc2HD 254766.05 24937.70 6274.00 6994033.00
3 p1 2020-05-15 7JYdWkrLSPkdwr4 250608.34 24533.91 6116.00 7609076.00
4 p1 2020-05-15 McdE0feGgRqW7Ca 264030.98 25840.63 6471.00 7165435.00
... ... ... ... ... ... ... ...
1459 p2 2020-06-17 q49J1IKaHRwDQnt 17001.51 16655.15 4157.00 520758.00
1460 p2 2020-06-17 rrq4fwE8jgrTyWY 16073.93 15748.92 3931.00 121131356.00
1461 p2 2020-06-17 vOuJvMaM2sgwLmb 17710.00 17345.44 4322.00 2427691.00
1462 p2 2020-06-17 xMbIugepa2P7lBB 17211.23 16860.38 5502.00 106896394.00
1463 p2 2020-06-17 xoJJ8DcxJEcupym 17640.42 17278.51 5327.00 209335741.00

1464 rows × 7 columns

We order to be able to analyze.

gener2 = gener2.sort_values(['planta','inverter_id','date'])
gener2
planta date inverter_id kw_dc kw_ac kw_dia kw_total
0 p1 2020-05-15 1BY6WEcLGh8j5v7 235340.70 23046.55 5754.00 6265313.00
22 p1 2020-05-16 1BY6WEcLGh8j5v7 256629.88 25124.49 6292.00 6271605.00
44 p1 2020-05-17 1BY6WEcLGh8j5v7 288039.82 28172.85 7045.00 6278650.00
66 p1 2020-05-18 1BY6WEcLGh8j5v7 204030.30 19970.51 4998.00 6283648.00
88 p1 2020-05-19 1BY6WEcLGh8j5v7 232277.27 22741.18 6449.00 6290097.00
... ... ... ... ... ... ... ...
1375 p2 2020-06-13 xoJJ8DcxJEcupym 27443.74 26840.58 6632.00 209312200.00
1397 p2 2020-06-14 xoJJ8DcxJEcupym 30768.78 30088.97 7268.00 209319687.00
1419 p2 2020-06-15 xoJJ8DcxJEcupym 25597.24 25049.49 7412.67 209325949.00
1441 p2 2020-06-16 xoJJ8DcxJEcupym 22335.69 21870.63 6203.20 209331425.00
1463 p2 2020-06-17 xoJJ8DcxJEcupym 17640.42 17278.51 5327.00 209335741.00

1464 rows × 7 columns

Kw_dia does not match either kw_dc or kw_ac at all.

We are going to see if it agrees with kw_total, for this we calculate the daily increase of kw_total that should coincide with the maximum of kw_day of the previous day.

gener2['lag1'] = gener2.groupby(['planta','inverter_id']).kw_total.shift(1)
gener2['incremento'] = gener2.kw_total - gener2.lag1
gener2
planta date inverter_id kw_dc kw_ac kw_dia kw_total lag1 incremento
0 p1 2020-05-15 1BY6WEcLGh8j5v7 235340.70 23046.55 5754.00 6265313.00 NaN NaN
22 p1 2020-05-16 1BY6WEcLGh8j5v7 256629.88 25124.49 6292.00 6271605.00 6265313.00 6292.00
44 p1 2020-05-17 1BY6WEcLGh8j5v7 288039.82 28172.85 7045.00 6278650.00 6271605.00 7045.00
66 p1 2020-05-18 1BY6WEcLGh8j5v7 204030.30 19970.51 4998.00 6283648.00 6278650.00 4998.00
88 p1 2020-05-19 1BY6WEcLGh8j5v7 232277.27 22741.18 6449.00 6290097.00 6283648.00 6449.00
... ... ... ... ... ... ... ... ... ...
1375 p2 2020-06-13 xoJJ8DcxJEcupym 27443.74 26840.58 6632.00 209312200.00 209305520.00 6680.00
1397 p2 2020-06-14 xoJJ8DcxJEcupym 30768.78 30088.97 7268.00 209319687.00 209312200.00 7487.00
1419 p2 2020-06-15 xoJJ8DcxJEcupym 25597.24 25049.49 7412.67 209325949.00 209319687.00 6262.00
1441 p2 2020-06-16 xoJJ8DcxJEcupym 22335.69 21870.63 6203.20 209331425.00 209325949.00 5476.00
1463 p2 2020-06-17 xoJJ8DcxJEcupym 17640.42 17278.51 5327.00 209335741.00 209331425.00 4316.00

1464 rows × 9 columns

We check on plant 1.

gener2[gener2.planta == 'p1'].head(50)
planta date inverter_id kw_dc kw_ac kw_dia kw_total lag1 incremento
0 p1 2020-05-15 1BY6WEcLGh8j5v7 235340.70 23046.55 5754.00 6265313.00 NaN NaN
22 p1 2020-05-16 1BY6WEcLGh8j5v7 256629.88 25124.49 6292.00 6271605.00 6265313.00 6292.00
44 p1 2020-05-17 1BY6WEcLGh8j5v7 288039.82 28172.85 7045.00 6278650.00 6271605.00 7045.00
66 p1 2020-05-18 1BY6WEcLGh8j5v7 204030.30 19970.51 4998.00 6283648.00 6278650.00 4998.00
88 p1 2020-05-19 1BY6WEcLGh8j5v7 232277.27 22741.18 6449.00 6290097.00 6283648.00 6449.00
110 p1 2020-05-20 1BY6WEcLGh8j5v7 230412.62 22516.26 8249.00 6298346.00 6290097.00 8249.00
132 p1 2020-05-21 1BY6WEcLGh8j5v7 288676.60 28223.13 7243.00 6305589.00 6298346.00 7243.00
154 p1 2020-05-22 1BY6WEcLGh8j5v7 280809.07 27456.63 6848.00 6312437.00 6305589.00 6848.00
176 p1 2020-05-23 1BY6WEcLGh8j5v7 326468.27 31922.93 7966.00 6320403.00 6312437.00 7966.00
198 p1 2020-05-24 1BY6WEcLGh8j5v7 309111.73 30220.37 7537.00 6327940.00 6320403.00 7537.00
220 p1 2020-05-25 1BY6WEcLGh8j5v7 339109.95 33144.65 8268.00 6336208.00 6327940.00 8268.00
242 p1 2020-05-26 1BY6WEcLGh8j5v7 305515.95 29873.74 7456.43 6343669.00 6336208.00 7461.00
264 p1 2020-05-27 1BY6WEcLGh8j5v7 251250.52 24595.61 6164.00 6349833.00 6343669.00 6164.00
286 p1 2020-05-28 1BY6WEcLGh8j5v7 326520.42 31917.63 7977.00 6357810.00 6349833.00 7977.00
308 p1 2020-05-29 1BY6WEcLGh8j5v7 310776.57 30387.48 7564.00 6365374.00 6357810.00 7564.00
330 p1 2020-05-30 1BY6WEcLGh8j5v7 276466.89 27054.69 6754.00 6372128.00 6365374.00 6754.00
352 p1 2020-05-31 1BY6WEcLGh8j5v7 234509.23 22928.22 5803.00 6377931.00 6372128.00 5803.00
374 p1 2020-06-01 1BY6WEcLGh8j5v7 225219.77 22033.25 5508.00 6383439.00 6377931.00 5508.00
396 p1 2020-06-02 1BY6WEcLGh8j5v7 286633.75 28043.89 7029.00 6390468.00 6383439.00 7029.00
418 p1 2020-06-03 1BY6WEcLGh8j5v7 287069.37 28085.88 7341.00 6397809.00 6390468.00 7341.00
440 p1 2020-06-04 1BY6WEcLGh8j5v7 309653.82 30287.02 7571.00 6405380.00 6397809.00 7571.00
462 p1 2020-06-05 1BY6WEcLGh8j5v7 293353.25 28693.91 7162.00 6412542.00 6405380.00 7162.00
484 p1 2020-06-06 1BY6WEcLGh8j5v7 243297.18 23793.42 5940.00 6418482.00 6412542.00 5940.00
506 p1 2020-06-07 1BY6WEcLGh8j5v7 217217.71 21247.58 5268.00 6423750.00 6418482.00 5268.00
528 p1 2020-06-08 1BY6WEcLGh8j5v7 316710.85 30969.72 7864.00 6431614.00 6423750.00 7864.00
550 p1 2020-06-09 1BY6WEcLGh8j5v7 301395.46 29476.87 7456.00 6439070.00 6431614.00 7456.00
572 p1 2020-06-10 1BY6WEcLGh8j5v7 242187.71 23707.68 5911.00 6444981.00 6439070.00 5911.00
594 p1 2020-06-11 1BY6WEcLGh8j5v7 214404.59 21003.57 5257.00 6450238.00 6444981.00 5257.00
616 p1 2020-06-12 1BY6WEcLGh8j5v7 222758.55 21813.00 5441.00 6455679.00 6450238.00 5441.00
638 p1 2020-06-13 1BY6WEcLGh8j5v7 326297.09 31895.25 7984.00 6463663.00 6455679.00 7984.00
660 p1 2020-06-14 1BY6WEcLGh8j5v7 164127.30 16085.33 4012.00 6467675.00 6463663.00 4012.00
682 p1 2020-06-15 1BY6WEcLGh8j5v7 255298.41 24986.86 6275.00 6473950.00 6467675.00 6275.00
704 p1 2020-06-16 1BY6WEcLGh8j5v7 237829.61 23288.02 5848.00 6479798.00 6473950.00 5848.00
726 p1 2020-06-17 1BY6WEcLGh8j5v7 224315.74 21957.17 5521.00 6485319.00 6479798.00 5521.00
1 p1 2020-05-15 1IF53ai7Xc0U56Y 258911.11 25343.29 6357.00 6190002.00 NaN NaN
23 p1 2020-05-16 1IF53ai7Xc0U56Y 269678.23 26396.93 6592.00 6196594.00 6190002.00 6592.00
45 p1 2020-05-17 1IF53ai7Xc0U56Y 316701.61 30962.51 7759.00 6204353.00 6196594.00 7759.00
67 p1 2020-05-18 1IF53ai7Xc0U56Y 221984.75 21723.95 5453.00 6209806.00 6204353.00 5453.00
89 p1 2020-05-19 1IF53ai7Xc0U56Y 237092.43 23202.78 6568.75 6216430.00 6209806.00 6624.00
111 p1 2020-05-20 1IF53ai7Xc0U56Y 251521.07 24564.96 8997.00 6225427.00 6216430.00 8997.00
133 p1 2020-05-21 1IF53ai7Xc0U56Y 310547.48 30352.05 7816.50 6233247.00 6225427.00 7820.00
155 p1 2020-05-22 1IF53ai7Xc0U56Y 296998.09 29011.33 7287.00 6240534.00 6233247.00 7287.00
177 p1 2020-05-23 1IF53ai7Xc0U56Y 355283.96 34726.34 8673.00 6249207.00 6240534.00 8673.00
199 p1 2020-05-24 1IF53ai7Xc0U56Y 339436.82 33170.41 8273.00 6257480.00 6249207.00 8273.00
221 p1 2020-05-25 1IF53ai7Xc0U56Y 371166.75 36249.54 9048.00 6266528.00 6257480.00 9048.00
243 p1 2020-05-26 1IF53ai7Xc0U56Y 332953.20 32544.54 8109.50 6274638.00 6266528.00 8110.00
265 p1 2020-05-27 1IF53ai7Xc0U56Y 279934.18 27390.71 6824.00 6281462.00 6274638.00 6824.00
287 p1 2020-05-28 1IF53ai7Xc0U56Y 348916.90 34097.52 8496.00 6289958.00 6281462.00 8496.00
309 p1 2020-05-29 1IF53ai7Xc0U56Y 338805.39 33116.10 8253.00 6298211.00 6289958.00 8253.00
331 p1 2020-05-30 1IF53ai7Xc0U56Y 300749.29 29414.52 7332.00 6305543.00 6298211.00 7332.00

We check on plant 2.

gener2[gener2.planta == 'p2'].head(50)
planta date inverter_id kw_dc kw_ac kw_dia kw_total lag1 incremento
748 p2 2020-05-15 4UPUqMRk7TRMgml 17112.65 16744.08 9425.00 2433212.00 NaN NaN
770 p2 2020-05-16 4UPUqMRk7TRMgml 23305.45 22791.22 5677.00 2438889.00 2433212.00 5677.00
792 p2 2020-05-17 4UPUqMRk7TRMgml 25985.95 25430.42 6342.00 2445231.00 2438889.00 6342.00
814 p2 2020-05-18 4UPUqMRk7TRMgml 31218.58 30516.03 7641.00 2452872.00 2445231.00 7641.00
836 p2 2020-05-19 4UPUqMRk7TRMgml 8040.00 7878.39 7641.00 2454841.00 2452872.00 1969.00
858 p2 2020-05-20 4UPUqMRk7TRMgml 9852.13 9641.35 9423.00 2464264.00 2454841.00 9423.00
880 p2 2020-05-21 4UPUqMRk7TRMgml 37403.77 36540.63 9423.00 2473415.00 2464264.00 9151.00
898 p2 2020-05-22 4UPUqMRk7TRMgml 40414.35 39472.14 9863.00 2483278.00 2473415.00 9863.00
916 p2 2020-05-23 4UPUqMRk7TRMgml 39690.66 38769.69 9863.00 2492966.00 2483278.00 9688.00
934 p2 2020-05-24 4UPUqMRk7TRMgml 18963.72 18558.18 9688.00 2497601.00 2492966.00 4635.00
952 p2 2020-05-25 4UPUqMRk7TRMgml 37659.21 36789.31 9224.00 2506825.00 2497601.00 9224.00
970 p2 2020-05-26 4UPUqMRk7TRMgml 38900.22 37995.67 9501.00 2516326.00 2506825.00 9501.00
988 p2 2020-05-27 4UPUqMRk7TRMgml 38919.76 38017.83 9501.00 2525827.00 2516326.00 9501.00
1006 p2 2020-05-28 4UPUqMRk7TRMgml 15261.36 14937.97 9501.00 2529556.00 2525827.00 3729.00
1024 p2 2020-05-29 4UPUqMRk7TRMgml 32521.41 31811.48 8126.00 2537682.00 2529556.00 8126.00
1046 p2 2020-05-30 4UPUqMRk7TRMgml 33420.25 32663.54 8185.00 2545867.00 2537682.00 8185.00
1068 p2 2020-05-31 4UPUqMRk7TRMgml 27317.90 26701.78 8185.00 2552559.00 2545867.00 6692.00
1090 p2 2020-06-01 4UPUqMRk7TRMgml 17867.74 17462.72 6692.00 2556897.00 2552559.00 4338.00
1112 p2 2020-06-02 4UPUqMRk7TRMgml 30176.94 29508.22 7404.00 2564301.00 2556897.00 7404.00
1134 p2 2020-06-03 4UPUqMRk7TRMgml 23834.47 23319.73 7404.00 2570414.00 2564301.00 6113.00
1156 p2 2020-06-04 4UPUqMRk7TRMgml 28736.11 28109.01 6990.00 2577404.00 2570414.00 6990.00
1178 p2 2020-06-05 4UPUqMRk7TRMgml 31594.37 30884.49 7756.00 2585160.00 2577404.00 7756.00
1200 p2 2020-06-06 4UPUqMRk7TRMgml 34226.22 33435.09 8376.00 2593536.00 2585160.00 8376.00
1222 p2 2020-06-07 4UPUqMRk7TRMgml 32026.91 31285.99 8376.00 2601362.00 2593536.00 7826.00
1244 p2 2020-06-08 4UPUqMRk7TRMgml 37814.61 36940.31 9210.00 2610588.00 2601362.00 9226.00
1266 p2 2020-06-09 4UPUqMRk7TRMgml 33205.47 32465.97 9217.93 2618699.00 2610588.00 8111.00
1288 p2 2020-06-10 4UPUqMRk7TRMgml 22793.26 22291.98 8098.00 2624287.00 2618699.00 5588.00
1310 p2 2020-06-11 4UPUqMRk7TRMgml 16069.58 15745.07 5588.00 2628220.00 2624287.00 3933.00
1332 p2 2020-06-12 4UPUqMRk7TRMgml 18684.47 18306.11 4548.00 2632794.00 2628220.00 4574.00
1354 p2 2020-06-13 4UPUqMRk7TRMgml 25901.76 25335.74 6289.00 2639150.00 2632794.00 6356.00
1376 p2 2020-06-14 4UPUqMRk7TRMgml 8863.51 8699.19 6462.80 2641320.00 2639150.00 2170.00
1398 p2 2020-06-15 4UPUqMRk7TRMgml 10205.79 9999.45 2304.00 2643840.00 2641320.00 2520.00
1420 p2 2020-06-16 4UPUqMRk7TRMgml 21302.43 20858.99 5004.00 2649049.00 2643840.00 5209.00
1442 p2 2020-06-17 4UPUqMRk7TRMgml 18153.88 17780.37 5004.00 2653495.00 2649049.00 4446.00
749 p2 2020-05-15 81aHJ1q11NBPMrL 13169.45 12896.82 3214.00 1215281950.00 NaN NaN
771 p2 2020-05-16 81aHJ1q11NBPMrL 19238.71 18812.08 4857.00 1215286807.00 1215281950.00 4857.00
793 p2 2020-05-17 81aHJ1q11NBPMrL 23577.93 23078.49 5758.00 1215292565.00 1215286807.00 5758.00
815 p2 2020-05-18 81aHJ1q11NBPMrL 14117.46 13820.30 3485.00 1215296050.00 1215292565.00 3485.00
837 p2 2020-05-19 81aHJ1q11NBPMrL 27893.17 27253.87 6771.00 1215302821.00 1215296050.00 6771.00
859 p2 2020-05-20 81aHJ1q11NBPMrL 14040.81 13748.34 3439.00 1215306260.00 1215302821.00 3439.00
881 p2 2020-05-21 81aHJ1q11NBPMrL 29804.31 29105.05 8570.00 1215314830.00 1215306260.00 8570.00
899 p2 2020-05-22 81aHJ1q11NBPMrL 13268.32 12992.02 3269.00 1215318099.00 1215314830.00 3269.00
917 p2 2020-05-23 81aHJ1q11NBPMrL 27742.32 27098.60 9333.00 1215327432.00 1215318099.00 9333.00
935 p2 2020-05-24 81aHJ1q11NBPMrL 15849.25 15512.54 9100.00 1215336532.00 1215327432.00 9100.00
953 p2 2020-05-25 81aHJ1q11NBPMrL 35834.31 35018.35 8860.00 1215345392.00 1215336532.00 8860.00
971 p2 2020-05-26 81aHJ1q11NBPMrL 14655.46 14326.96 3212.67 1215348540.93 1215345392.00 3148.93
989 p2 2020-05-27 81aHJ1q11NBPMrL 5380.56 5270.20 9138.00 1215363661.00 1215348540.93 15120.07
1007 p2 2020-05-28 81aHJ1q11NBPMrL 36201.81 35369.55 8947.00 1215372608.00 1215363661.00 8947.00
1025 p2 2020-05-29 81aHJ1q11NBPMrL 18928.49 18527.04 4812.00 1215377420.00 1215372608.00 4812.00
1047 p2 2020-05-30 81aHJ1q11NBPMrL 29388.91 28733.23 7156.00 1215384576.00 1215377420.00 7156.00

Conclusions: * kw_dia is consistent with kw_total * but these are not consistent with kw_dc or kw_ac * it's as if they are in different units or there is some calculation that we are not aware of * therefore we will have 2 blocks to be able to use: either kw_dc with kw_ac, or kw_dia with kw_total, but we cannot mix them together

Union of environmental measurement datasets

temper = pd.concat([p1w,p2w], axis = 'index')
temper
DATE_TIME PLANT_ID SOURCE_KEY AMBIENT_TEMPERATURE MODULE_TEMPERATURE IRRADIATION
0 2020-05-15 00:00:00 p1 HmiyD2TTLFNqkNe 25.18 22.86 0.00
1 2020-05-15 00:15:00 p1 HmiyD2TTLFNqkNe 25.08 22.76 0.00
2 2020-05-15 00:30:00 p1 HmiyD2TTLFNqkNe 24.94 22.59 0.00
3 2020-05-15 00:45:00 p1 HmiyD2TTLFNqkNe 24.85 22.36 0.00
4 2020-05-15 01:00:00 p1 HmiyD2TTLFNqkNe 24.62 22.17 0.00
... ... ... ... ... ... ...
3254 2020-06-17 22:45:00 p2 iq8k7ZNt4Mwm3w0 23.51 22.86 0.00
3255 2020-06-17 23:00:00 p2 iq8k7ZNt4Mwm3w0 23.48 22.74 0.00
3256 2020-06-17 23:15:00 p2 iq8k7ZNt4Mwm3w0 23.35 22.49 0.00
3257 2020-06-17 23:30:00 p2 iq8k7ZNt4Mwm3w0 23.29 22.37 0.00
3258 2020-06-17 23:45:00 p2 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00

6441 rows × 6 columns

Let's rename the variables now to make them more descriptive and usable.

temper.columns = ['fecha','planta','sensor_id','t_ambiente','t_modulo','irradiacion']
temper
fecha planta sensor_id t_ambiente t_modulo irradiacion
0 2020-05-15 00:00:00 p1 HmiyD2TTLFNqkNe 25.18 22.86 0.00
1 2020-05-15 00:15:00 p1 HmiyD2TTLFNqkNe 25.08 22.76 0.00
2 2020-05-15 00:30:00 p1 HmiyD2TTLFNqkNe 24.94 22.59 0.00
3 2020-05-15 00:45:00 p1 HmiyD2TTLFNqkNe 24.85 22.36 0.00
4 2020-05-15 01:00:00 p1 HmiyD2TTLFNqkNe 24.62 22.17 0.00
... ... ... ... ... ... ...
3254 2020-06-17 22:45:00 p2 iq8k7ZNt4Mwm3w0 23.51 22.86 0.00
3255 2020-06-17 23:00:00 p2 iq8k7ZNt4Mwm3w0 23.48 22.74 0.00
3256 2020-06-17 23:15:00 p2 iq8k7ZNt4Mwm3w0 23.35 22.49 0.00
3257 2020-06-17 23:30:00 p2 iq8k7ZNt4Mwm3w0 23.29 22.37 0.00
3258 2020-06-17 23:45:00 p2 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00

6441 rows × 6 columns

Creation of the analytical datamart

In this case, the key field is made up of date and plant and commands the generation dataset, since the temperature field only provides us with additional variables.

df = pd.merge(left = gener, right = temper, how = 'left', on = ['fecha','planta'])
df
fecha planta inverter_id kw_dc kw_ac kw_dia kw_total sensor_id t_ambiente t_modulo irradiacion
0 2020-05-15 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 0.00 6259559.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
1 2020-05-15 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6183645.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
2 2020-05-15 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 6987759.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
3 2020-05-15 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7602960.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
4 2020-05-15 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7158964.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
... ... ... ... ... ... ... ... ... ... ... ...
136471 2020-06-17 23:45:00 p2 q49J1IKaHRwDQnt 0.00 0.00 4157.00 520758.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
136472 2020-06-17 23:45:00 p2 rrq4fwE8jgrTyWY 0.00 0.00 3931.00 121131356.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
136473 2020-06-17 23:45:00 p2 vOuJvMaM2sgwLmb 0.00 0.00 4322.00 2427691.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
136474 2020-06-17 23:45:00 p2 xMbIugepa2P7lBB 0.00 0.00 4218.00 106896394.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
136475 2020-06-17 23:45:00 p2 xoJJ8DcxJEcupym 0.00 0.00 4316.00 209335741.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00

136476 rows × 11 columns

After an integration it is always convenient to check if nulls have been generated.

df.isna().sum()
fecha          0
planta         0
inverter_id    0
kw_dc          0
kw_ac          0
kw_dia         0
kw_total       0
sensor_id      4
t_ambiente     4
t_modulo       4
irradiacion    4
dtype: int64

We search if the nulls fulfill some pattern.

nulos = df[df.sensor_id.isna()]
nulos
fecha planta inverter_id kw_dc kw_ac kw_dia kw_total sensor_id t_ambiente t_modulo irradiacion
38544 2020-06-03 14:00:00 p1 1IF53ai7Xc0U56Y 7003.00 685.80 5601.00 6330385.00 NaN NaN NaN NaN
38545 2020-06-03 14:00:00 p1 adLQvlD726eNBSB 7204.00 705.40 5685.00 6419961.00 NaN NaN NaN NaN
38546 2020-06-03 14:00:00 p1 wCURE6d3bPkepu2 7545.00 738.70 5579.00 6928448.00 NaN NaN NaN NaN
38547 2020-06-03 14:00:00 p1 z9Y9gH1T5YWrNuG 7946.00 777.80 5541.00 7152815.00 NaN NaN NaN NaN

It is about June 3 at 2:00 p.m., which for some reason does not have temperature data but only for 4 inverters on floor 1.

We are going to search the temperature dataset if that datetime exists.

temper[temper.fecha.between('2020-06-03 13:30:00', '2020-06-03 14:30:00')]
fecha planta sensor_id t_ambiente t_modulo irradiacion
1797 2020-06-03 13:30:00 p1 HmiyD2TTLFNqkNe 28.56 48.78 0.75
1798 2020-06-03 13:45:00 p1 HmiyD2TTLFNqkNe 27.86 46.63 0.62
1799 2020-06-03 14:15:00 p1 HmiyD2TTLFNqkNe 28.36 50.63 0.72
1800 2020-06-03 14:30:00 p1 HmiyD2TTLFNqkNe 28.31 47.38 0.67
1874 2020-06-03 13:30:00 p2 iq8k7ZNt4Mwm3w0 28.60 37.68 0.41
1875 2020-06-03 13:45:00 p2 iq8k7ZNt4Mwm3w0 28.75 36.35 0.35
1876 2020-06-03 14:15:00 p2 iq8k7ZNt4Mwm3w0 30.45 45.63 0.76
1877 2020-06-03 14:30:00 p2 iq8k7ZNt4Mwm3w0 29.64 40.40 0.44

Indeed we see that this section is missing on both floors. However, there are only measurements at that time on floor 1, and only on 4 inverters.

So there would be two solutions:

  • impute these data for these investments
  • delete those 4 records

Since it seems like a measurement strip of its own, only 4 inverters on floor 1, we are going to choose to eliminate them.

df.dropna(inplace = True)
df
fecha planta inverter_id kw_dc kw_ac kw_dia kw_total sensor_id t_ambiente t_modulo irradiacion
0 2020-05-15 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 0.00 6259559.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
1 2020-05-15 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6183645.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
2 2020-05-15 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 6987759.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
3 2020-05-15 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7602960.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
4 2020-05-15 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7158964.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
... ... ... ... ... ... ... ... ... ... ... ...
136471 2020-06-17 23:45:00 p2 q49J1IKaHRwDQnt 0.00 0.00 4157.00 520758.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
136472 2020-06-17 23:45:00 p2 rrq4fwE8jgrTyWY 0.00 0.00 3931.00 121131356.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
136473 2020-06-17 23:45:00 p2 vOuJvMaM2sgwLmb 0.00 0.00 4322.00 2427691.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
136474 2020-06-17 23:45:00 p2 xMbIugepa2P7lBB 0.00 0.00 4218.00 106896394.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
136475 2020-06-17 23:45:00 p2 xoJJ8DcxJEcupym 0.00 0.00 4316.00 209335741.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00

136472 rows × 11 columns

Finally we are going to pass the date to the index to be able to use all the power of Pandas.

df.set_index('fecha', inplace = True)
df
planta inverter_id kw_dc kw_ac kw_dia kw_total sensor_id t_ambiente t_modulo irradiacion
fecha
2020-05-15 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 0.00 6259559.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
2020-05-15 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6183645.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
2020-05-15 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 6987759.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
2020-05-15 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7602960.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
2020-05-15 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7158964.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
... ... ... ... ... ... ... ... ... ... ...
2020-06-17 23:45:00 p2 q49J1IKaHRwDQnt 0.00 0.00 4157.00 520758.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
2020-06-17 23:45:00 p2 rrq4fwE8jgrTyWY 0.00 0.00 3931.00 121131356.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
2020-06-17 23:45:00 p2 vOuJvMaM2sgwLmb 0.00 0.00 4322.00 2427691.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
2020-06-17 23:45:00 p2 xMbIugepa2P7lBB 0.00 0.00 4218.00 106896394.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
2020-06-17 23:45:00 p2 xoJJ8DcxJEcupym 0.00 0.00 4316.00 209335741.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00

136472 rows × 10 columns

GUARDAMOS EL DATAMART

df.to_pickle('df.pickle')

DATA TRANSFORMATION

SET UP

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import fiser_tools as fs
fs.misc.dark_theme()


#Automcompletar rápido
%config IPCompleter.greedy=True

#Formato de display
pd.options.display.float_format = '{:15.2f}'.format

DATA UPLOAD

df = pd.read_pickle('df.pickle')
df
planta inverter_id kw_dc kw_ac kw_dia kw_total sensor_id t_ambiente t_modulo irradiacion
fecha
2020-05-15 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 0.00 6259559.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
2020-05-15 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6183645.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
2020-05-15 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 6987759.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
2020-05-15 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7602960.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
2020-05-15 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7158964.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00
... ... ... ... ... ... ... ... ... ... ...
2020-06-17 23:45:00 p2 q49J1IKaHRwDQnt 0.00 0.00 4157.00 520758.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
2020-06-17 23:45:00 p2 rrq4fwE8jgrTyWY 0.00 0.00 3931.00 121131356.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
2020-06-17 23:45:00 p2 vOuJvMaM2sgwLmb 0.00 0.00 4322.00 2427691.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
2020-06-17 23:45:00 p2 xMbIugepa2P7lBB 0.00 0.00 4218.00 106896394.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00
2020-06-17 23:45:00 p2 xoJJ8DcxJEcupym 0.00 0.00 4316.00 209335741.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00

136472 rows × 10 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 136472 entries, 2020-05-15 00:00:00 to 2020-06-17 23:45:00
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   planta       136472 non-null  object 
 1   inverter_id  136472 non-null  object 
 2   kw_dc        136472 non-null  float64
 3   kw_ac        136472 non-null  float64
 4   kw_dia       136472 non-null  float64
 5   kw_total     136472 non-null  float64
 6   sensor_id    136472 non-null  object 
 7   t_ambiente   136472 non-null  float64
 8   t_modulo     136472 non-null  float64
 9   irradiacion  136472 non-null  float64
dtypes: float64(7), object(3)
memory usage: 11.5+ MB

CREATION OF VARIABLES

We start by extracting the date components and adding them as new variables.

def componentes_fecha(dataframe):
    mes = dataframe.index.month
    dia = dataframe.index.day
    hora = dataframe.index.hour
    minuto = dataframe.index.minute


    return(pd.DataFrame({'mes':mes, 'dia':dia, 'hora':hora, 'minuto':minuto}))
df = pd.concat([df.reset_index(),componentes_fecha(df)], axis = 1).set_index('fecha')
df
planta inverter_id kw_dc kw_ac kw_dia kw_total sensor_id t_ambiente t_modulo irradiacion mes dia hora minuto
fecha
2020-05-15 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 0.00 6259559.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00 5 15 0 0
2020-05-15 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6183645.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00 5 15 0 0
2020-05-15 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 6987759.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00 5 15 0 0
2020-05-15 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7602960.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00 5 15 0 0
2020-05-15 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7158964.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00 5 15 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-17 23:45:00 p2 q49J1IKaHRwDQnt 0.00 0.00 4157.00 520758.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00 6 17 23 45
2020-06-17 23:45:00 p2 rrq4fwE8jgrTyWY 0.00 0.00 3931.00 121131356.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00 6 17 23 45
2020-06-17 23:45:00 p2 vOuJvMaM2sgwLmb 0.00 0.00 4322.00 2427691.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00 6 17 23 45
2020-06-17 23:45:00 p2 xMbIugepa2P7lBB 0.00 0.00 4218.00 106896394.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00 6 17 23 45
2020-06-17 23:45:00 p2 xoJJ8DcxJEcupym 0.00 0.00 4316.00 209335741.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00 6 17 23 45

136472 rows × 14 columns

We are going to create the inverter efficiency variable, which consists of the percentage of DC that successfully transforms to AC.

But we are presented with a very common difficulty in ratios, that the denominator can be zero.

If that were the case, when doing the ratio it would return a null.

In our case the denominator is DC, so if the DC generation were zero, the AC generation should also be zero.

We can correct that simply by imputing the nulls that come out with zeros.

def eficiencia_inverter(AC,DC):
    temp = AC / DC * 100
    return(temp.fillna(0))
df['eficiencia'] = eficiencia_inverter(df.kw_ac, df.kw_dc)

We check that it has not generated nulls.

df.eficiencia.isna().sum()
0

We visualize efficiency globally.

df.eficiencia.plot.kde();

png

Here is something important.

There are two clearly differentiated groups and one of them is clearly inefficient.

But for now we'll leave it written down and later we'll review which entity is having problems: plant, inverter, etc.

DATAFRAME REORDERING

In this case it is very important not to start analyzing by analyzing, but to follow the plan defined in the project design, since there is a very clear order in the process: environmental factors --> kw_dc --> kw ac.

So let's rearrange the columns of the df to help us interpret in this order.

df
planta inverter_id kw_dc kw_ac kw_dia kw_total sensor_id t_ambiente t_modulo irradiacion mes dia hora minuto eficiencia
fecha
2020-05-15 00:00:00 p1 1BY6WEcLGh8j5v7 0.00 0.00 0.00 6259559.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00 5 15 0 0 0.00
2020-05-15 00:00:00 p1 1IF53ai7Xc0U56Y 0.00 0.00 0.00 6183645.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00 5 15 0 0 0.00
2020-05-15 00:00:00 p1 3PZuoBAID5Wc2HD 0.00 0.00 0.00 6987759.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00 5 15 0 0 0.00
2020-05-15 00:00:00 p1 7JYdWkrLSPkdwr4 0.00 0.00 0.00 7602960.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00 5 15 0 0 0.00
2020-05-15 00:00:00 p1 McdE0feGgRqW7Ca 0.00 0.00 0.00 7158964.00 HmiyD2TTLFNqkNe 25.18 22.86 0.00 5 15 0 0 0.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-17 23:45:00 p2 q49J1IKaHRwDQnt 0.00 0.00 4157.00 520758.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00 6 17 23 45 0.00
2020-06-17 23:45:00 p2 rrq4fwE8jgrTyWY 0.00 0.00 3931.00 121131356.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00 6 17 23 45 0.00
2020-06-17 23:45:00 p2 vOuJvMaM2sgwLmb 0.00 0.00 4322.00 2427691.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00 6 17 23 45 0.00
2020-06-17 23:45:00 p2 xMbIugepa2P7lBB 0.00 0.00 4218.00 106896394.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00 6 17 23 45 0.00
2020-06-17 23:45:00 p2 xoJJ8DcxJEcupym 0.00 0.00 4316.00 209335741.00 iq8k7ZNt4Mwm3w0 23.20 22.54 0.00 6 17 23 45 0.00

136472 rows × 15 columns

orden = ['planta','mes','dia','hora','minuto','sensor_id','irradiacion','t_ambiente','t_modulo','inverter_id','kw_dc','kw_ac','eficiencia','kw_dia','kw_total']
df = df[orden]
df
planta mes dia hora minuto sensor_id irradiacion t_ambiente t_modulo inverter_id kw_dc kw_ac eficiencia kw_dia kw_total
fecha
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 1BY6WEcLGh8j5v7 0.00 0.00 0.00 0.00 6259559.00
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 1IF53ai7Xc0U56Y 0.00 0.00 0.00 0.00 6183645.00
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 3PZuoBAID5Wc2HD 0.00 0.00 0.00 0.00 6987759.00
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 7JYdWkrLSPkdwr4 0.00 0.00 0.00 0.00 7602960.00
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 McdE0feGgRqW7Ca 0.00 0.00 0.00 0.00 7158964.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 q49J1IKaHRwDQnt 0.00 0.00 0.00 4157.00 520758.00
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 rrq4fwE8jgrTyWY 0.00 0.00 0.00 3931.00 121131356.00
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 vOuJvMaM2sgwLmb 0.00 0.00 0.00 4322.00 2427691.00
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 xMbIugepa2P7lBB 0.00 0.00 0.00 4218.00 106896394.00
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 xoJJ8DcxJEcupym 0.00 0.00 0.00 4316.00 209335741.00

136472 rows × 15 columns

DAILY DATAFRAME

The level of analysis at which we have the data is every 15 minutes, which may be too disaggregated for certain analyses.

We are going to build a version of the dataframe added to the day level.

For this we use resample to do downgrading.

We must add by plant and inverter, which are the key fields of our dataset.

Since we have variables to which different aggregation functions apply we can use the dictionary format of agg()

df.head()
planta mes dia hora minuto sensor_id irradiacion t_ambiente t_modulo inverter_id kw_dc kw_ac eficiencia kw_dia kw_total
fecha
2020-05-15 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 1BY6WEcLGh8j5v7 0.00 0.00 0.00 0.00 6259559.00
2020-05-15 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 1IF53ai7Xc0U56Y 0.00 0.00 0.00 0.00 6183645.00
2020-05-15 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 3PZuoBAID5Wc2HD 0.00 0.00 0.00 0.00 6987759.00
2020-05-15 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 7JYdWkrLSPkdwr4 0.00 0.00 0.00 0.00 7602960.00
2020-05-15 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 McdE0feGgRqW7Ca 0.00 0.00 0.00 0.00 7158964.00
df_dia = df.groupby(['planta', 'inverter_id']).resample('D') \
    .agg({'irradiacion': [min,np.mean,max],
          't_ambiente': [min,np.mean,max],
          't_modulo': [min,np.mean,max],
          'kw_dc': [min,np.mean,max,sum],
          'kw_ac': [min,np.mean,max,sum],
          'eficiencia': [min,np.mean,max],
          'kw_dia': max,
          'kw_total': max})

df_dia
irradiacion t_ambiente t_modulo kw_dc kw_ac eficiencia kw_dia kw_total
min mean max min mean max min mean max min mean max sum min mean max sum min mean max max max
planta inverter_id fecha
p1 1BY6WEcLGh8j5v7 2020-05-15 0.00 0.20 0.89 22.04 27.43 34.43 20.29 32.58 55.03 0.00 2530.55 10642.75 235340.70 0.00 247.81 1039.35 23046.55 0.00 5.26 9.82 5754.00 6265313.00
2020-05-16 0.00 0.21 0.81 21.50 26.78 32.52 19.59 31.86 54.23 0.00 2916.25 11209.00 256629.88 0.00 285.51 1095.29 25124.49 0.00 5.56 9.83 6292.00 6271605.00
2020-05-17 0.00 0.24 1.00 21.21 26.69 35.25 20.38 32.74 63.15 0.00 3000.41 11416.43 288039.82 0.00 293.47 1114.81 28172.85 0.00 4.99 9.82 7045.00 6278650.00
2020-05-18 0.00 0.16 0.97 20.96 23.85 28.37 19.48 27.81 53.94 0.00 2125.32 12238.86 204030.30 0.00 208.03 1193.63 19970.51 0.00 4.99 9.83 4998.00 6283648.00
2020-05-19 0.00 0.19 0.84 22.39 25.34 30.37 20.06 29.73 51.85 0.00 2497.61 10854.50 232277.27 0.00 244.53 1059.80 22741.18 0.00 4.63 9.83 6449.00 6290097.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
p2 xoJJ8DcxJEcupym 2020-06-13 0.00 0.22 0.93 22.20 26.12 31.91 20.88 30.39 51.01 0.00 285.87 1271.15 27443.74 0.00 279.59 1239.95 26840.58 0.00 51.91 98.29 6632.00 209312200.00
2020-06-14 0.00 0.23 0.92 23.65 27.02 32.96 22.12 31.59 52.99 0.00 320.51 1362.06 30768.78 0.00 313.43 1328.21 30088.97 0.00 51.94 99.92 7268.00 209319687.00
2020-06-15 0.00 0.19 0.83 24.00 26.56 31.61 23.00 30.00 48.46 0.00 266.64 1288.28 25597.24 0.00 260.93 1256.67 25049.49 0.00 52.94 98.27 7412.67 209325949.00
2020-06-16 0.00 0.17 0.77 23.63 26.37 30.83 22.56 29.59 46.36 0.00 232.66 1124.97 22335.69 0.00 227.82 1098.21 21870.63 0.00 50.90 98.29 6203.20 209331425.00
2020-06-17 0.00 0.12 0.58 22.55 24.85 29.04 21.91 26.67 42.33 0.00 183.75 828.77 17640.42 0.00 179.98 810.77 17278.51 0.00 52.90 98.32 5327.00 209335741.00

1496 rows × 22 columns

It has been generated for us with a multi-index, both in rows and in columns.

To remove the from the columns we can flatten the names with .to_flat_index().

This returns the levels in tuples, which we can then join with a list comprehension.

Let's review how to_flat_index() returns the column names

tuplas = df_dia.columns.to_flat_index()
tuplas
Index([ ('irradiacion', 'min'), ('irradiacion', 'mean'),
        ('irradiacion', 'max'),   ('t_ambiente', 'min'),
        ('t_ambiente', 'mean'),   ('t_ambiente', 'max'),
           ('t_modulo', 'min'),    ('t_modulo', 'mean'),
           ('t_modulo', 'max'),        ('kw_dc', 'min'),
             ('kw_dc', 'mean'),        ('kw_dc', 'max'),
              ('kw_dc', 'sum'),        ('kw_ac', 'min'),
             ('kw_ac', 'mean'),        ('kw_ac', 'max'),
              ('kw_ac', 'sum'),   ('eficiencia', 'min'),
        ('eficiencia', 'mean'),   ('eficiencia', 'max'),
             ('kw_dia', 'max'),     ('kw_total', 'max')],
      dtype='object')

And we join both parts of the pair with an underscore using .join

df_dia.columns = ["_".join(par) for par in tuplas]
df_dia
irradiacion_min irradiacion_mean irradiacion_max t_ambiente_min t_ambiente_mean t_ambiente_max t_modulo_min t_modulo_mean t_modulo_max kw_dc_min kw_dc_mean kw_dc_max kw_dc_sum kw_ac_min kw_ac_mean kw_ac_max kw_ac_sum eficiencia_min eficiencia_mean eficiencia_max kw_dia_max kw_total_max
planta inverter_id fecha
p1 1BY6WEcLGh8j5v7 2020-05-15 0.00 0.20 0.89 22.04 27.43 34.43 20.29 32.58 55.03 0.00 2530.55 10642.75 235340.70 0.00 247.81 1039.35 23046.55 0.00 5.26 9.82 5754.00 6265313.00
2020-05-16 0.00 0.21 0.81 21.50 26.78 32.52 19.59 31.86 54.23 0.00 2916.25 11209.00 256629.88 0.00 285.51 1095.29 25124.49 0.00 5.56 9.83 6292.00 6271605.00
2020-05-17 0.00 0.24 1.00 21.21 26.69 35.25 20.38 32.74 63.15 0.00 3000.41 11416.43 288039.82 0.00 293.47 1114.81 28172.85 0.00 4.99 9.82 7045.00 6278650.00
2020-05-18 0.00 0.16 0.97 20.96 23.85 28.37 19.48 27.81 53.94 0.00 2125.32 12238.86 204030.30 0.00 208.03 1193.63 19970.51 0.00 4.99 9.83 4998.00 6283648.00
2020-05-19 0.00 0.19 0.84 22.39 25.34 30.37 20.06 29.73 51.85 0.00 2497.61 10854.50 232277.27 0.00 244.53 1059.80 22741.18 0.00 4.63 9.83 6449.00 6290097.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
p2 xoJJ8DcxJEcupym 2020-06-13 0.00 0.22 0.93 22.20 26.12 31.91 20.88 30.39 51.01 0.00 285.87 1271.15 27443.74 0.00 279.59 1239.95 26840.58 0.00 51.91 98.29 6632.00 209312200.00
2020-06-14 0.00 0.23 0.92 23.65 27.02 32.96 22.12 31.59 52.99 0.00 320.51 1362.06 30768.78 0.00 313.43 1328.21 30088.97 0.00 51.94 99.92 7268.00 209319687.00
2020-06-15 0.00 0.19 0.83 24.00 26.56 31.61 23.00 30.00 48.46 0.00 266.64 1288.28 25597.24 0.00 260.93 1256.67 25049.49 0.00 52.94 98.27 7412.67 209325949.00
2020-06-16 0.00 0.17 0.77 23.63 26.37 30.83 22.56 29.59 46.36 0.00 232.66 1124.97 22335.69 0.00 227.82 1098.21 21870.63 0.00 50.90 98.29 6203.20 209331425.00
2020-06-17 0.00 0.12 0.58 22.55 24.85 29.04 21.91 26.67 42.33 0.00 183.75 828.77 17640.42 0.00 179.98 810.77 17278.51 0.00 52.90 98.32 5327.00 209335741.00

1496 rows × 22 columns

Now we need to pass plant and inverter_id to columns, and leave the date as the index.

df_dia = df_dia.reset_index().set_index('fecha')
df_dia
planta inverter_id irradiacion_min irradiacion_mean irradiacion_max t_ambiente_min t_ambiente_mean t_ambiente_max t_modulo_min t_modulo_mean t_modulo_max kw_dc_min kw_dc_mean kw_dc_max kw_dc_sum kw_ac_min kw_ac_mean kw_ac_max kw_ac_sum eficiencia_min eficiencia_mean eficiencia_max kw_dia_max kw_total_max
fecha
2020-05-15 p1 1BY6WEcLGh8j5v7 0.00 0.20 0.89 22.04 27.43 34.43 20.29 32.58 55.03 0.00 2530.55 10642.75 235340.70 0.00 247.81 1039.35 23046.55 0.00 5.26 9.82 5754.00 6265313.00
2020-05-16 p1 1BY6WEcLGh8j5v7 0.00 0.21 0.81 21.50 26.78 32.52 19.59 31.86 54.23 0.00 2916.25 11209.00 256629.88 0.00 285.51 1095.29 25124.49 0.00 5.56 9.83 6292.00 6271605.00
2020-05-17 p1 1BY6WEcLGh8j5v7 0.00 0.24 1.00 21.21 26.69 35.25 20.38 32.74 63.15 0.00 3000.41 11416.43 288039.82 0.00 293.47 1114.81 28172.85 0.00 4.99 9.82 7045.00 6278650.00
2020-05-18 p1 1BY6WEcLGh8j5v7 0.00 0.16 0.97 20.96 23.85 28.37 19.48 27.81 53.94 0.00 2125.32 12238.86 204030.30 0.00 208.03 1193.63 19970.51 0.00 4.99 9.83 4998.00 6283648.00
2020-05-19 p1 1BY6WEcLGh8j5v7 0.00 0.19 0.84 22.39 25.34 30.37 20.06 29.73 51.85 0.00 2497.61 10854.50 232277.27 0.00 244.53 1059.80 22741.18 0.00 4.63 9.83 6449.00 6290097.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-13 p2 xoJJ8DcxJEcupym 0.00 0.22 0.93 22.20 26.12 31.91 20.88 30.39 51.01 0.00 285.87 1271.15 27443.74 0.00 279.59 1239.95 26840.58 0.00 51.91 98.29 6632.00 209312200.00
2020-06-14 p2 xoJJ8DcxJEcupym 0.00 0.23 0.92 23.65 27.02 32.96 22.12 31.59 52.99 0.00 320.51 1362.06 30768.78 0.00 313.43 1328.21 30088.97 0.00 51.94 99.92 7268.00 209319687.00
2020-06-15 p2 xoJJ8DcxJEcupym 0.00 0.19 0.83 24.00 26.56 31.61 23.00 30.00 48.46 0.00 266.64 1288.28 25597.24 0.00 260.93 1256.67 25049.49 0.00 52.94 98.27 7412.67 209325949.00
2020-06-16 p2 xoJJ8DcxJEcupym 0.00 0.17 0.77 23.63 26.37 30.83 22.56 29.59 46.36 0.00 232.66 1124.97 22335.69 0.00 227.82 1098.21 21870.63 0.00 50.90 98.29 6203.20 209331425.00
2020-06-17 p2 xoJJ8DcxJEcupym 0.00 0.12 0.58 22.55 24.85 29.04 21.91 26.67 42.33 0.00 183.75 828.77 17640.42 0.00 179.98 810.77 17278.51 0.00 52.90 98.32 5327.00 209335741.00

1496 rows × 24 columns

We already have our hourly and daily datasets ready.

We keep them.

df.to_pickle('df.pickle')
df_dia.to_pickle('df_dia.pickle')

ANALYSIS AND INSIGHTS

SET UP

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


#Automcompletar rápido
%config IPCompleter.greedy=True

#Formato de display
pd.options.display.float_format = '{:15.2f}'.format

#Formato de graficos
import fiser_tools as fs 
fs.misc.dark_theme()

CARGA DE DATOS

df = pd.read_pickle('df.pickle')
df.head()
planta mes dia hora minuto sensor_id irradiacion t_ambiente t_modulo inverter_id kw_dc kw_ac eficiencia kw_dia kw_total
fecha
2020-05-15 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 1BY6WEcLGh8j5v7 0.00 0.00 0.00 0.00 6259559.00
2020-05-15 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 1IF53ai7Xc0U56Y 0.00 0.00 0.00 0.00 6183645.00
2020-05-15 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 3PZuoBAID5Wc2HD 0.00 0.00 0.00 0.00 6987759.00
2020-05-15 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 7JYdWkrLSPkdwr4 0.00 0.00 0.00 0.00 7602960.00
2020-05-15 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 McdE0feGgRqW7Ca 0.00 0.00 0.00 0.00 7158964.00
df_dia = pd.read_pickle('df_dia.pickle')
df_dia.head()
planta inverter_id irradiacion_min irradiacion_mean irradiacion_max t_ambiente_min t_ambiente_mean t_ambiente_max t_modulo_min t_modulo_mean ... kw_dc_sum kw_ac_min kw_ac_mean kw_ac_max kw_ac_sum eficiencia_min eficiencia_mean eficiencia_max kw_dia_max kw_total_max
fecha
2020-05-15 p1 1BY6WEcLGh8j5v7 0.00 0.20 0.89 22.04 27.43 34.43 20.29 32.58 ... 235340.70 0.00 247.81 1039.35 23046.55 0.00 5.26 9.82 5754.00 6265313.00
2020-05-16 p1 1BY6WEcLGh8j5v7 0.00 0.21 0.81 21.50 26.78 32.52 19.59 31.86 ... 256629.88 0.00 285.51 1095.29 25124.49 0.00 5.56 9.83 6292.00 6271605.00
2020-05-17 p1 1BY6WEcLGh8j5v7 0.00 0.24 1.00 21.21 26.69 35.25 20.38 32.74 ... 288039.82 0.00 293.47 1114.81 28172.85 0.00 4.99 9.82 7045.00 6278650.00
2020-05-18 p1 1BY6WEcLGh8j5v7 0.00 0.16 0.97 20.96 23.85 28.37 19.48 27.81 ... 204030.30 0.00 208.03 1193.63 19970.51 0.00 4.99 9.83 4998.00 6283648.00
2020-05-19 p1 1BY6WEcLGh8j5v7 0.00 0.19 0.84 22.39 25.34 30.37 20.06 29.73 ... 232277.27 0.00 244.53 1059.80 22741.18 0.00 4.63 9.83 6449.00 6290097.00

5 rows × 24 columns

ANALYSIS AND INSIGHTS

The first lever is the reception of solar energy.

We have 3 kpis with which to measure this lever: incoming irradiation, ambient temperature, and module temperature.

These kpis are measured with a single sensor per plant, so the data is the same for all inverters.

We need to understand how these variables work with each other before moving on to see how they interact with the next level.

Since the inverter does not matter and we only need those 3 variables, we are going to create a smaller dataset with only one inverter from each plant to work on.

df
planta mes dia hora minuto sensor_id irradiacion t_ambiente t_modulo inverter_id kw_dc kw_ac eficiencia kw_dia kw_total
fecha
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 1BY6WEcLGh8j5v7 0.00 0.00 0.00 0.00 6259559.00
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 1IF53ai7Xc0U56Y 0.00 0.00 0.00 0.00 6183645.00
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 3PZuoBAID5Wc2HD 0.00 0.00 0.00 0.00 6987759.00
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 7JYdWkrLSPkdwr4 0.00 0.00 0.00 0.00 7602960.00
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 McdE0feGgRqW7Ca 0.00 0.00 0.00 0.00 7158964.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 q49J1IKaHRwDQnt 0.00 0.00 0.00 4157.00 520758.00
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 rrq4fwE8jgrTyWY 0.00 0.00 0.00 3931.00 121131356.00
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 vOuJvMaM2sgwLmb 0.00 0.00 0.00 4322.00 2427691.00
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 xMbIugepa2P7lBB 0.00 0.00 0.00 4218.00 106896394.00
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 xoJJ8DcxJEcupym 0.00 0.00 0.00 4316.00 209335741.00

136472 rows × 15 columns

recepcion = df.loc[(df.inverter_id == '1BY6WEcLGh8j5v7') | (df.inverter_id == 'q49J1IKaHRwDQnt'), 'planta':'t_modulo']
recepcion
planta mes dia hora minuto sensor_id irradiacion t_ambiente t_modulo
fecha
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86
2020-05-15 00:15:00 p1 5 15 0 15 HmiyD2TTLFNqkNe 0.00 25.08 22.76
2020-05-15 00:30:00 p1 5 15 0 30 HmiyD2TTLFNqkNe 0.00 24.94 22.59
2020-05-15 00:45:00 p1 5 15 0 45 HmiyD2TTLFNqkNe 0.00 24.85 22.36
2020-05-15 01:00:00 p1 5 15 1 0 HmiyD2TTLFNqkNe 0.00 24.62 22.17
... ... ... ... ... ... ... ... ... ...
2020-06-17 22:45:00 p2 6 17 22 45 iq8k7ZNt4Mwm3w0 0.00 23.51 22.86
2020-06-17 23:00:00 p2 6 17 23 0 iq8k7ZNt4Mwm3w0 0.00 23.48 22.74
2020-06-17 23:15:00 p2 6 17 23 15 iq8k7ZNt4Mwm3w0 0.00 23.35 22.49
2020-06-17 23:30:00 p2 6 17 23 30 iq8k7ZNt4Mwm3w0 0.00 23.29 22.37
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54

6413 rows × 9 columns

Do the two plants receive the same amount of solar energy?

temp = recepcion.groupby('planta').agg({'irradiacion':sum,'t_ambiente':np.mean,'t_modulo':np.mean})
temp
irradiacion t_ambiente t_modulo
planta
p1 726.40 25.56 31.18
p2 758.49 28.07 32.77
f, ax = plt.subplots(nrows=1, ncols=3, figsize = (18,5))

ax[0].bar(temp.index, temp.irradiacion, color = ['red','blue'], alpha = 0.3)
ax[1].bar(temp.index, temp.t_ambiente, color = ['red','blue'], alpha = 0.3)
ax[2].bar(temp.index, temp.t_modulo, color = ['red','blue'], alpha = 0.3)
ax[0].set_title('Irradiación por planta')
ax[1].set_title('Temperatura ambiente por planta')
ax[2].set_title('Temperatura módulo por planta');

png

Conclusions:

  • In general, plant 2 receives more solar energy than plant 1
  • But this difference cannot imply the performance problem that supposedly exists

How are these three variables related?

temp = recepcion.loc[:,['planta','irradiacion','t_ambiente','t_modulo']]
temp
planta irradiacion t_ambiente t_modulo
fecha
2020-05-15 00:00:00 p1 0.00 25.18 22.86
2020-05-15 00:15:00 p1 0.00 25.08 22.76
2020-05-15 00:30:00 p1 0.00 24.94 22.59
2020-05-15 00:45:00 p1 0.00 24.85 22.36
2020-05-15 01:00:00 p1 0.00 24.62 22.17
... ... ... ... ...
2020-06-17 22:45:00 p2 0.00 23.51 22.86
2020-06-17 23:00:00 p2 0.00 23.48 22.74
2020-06-17 23:15:00 p2 0.00 23.35 22.49
2020-06-17 23:30:00 p2 0.00 23.29 22.37
2020-06-17 23:45:00 p2 0.00 23.20 22.54

6413 rows × 4 columns

sns.heatmap(temp.corr(), annot=True);

png

sns.pairplot(temp.reset_index(), hue = 'planta', height=3, plot_kws={'alpha': 0.1});

png

Conclusions:

  • Irradiance highly correlates with module temperature
  • But not so much with room temperature
  • Therefore, a first way to identify defective or dirty modules is to locate those that produce little when the irradiation is high

How is the irradiation and temperature distributed throughout the day?

temp = pd.crosstab(recepcion.hora,recepcion.planta,values = recepcion.irradiacion,aggfunc='mean')
temp
planta p1 p2
hora
0 0.00 0.00
1 0.00 0.00
2 0.00 0.00
3 0.00 0.00
4 0.00 0.00
5 0.00 0.00
6 0.04 0.04
7 0.19 0.19
8 0.37 0.39
9 0.53 0.57
10 0.64 0.69
11 0.73 0.76
12 0.74 0.79
13 0.69 0.69
14 0.58 0.60
15 0.46 0.44
16 0.29 0.28
17 0.13 0.12
18 0.02 0.02
19 0.00 0.00
20 0.00 0.00
21 0.00 0.00
22 0.00 0.00
23 0.00 0.00
plt.figure(figsize=(10,10))
sns.heatmap(temp, annot=True, fmt=".2f");

png

temp = pd.crosstab(recepcion.hora,recepcion.planta,values = recepcion.t_ambiente,aggfunc='mean')
temp
planta p1 p2
hora
0 22.80 25.20
1 22.63 24.82
2 22.46 24.48
3 22.32 24.27
4 22.17 24.07
5 22.06 23.91
6 22.20 24.19
7 23.34 25.48
8 24.92 27.05
9 26.48 28.61
10 27.65 30.17
11 28.80 31.43
12 29.62 32.55
13 30.10 33.01
14 30.29 33.16
15 29.99 32.81
16 29.38 32.44
17 28.29 31.59
18 26.66 30.00
19 25.36 28.61
20 24.50 27.63
21 23.87 26.75
22 23.26 26.06
23 22.92 25.46
plt.figure(figsize=(10,10))
sns.heatmap(temp, annot=True, fmt=".1f");

png

Conclusions:

  • Both plants have similar patterns. We might think that they are in geographical areas not far away
  • There is irradiation (and therefore a priori the plants should produce) between 7 a.m. and 5 p.m.
  • The maximum irradiation occurs between 11 and 12
  • The maximum room temperature occurs between 14 and 16

Are both plants equally capable of generating DC from irradiation?

plt.figure(figsize = (12,8))
sns.scatterplot(data = df, x = df.irradiacion, y = df.kw_dc);

png

There are 2 clearly different patterns. Could it be the plants?

plt.figure(figsize = (12,8))
sns.scatterplot(data = df, x = df.irradiacion, y = df.kw_dc, hue = 'planta');

png

Plant number 2 produces much less kW at the same irradiation levels.

But before we had seen that the relationship between dc and ac in plant 1 was strange.

And also that the data for dc and ac did not match those of kw_dia.

There is something strange in the data.

Let's see the relationship between irradiation and kw_dia to see if it gives us light.

plt.figure(figsize = (12,10))
sns.scatterplot(data = df, x = df.irradiacion, y = df.kw_dia, hue = 'planta');

png

It's very strange. It seems that the relationship is that the more irradiation, the less kW generated. Which doesn't make sense.

It even seems that the kw maximums occur in hours of zero irradiation.

Can you imagine what could be happening?

BEWARE: the variable kw_dia is a CUMULATIVE. That means that it should reach its maximum when the last hour of the day arrives, for example 23:45, where obviously the irradiation is zero.

And not have data until after 7, which is when we see that there is irradiation.

Let's check it out.

df.groupby('hora')[['kw_dia']].mean().plot.bar();

png

Again something doesn't add up. There is generation between 00 and 06.

And also after 6:00 p.m. it begins to decline, which should not happen if it is accumulated.

Conclusion:

We don't trust these cumulative variables like kw_day and kw_total.

But the truth is that we don't trust the others much either.

In a real situation I would stop the project until I was able to see what happens with the data.

But in order to continue we are going to assume that the dc and ac data are correct.

And under this assumption we will obtain our conclusions.

INSIGHT #1

Plant 2 generates much lower levels of DC even at similar levels of irradiation

Is the generation constant throughout the days?

We can use the df_dia to plot the global vision of DC generation during the analysis period.

plt.figure(figsize = (10,8))
sns.lineplot(data = df_dia.reset_index(), x = df_dia.reset_index().fecha, y = 'kw_dc_sum', hue = 'planta');

png

We see that plant 1 has much more variability while plant 2 is much more constant.

But above all we are surprised by the low levels of DC generation on plant 2 compared to 1.

Let's examine the generation of each day to see if we see something strange.

We generate a date variable to be able to add by it.

df['date'] = df.index.date
df
planta mes dia hora minuto sensor_id irradiacion t_ambiente t_modulo inverter_id kw_dc kw_ac eficiencia kw_dia kw_total date
fecha
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 1BY6WEcLGh8j5v7 0.00 0.00 0.00 0.00 6259559.00 2020-05-15
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 1IF53ai7Xc0U56Y 0.00 0.00 0.00 0.00 6183645.00 2020-05-15
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 3PZuoBAID5Wc2HD 0.00 0.00 0.00 0.00 6987759.00 2020-05-15
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 7JYdWkrLSPkdwr4 0.00 0.00 0.00 0.00 7602960.00 2020-05-15
2020-05-15 00:00:00 p1 5 15 0 0 HmiyD2TTLFNqkNe 0.00 25.18 22.86 McdE0feGgRqW7Ca 0.00 0.00 0.00 0.00 7158964.00 2020-05-15
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 q49J1IKaHRwDQnt 0.00 0.00 0.00 4157.00 520758.00 2020-06-17
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 rrq4fwE8jgrTyWY 0.00 0.00 0.00 3931.00 121131356.00 2020-06-17
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 vOuJvMaM2sgwLmb 0.00 0.00 0.00 4322.00 2427691.00 2020-06-17
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 xMbIugepa2P7lBB 0.00 0.00 0.00 4218.00 106896394.00 2020-06-17
2020-06-17 23:45:00 p2 6 17 23 45 iq8k7ZNt4Mwm3w0 0.00 23.20 22.54 xoJJ8DcxJEcupym 0.00 0.00 0.00 4316.00 209335741.00 2020-06-17

136472 rows × 16 columns

We create a temporal dataframe to analyze the hourly DC generation on each day at plant 1.

dc_constante_p1 = df[df.planta == 'p1'].groupby(['planta','date','hora']).kw_dc.sum()
dc_constante_p1
planta  date        hora
p1      2020-05-15  0                 0.00
                    1                 0.00
                    2                 0.00
                    3                 0.00
                    4                 0.00
                                 ...      
        2020-06-17  19                0.00
                    20                0.00
                    21                0.00
                    22                0.00
                    23                0.00
Name: kw_dc, Length: 796, dtype: float64

We create a temporal dataframe to analyze the hourly DC generation on each day at plant 1.

dc_constante_p1.unstack(level = 1).plot(subplots = True, layout = (17,2), sharex=True, figsize=(20,30));

png

Conclusions:

  • On floor 1, similar patterns are maintained every day
  • Except for a break on May 20 and a strange crash on June 5
  • But none appear to be structural
  • Therefore, although each day may have different production totals, the intraday patterns are similar and seem correct.

We repeated the analysis on plant 2

dc_constante_p2 = df[df.planta == 'p2'].groupby(['planta','date','hora']).kw_dc.sum()
dc_constante_p2
planta  date        hora
p2      2020-05-15  0                 0.00
                    1                 0.00
                    2                 0.00
                    3                 0.00
                    4                 0.00
                                 ...      
        2020-06-17  19                0.00
                    20                0.00
                    21                0.00
                    22                0.00
                    23                0.00
Name: kw_dc, Length: 816, dtype: float64

We are going to pass date to columns, to be able to represent each column (which are the dates) as a variable and therefore as an independent graph.

dc_constante_p2.unstack(level = 1).plot(subplots = True, layout = (17,2), sharex=True, figsize=(20,30));

png

Conclusions:

  • Again on May 20 he appears with a strange behavior
  • Production levels are constant over the days, but always about 10 times below plant 1 levels

INSIGHT #2: The low levels of plant 2 are constant and have daily curves that seem normal.

Is the DC to AC conversion generated correctly?

sns.scatterplot(data = df, x = df.kw_dc, y = df.kw_ac, hue = df.planta);

png

Again the patterns are very clear: plant 2 transforms the current much more efficiently.

We are going to expand by analyzing the efficiency variable that we had created.

temp = df.groupby(['planta','hora'],as_index = False).eficiencia.mean()
temp
planta hora eficiencia
0 p1 0 0.00
1 p1 1 0.00
2 p1 2 0.00
3 p1 3 0.00
4 p1 4 0.00
5 p1 5 0.00
6 p1 6 9.20
7 p1 7 9.80
8 p1 8 9.81
9 p1 9 9.78
10 p1 10 9.77
11 p1 11 9.75
12 p1 12 9.71
13 p1 13 9.68
14 p1 14 9.76
15 p1 15 9.79
16 p1 16 9.81
17 p1 17 9.76
18 p1 18 6.53
19 p1 19 0.00
20 p1 20 0.00
21 p1 21 0.00
22 p1 22 0.00
23 p1 23 0.00
24 p2 0 0.00
25 p2 1 0.00
26 p2 2 0.00
27 p2 3 0.00
28 p2 4 0.00
29 p2 5 0.58
30 p2 6 92.93
31 p2 7 97.39
32 p2 8 97.42
33 p2 9 91.70
34 p2 10 76.20
35 p2 11 68.35
36 p2 12 68.16
37 p2 13 73.04
38 p2 14 83.43
39 p2 15 95.69
40 p2 16 95.85
41 p2 17 95.56
42 p2 18 73.52
43 p2 19 0.00
44 p2 20 0.00
45 p2 21 0.00
46 p2 22 0.00
47 p2 23 0.00
sns.lineplot(data = temp, x = 'hora', y = 'eficiencia', hue = 'planta');

png

INSIGHT #3

Plant 1 has a very low capacity to transform DC to AC, which suggests problems with the inverters

Other conclusions:

  • Go into the details of the inverters on floor 1, to see if they are all or there are some that bias the average
  • Review why plant 2 loses efficiency during the hours of most irradiation

We are going to start with the second, comparing the production of DC with that of AC in plant 2.

temp = df[['planta','hora','kw_dc','kw_ac']].melt(id_vars= ['planta','hora'])
temp
planta hora variable value
0 p1 0 kw_dc 0.00
1 p1 0 kw_dc 0.00
2 p1 0 kw_dc 0.00
3 p1 0 kw_dc 0.00
4 p1 0 kw_dc 0.00
... ... ... ... ...
272939 p2 23 kw_ac 0.00
272940 p2 23 kw_ac 0.00
272941 p2 23 kw_ac 0.00
272942 p2 23 kw_ac 0.00
272943 p2 23 kw_ac 0.00

272944 rows × 4 columns

plt.figure(figsize = (12,8))
sns.lineplot(data = temp[temp.planta == 'p2'], x = 'hora', y = 'value', hue = 'variable', ci = False);

png

We see that indeed in the central hours there is a loss of efficiency. But nowhere near the level of loss that we had seen in the previous analysis.

We are going to analyze the distribution of efficiency in those hours.

temp = df.between_time('08:00:00','15:00:00')
temp = temp[temp.planta == 'p2']
temp.eficiencia.plot.density();

png

There is a data set with zero efficiency, and that is what causes the problem. But what is the cause of that zero efficiency?

We are going to select those cases and review them.

temp[temp.kw_dc == 0]
planta mes dia hora minuto sensor_id irradiacion t_ambiente t_modulo inverter_id kw_dc kw_ac eficiencia kw_dia kw_total date
fecha
2020-05-15 09:45:00 p2 5 15 9 45 iq8k7ZNt4Mwm3w0 0.80 31.38 45.72 81aHJ1q11NBPMrL 0.00 0.00 0.00 1645.00 1215280381.00 2020-05-15
2020-05-15 09:45:00 p2 5 15 9 45 iq8k7ZNt4Mwm3w0 0.80 31.38 45.72 Et9kgGMDl729KT4 0.00 0.00 0.00 1541.00 1705791.00 2020-05-15
2020-05-15 09:45:00 p2 5 15 9 45 iq8k7ZNt4Mwm3w0 0.80 31.38 45.72 Quc1TzYxW2pYoWX 0.00 0.00 0.00 1324.00 329510409.00 2020-05-15
2020-05-15 09:45:00 p2 5 15 9 45 iq8k7ZNt4Mwm3w0 0.80 31.38 45.72 xoJJ8DcxJEcupym 0.00 0.00 0.00 1735.00 209145328.00 2020-05-15
2020-05-15 10:00:00 p2 5 15 10 0 iq8k7ZNt4Mwm3w0 0.83 31.89 46.13 81aHJ1q11NBPMrL 0.00 0.00 0.00 1645.00 1215280381.00 2020-05-15
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-16 14:30:00 p2 6 16 14 30 iq8k7ZNt4Mwm3w0 0.56 30.83 44.38 q49J1IKaHRwDQnt 0.00 0.00 0.00 0.00 34379.33 2020-06-16
2020-06-16 14:30:00 p2 6 16 14 30 iq8k7ZNt4Mwm3w0 0.56 30.83 44.38 rrq4fwE8jgrTyWY 0.00 0.00 0.00 0.00 8075096.33 2020-06-16
2020-06-16 14:30:00 p2 6 16 14 30 iq8k7ZNt4Mwm3w0 0.56 30.83 44.38 vOuJvMaM2sgwLmb 0.00 0.00 0.00 0.00 173022.79 2020-06-16
2020-06-16 14:30:00 p2 6 16 14 30 iq8k7ZNt4Mwm3w0 0.56 30.83 44.38 xMbIugepa2P7lBB 0.00 0.00 0.00 0.00 7126074.93 2020-06-16
2020-06-16 14:30:00 p2 6 16 14 30 iq8k7ZNt4Mwm3w0 0.56 30.83 44.38 xoJJ8DcxJEcupym 0.00 0.00 0.00 0.00 13955359.33 2020-06-16

3667 rows × 16 columns

Parece que no es problema del inverter, si no de que en esos momentos no se ha generado DC.

Vamos a poner la condición de que DC > 0 y ver ahí cual es la eficiencia.

temp[temp.kw_dc > 0].eficiencia.plot.density();

png

Indeed when there is DC the efficiency is higher than 96%.

The question then is why is there no DC? Is there a pattern?

Let's create a DC = 0 flag so we can analyze it.

temp['kw_dc_cero'] = np.where(temp['kw_dc'] == 0, 1, 0)
temp
planta mes dia hora minuto sensor_id irradiacion t_ambiente t_modulo inverter_id kw_dc kw_ac eficiencia kw_dia kw_total date kw_dc_cero
fecha
2020-05-15 08:00:00 p2 5 15 8 0 iq8k7ZNt4Mwm3w0 0.46 27.68 35.71 4UPUqMRk7TRMgml 581.05 569.41 98.00 554.00 2429565.00 2020-05-15 0
2020-05-15 08:00:00 p2 5 15 8 0 iq8k7ZNt4Mwm3w0 0.46 27.68 35.71 81aHJ1q11NBPMrL 534.67 524.09 98.02 516.60 1215279252.60 2020-05-15 0
2020-05-15 08:00:00 p2 5 15 8 0 iq8k7ZNt4Mwm3w0 0.46 27.68 35.71 9kRcWv60rDACzjR 568.53 557.16 98.00 551.80 2247720128.80 2020-05-15 0
2020-05-15 08:00:00 p2 5 15 8 0 iq8k7ZNt4Mwm3w0 0.46 27.68 35.71 Et9kgGMDl729KT4 526.24 515.83 98.02 503.00 1704753.00 2020-05-15 0
2020-05-15 08:00:00 p2 5 15 8 0 iq8k7ZNt4Mwm3w0 0.46 27.68 35.71 IQ2d7wF4YD8zU1Q 578.17 566.58 97.99 552.79 19942078.79 2020-05-15 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-17 15:00:00 p2 6 17 15 0 iq8k7ZNt4Mwm3w0 0.36 27.23 29.35 q49J1IKaHRwDQnt 562.34 551.13 98.01 3380.87 519981.87 2020-06-17 0
2020-06-17 15:00:00 p2 6 17 15 0 iq8k7ZNt4Mwm3w0 0.36 27.23 29.35 rrq4fwE8jgrTyWY 534.27 523.71 98.02 3190.40 121130615.40 2020-06-17 0
2020-06-17 15:00:00 p2 6 17 15 0 iq8k7ZNt4Mwm3w0 0.36 27.23 29.35 vOuJvMaM2sgwLmb 578.85 567.15 97.98 3534.07 2426903.07 2020-06-17 0
2020-06-17 15:00:00 p2 6 17 15 0 iq8k7ZNt4Mwm3w0 0.36 27.23 29.35 xMbIugepa2P7lBB 568.86 557.49 98.00 3433.50 106895609.50 2020-06-17 0
2020-06-17 15:00:00 p2 6 17 15 0 iq8k7ZNt4Mwm3w0 0.36 27.23 29.35 xoJJ8DcxJEcupym 578.49 566.90 98.00 3534.67 209334959.67 2020-06-17 0

20376 rows × 17 columns

We will start with the numerical variables.

temp.groupby('kw_dc_cero')[['irradiacion','t_ambiente','t_modulo']].mean()
irradiacion t_ambiente t_modulo
kw_dc_cero
0 0.59 30.48 44.32
1 0.80 32.48 52.03

At room temperature there is not much difference, but at module temperature and irradiation yes.

Could it be that if it gets too hot the module stops generating DC?

Let's see it by comparing the module temperature with the DC generation.

sns.scatterplot(data = temp, x = 't_modulo', y = 'kw_dc',hue = 'kw_dc_cero');

png

The previous hypothesis is not confirmed, since there are many cases of high temperatures where DC is generated, and also of kw_dc equal to zero in almost all temperature ranges.

We are now going to analyze the categorical ones, starting with the inverter.

temp.groupby('inverter_id').kw_dc_cero.mean().sort_values(ascending = False).plot.bar();

png

There is a big difference in the percentage of zero DC production per inverter.

From some that have less than 5% to some that exceed 30%.

INSIGHT #4:: On plant 2 there are several inverters that are not getting enough DC production, and therefore whose modules need revision.

We are going to analyze the inverters from the point of view of the average efficiency to see if there are "good and bad".

temp[temp.kw_dc > 0].groupby(['inverter_id','date'],as_index = False).eficiencia.mean().boxplot(column = 'eficiencia', by = 'inverter_id', figsize = (14,10))
plt.xticks(rotation = 90);

png

INSIGHT #5:: Once discounting the problem of not generating DC, the inverters of plant 2 do work well and do the job of transformation to AC well.

To finish analyzing the efficiency of the inverters, we can see their performance on each of the days to see if there may have been specific problems.

temp[temp.kw_dc > 0].groupby(['inverter_id','date']).eficiencia.mean().unstack(level = 0).plot(subplots = True, sharex=True, figsize=(20,40))
plt.xticks(rotation = 90);

png

To have a term of comparison we are going to repeat the analyzes with plant 1.

temp = df.between_time('08:00:00','15:00:00')
temp = temp[temp.planta == 'p1']
temp['kw_dc_cero'] = np.where(temp['kw_dc'] == 0, 1, 0)
temp
planta mes dia hora minuto sensor_id irradiacion t_ambiente t_modulo inverter_id kw_dc kw_ac eficiencia kw_dia kw_total date kw_dc_cero
fecha
2020-05-15 08:00:00 p1 5 15 8 0 HmiyD2TTLFNqkNe 0.20 25.42 31.41 1BY6WEcLGh8j5v7 3246.00 318.67 9.82 263.57 6259822.57 2020-05-15 0
2020-05-15 08:00:00 p1 5 15 8 0 HmiyD2TTLFNqkNe 0.20 25.42 31.41 1IF53ai7Xc0U56Y 2805.62 275.46 9.82 292.50 6183937.50 2020-05-15 0
2020-05-15 08:00:00 p1 5 15 8 0 HmiyD2TTLFNqkNe 0.20 25.42 31.41 3PZuoBAID5Wc2HD 2736.12 268.62 9.82 287.75 6988046.75 2020-05-15 0
2020-05-15 08:00:00 p1 5 15 8 0 HmiyD2TTLFNqkNe 0.20 25.42 31.41 7JYdWkrLSPkdwr4 2741.50 269.15 9.82 281.00 7603241.00 2020-05-15 0
2020-05-15 08:00:00 p1 5 15 8 0 HmiyD2TTLFNqkNe 0.20 25.42 31.41 McdE0feGgRqW7Ca 3208.88 315.05 9.82 291.00 7159255.00 2020-05-15 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-17 15:00:00 p1 6 17 15 0 HmiyD2TTLFNqkNe 0.32 28.62 39.95 uHbuxQJl8lW7ozc 4719.62 462.89 9.81 5423.50 7286458.50 2020-06-17 0
2020-06-17 15:00:00 p1 6 17 15 0 HmiyD2TTLFNqkNe 0.32 28.62 39.95 wCURE6d3bPkepu2 5077.75 497.80 9.80 5343.75 7028061.75 2020-06-17 0
2020-06-17 15:00:00 p1 6 17 15 0 HmiyD2TTLFNqkNe 0.32 28.62 39.95 z9Y9gH1T5YWrNuG 5113.00 501.29 9.80 5282.88 7250667.88 2020-06-17 0
2020-06-17 15:00:00 p1 6 17 15 0 HmiyD2TTLFNqkNe 0.32 28.62 39.95 zBIq5rxdHJRwDNY 4675.38 458.61 9.81 5284.75 6582836.75 2020-06-17 0
2020-06-17 15:00:00 p1 6 17 15 0 HmiyD2TTLFNqkNe 0.32 28.62 39.95 zVJPv84UY57bAof 4853.75 475.96 9.81 5368.75 7362730.75 2020-06-17 0

21450 rows × 17 columns

temp.eficiencia.plot.density();

png

We see that no, here all the inverters have a constant efficiency (although very low)

temp.groupby(['inverter_id','date'],as_index = False).eficiencia.mean().boxplot(column = 'eficiencia', by = 'inverter_id', figsize = (14,10))
plt.xticks(rotation = 90);

png

We see that except for specific days in some inverters, in the rest the efficiency is constant.

We are going to review the average daily efficiency for each inverter.

temp.groupby(['inverter_id','date']).eficiencia.mean().unstack(level = 0).plot(subplots = True, sharex=True, figsize=(20,40))
plt.xticks(rotation = 90);

png

In the inverter analysis we see again that all the data are constant.

We are going to verify that then there are no failures in the generation of DC.

temp.groupby('inverter_id').kw_dc_cero.mean().sort_values(ascending = False).plot.bar();

png

We see that although there are some inverters that have had failures, their magnitude is less than 2% of the measurements.

Therefore the generation of DC in plant 1 is correct, and the failure is in the transformation from DC to AC.

CONCLUSIONS

After an analysis of the data we can conclude that:

  • There are serious data quality problems. It should be reviewed in which part of the chain these problems are generated, including the meters of the plants.
  • The fact that the DC generation is about 10 times higher in plant 1 than in plant 2, added to the fact that the efficiency in plant 1 is over 10% leads us to think that the DC generation data on floor 1 it may be artificially scaled for some reason.
  • But for now, in the absence of verification, we will assume that the data is correct.
  • The two plants have received high amounts of irradiation, we have not located any problem at this stage
  • Although the ambient temperature is higher on floor 2 and its modules get hotter than those on floor 1, this does not seem to have a significant impact
  • Plant 1 DC generation works fine, the modules seem to bring DC to the inverters.
  • The DC generation of plant 2 does NOT work well, some modules carry very little DC to the inverters even in the hours of greatest irradiation.
  • The transformation from DC to AC of floor 1 does NOT work well, it only transforms around 10%, yes, constantly. And this low efficiency is not due to moments of non-reception of DC nor is it concentrated in specific inverters, but rather it seems more structural (again keep in mind that it could be due to a data quality problem in kw_dc of plant 1
  • The transformation from DC to AC of plant 2 works well, since once the periods of zero DC generation are eliminated, the rest have an efficiency greater than 97%

Recommendations:

  • Review the data collection and its reliability
  • Maintenance check on the inverter modules of plant 2 in which there are many moments of zero DC generation
  • Maintenance review of the inverters of Plant 1