In this case we will be working for a photovoltaic solar power generation company.
Anomalous behaviors have been detected in 2 of the plants and the maintenance subcontractor is unable to identify the reason.
Before deploying a team of engineers, they ask the data science team to analyze the data from the sensors and meters to see if we can detect the problem.
In this case, among other things, we are going to learn:
Therefore, much of what we learn here is generally applicable to industry and IoT projects:
Analyze the available data to try to intuit where the problems may be and whether or not it is necessary to send a team of engineers to the plants.
In this type of project in which there is a clear process, the most IMPORTANT part is to know and understand that process.
Let's see, for example, how in this case, which seems easy a priori due to the apparent simplicity of the data, if we do not design the project guided by the process, we could get into an infinite loop of analysis without getting anywhere.
Once we understand how the business and the process work, the levers will come out on their own.
How does a photovoltaic solar power plant work?
Therefore, the levers that influence the business objective (in this case, generate AC current) are:
To determine the entities it is necessary to know what a solar plant is composed of.
The minimum unit is the cell, it is there where the generation of energy by reaction with the photons of the sun takes place.
The cells are encapsulated in "rectangles" called modules.
Several modules form a panel.
The panels are arranged in rows called arrays.
An inverter receives direct current from several arrays.
A plant can have several inverters.
There are also the meters and sensors, which may be one or more.
In our case, the entities that we have in the granularity of the data are:
This conditions that we will be able to know, for example, if an inverter in a plant has lower performance than expected, but we will not know which array, panel or module may be causing it.
Having understood the levers, kpis and entities, we can now ask the seed questions:
About irradiation:
About the plants:
About the DC generation:
About AC generation:
About meters and sensors:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#Automcompletar rápido
%config IPCompleter.greedy=True
import fiser_tools as fs
fs.misc.dark_theme()
This case consists of 4 files:
p1g = pd.read_csv('Plant_1_Generation_Data.csv')
p1g
DATE_TIME | PLANT_ID | SOURCE_KEY | DC_POWER | AC_POWER | DAILY_YIELD | TOTAL_YIELD | |
---|---|---|---|---|---|---|---|
0 | 15-05-2020 00:00 | 4135001 | 1BY6WEcLGh8j5v7 | 0.0 | 0.0 | 0.000 | 6259559.0 |
1 | 15-05-2020 00:00 | 4135001 | 1IF53ai7Xc0U56Y | 0.0 | 0.0 | 0.000 | 6183645.0 |
2 | 15-05-2020 00:00 | 4135001 | 3PZuoBAID5Wc2HD | 0.0 | 0.0 | 0.000 | 6987759.0 |
3 | 15-05-2020 00:00 | 4135001 | 7JYdWkrLSPkdwr4 | 0.0 | 0.0 | 0.000 | 7602960.0 |
4 | 15-05-2020 00:00 | 4135001 | McdE0feGgRqW7Ca | 0.0 | 0.0 | 0.000 | 7158964.0 |
... | ... | ... | ... | ... | ... | ... | ... |
68773 | 17-06-2020 23:45 | 4135001 | uHbuxQJl8lW7ozc | 0.0 | 0.0 | 5967.000 | 7287002.0 |
68774 | 17-06-2020 23:45 | 4135001 | wCURE6d3bPkepu2 | 0.0 | 0.0 | 5147.625 | 7028601.0 |
68775 | 17-06-2020 23:45 | 4135001 | z9Y9gH1T5YWrNuG | 0.0 | 0.0 | 5819.000 | 7251204.0 |
68776 | 17-06-2020 23:45 | 4135001 | zBIq5rxdHJRwDNY | 0.0 | 0.0 | 5817.000 | 6583369.0 |
68777 | 17-06-2020 23:45 | 4135001 | zVJPv84UY57bAof | 0.0 | 0.0 | 5910.000 | 7363272.0 |
68778 rows × 7 columns
p1w = pd.read_csv('Plant_1_Weather_Sensor_Data.csv')
p1w
DATE_TIME | PLANT_ID | SOURCE_KEY | AMBIENT_TEMPERATURE | MODULE_TEMPERATURE | IRRADIATION | |
---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | 4135001 | HmiyD2TTLFNqkNe | 25.184316 | 22.857507 | 0.0 |
1 | 2020-05-15 00:15:00 | 4135001 | HmiyD2TTLFNqkNe | 25.084589 | 22.761668 | 0.0 |
2 | 2020-05-15 00:30:00 | 4135001 | HmiyD2TTLFNqkNe | 24.935753 | 22.592306 | 0.0 |
3 | 2020-05-15 00:45:00 | 4135001 | HmiyD2TTLFNqkNe | 24.846130 | 22.360852 | 0.0 |
4 | 2020-05-15 01:00:00 | 4135001 | HmiyD2TTLFNqkNe | 24.621525 | 22.165423 | 0.0 |
... | ... | ... | ... | ... | ... | ... |
3177 | 2020-06-17 22:45:00 | 4135001 | HmiyD2TTLFNqkNe | 22.150570 | 21.480377 | 0.0 |
3178 | 2020-06-17 23:00:00 | 4135001 | HmiyD2TTLFNqkNe | 22.129816 | 21.389024 | 0.0 |
3179 | 2020-06-17 23:15:00 | 4135001 | HmiyD2TTLFNqkNe | 22.008275 | 20.709211 | 0.0 |
3180 | 2020-06-17 23:30:00 | 4135001 | HmiyD2TTLFNqkNe | 21.969495 | 20.734963 | 0.0 |
3181 | 2020-06-17 23:45:00 | 4135001 | HmiyD2TTLFNqkNe | 21.909288 | 20.427972 | 0.0 |
3182 rows × 6 columns
p2g = pd.read_csv('Plant_2_Generation_Data.csv')
p2g
DATE_TIME | PLANT_ID | SOURCE_KEY | DC_POWER | AC_POWER | DAILY_YIELD | TOTAL_YIELD | |
---|---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | 4136001 | 4UPUqMRk7TRMgml | 0.0 | 0.0 | 9425.000000 | 2.429011e+06 |
1 | 2020-05-15 00:00:00 | 4136001 | 81aHJ1q11NBPMrL | 0.0 | 0.0 | 0.000000 | 1.215279e+09 |
2 | 2020-05-15 00:00:00 | 4136001 | 9kRcWv60rDACzjR | 0.0 | 0.0 | 3075.333333 | 2.247720e+09 |
3 | 2020-05-15 00:00:00 | 4136001 | Et9kgGMDl729KT4 | 0.0 | 0.0 | 269.933333 | 1.704250e+06 |
4 | 2020-05-15 00:00:00 | 4136001 | IQ2d7wF4YD8zU1Q | 0.0 | 0.0 | 3177.000000 | 1.994153e+07 |
... | ... | ... | ... | ... | ... | ... | ... |
67693 | 2020-06-17 23:45:00 | 4136001 | q49J1IKaHRwDQnt | 0.0 | 0.0 | 4157.000000 | 5.207580e+05 |
67694 | 2020-06-17 23:45:00 | 4136001 | rrq4fwE8jgrTyWY | 0.0 | 0.0 | 3931.000000 | 1.211314e+08 |
67695 | 2020-06-17 23:45:00 | 4136001 | vOuJvMaM2sgwLmb | 0.0 | 0.0 | 4322.000000 | 2.427691e+06 |
67696 | 2020-06-17 23:45:00 | 4136001 | xMbIugepa2P7lBB | 0.0 | 0.0 | 4218.000000 | 1.068964e+08 |
67697 | 2020-06-17 23:45:00 | 4136001 | xoJJ8DcxJEcupym | 0.0 | 0.0 | 4316.000000 | 2.093357e+08 |
67698 rows × 7 columns
p2w = pd.read_csv('Plant_2_Weather_Sensor_Data.csv')
p2w
DATE_TIME | PLANT_ID | SOURCE_KEY | AMBIENT_TEMPERATURE | MODULE_TEMPERATURE | IRRADIATION | |
---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 27.004764 | 25.060789 | 0.0 |
1 | 2020-05-15 00:15:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.880811 | 24.421869 | 0.0 |
2 | 2020-05-15 00:30:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.682055 | 24.427290 | 0.0 |
3 | 2020-05-15 00:45:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.500589 | 24.420678 | 0.0 |
4 | 2020-05-15 01:00:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.596148 | 25.088210 | 0.0 |
... | ... | ... | ... | ... | ... | ... |
3254 | 2020-06-17 22:45:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 23.511703 | 22.856201 | 0.0 |
3255 | 2020-06-17 23:00:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 23.482282 | 22.744190 | 0.0 |
3256 | 2020-06-17 23:15:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 23.354743 | 22.492245 | 0.0 |
3257 | 2020-06-17 23:30:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 23.291048 | 22.373909 | 0.0 |
3258 | 2020-06-17 23:45:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 23.202871 | 22.535908 | 0.0 |
3259 rows × 6 columns
We start with the overview.
p1g.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68778 entries, 0 to 68777
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE_TIME 68778 non-null object
1 PLANT_ID 68778 non-null int64
2 SOURCE_KEY 68778 non-null object
3 DC_POWER 68778 non-null float64
4 AC_POWER 68778 non-null float64
5 DAILY_YIELD 68778 non-null float64
6 TOTAL_YIELD 68778 non-null float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.7+ MB
We see that there are no nulls.
We see that DATE_TIME is as object.
We convert DATE_TIME to type datetime.
p1g['DATE_TIME'] = pd.to_datetime(p1g.DATE_TIME,dayfirst=True)
We reviewed a sample of data.
p1g.head()
DATE_TIME | PLANT_ID | SOURCE_KEY | DC_POWER | AC_POWER | DAILY_YIELD | TOTAL_YIELD | |
---|---|---|---|---|---|---|---|
0 | 2020-05-15 | 4135001 | 1BY6WEcLGh8j5v7 | 0.0 | 0.0 | 0.0 | 6259559.0 |
1 | 2020-05-15 | 4135001 | 1IF53ai7Xc0U56Y | 0.0 | 0.0 | 0.0 | 6183645.0 |
2 | 2020-05-15 | 4135001 | 3PZuoBAID5Wc2HD | 0.0 | 0.0 | 0.0 | 6987759.0 |
3 | 2020-05-15 | 4135001 | 7JYdWkrLSPkdwr4 | 0.0 | 0.0 | 0.0 | 7602960.0 |
4 | 2020-05-15 | 4135001 | McdE0feGgRqW7Ca | 0.0 | 0.0 | 0.0 | 7158964.0 |
We check that the plant identifier is unique.
p1g.PLANT_ID.unique()
array([4135001], dtype=int64)
Let's replace it with a more readable literal.
p1g['PLANT_ID'] = p1g.PLANT_ID.replace(4135001, 'p1')
We review the descriptions.
p1g.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
DC_POWER | 68778.0 | 3.147426e+03 | 4036.457169 | 0.0 | 0.000000e+00 | 4.290000e+02 | 6.366964e+03 | 14471.125 |
AC_POWER | 68778.0 | 3.078028e+02 | 394.396439 | 0.0 | 0.000000e+00 | 4.149375e+01 | 6.236187e+02 | 1410.950 |
DAILY_YIELD | 68778.0 | 3.295969e+03 | 3145.178309 | 0.0 | 0.000000e+00 | 2.658714e+03 | 6.274000e+03 | 9163.000 |
TOTAL_YIELD | 68778.0 | 6.978712e+06 | 416271.982856 | 6183645.0 | 6.512003e+06 | 7.146685e+06 | 7.268706e+06 | 7846821.000 |
Let's remove the scientific notation display.
pd.options.display.float_format = '{:15.2f}'.format
p1g.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
DC_POWER | 68778.00 | 3147.43 | 4036.46 | 0.00 | 0.00 | 429.00 | 6366.96 | 14471.12 |
AC_POWER | 68778.00 | 307.80 | 394.40 | 0.00 | 0.00 | 41.49 | 623.62 | 1410.95 |
DAILY_YIELD | 68778.00 | 3295.97 | 3145.18 | 0.00 | 0.00 | 2658.71 | 6274.00 | 9163.00 |
TOTAL_YIELD | 68778.00 | 6978711.76 | 416271.98 | 6183645.00 | 6512002.54 | 7146685.00 | 7268705.91 | 7846821.00 |
The difference in means between DC and AC is strange.
Let's visualize it.
p1g[['DC_POWER','AC_POWER']].plot(figsize = (16,12));
The diference is huge.
First we are going to check if they go in the same direction even if it is on a different scale (with a correlation), and then we are going to check what is the average ratio between both measures.
p1g.DC_POWER.corr(p1g.AC_POWER)
0.9999962553331414
(p1g.DC_POWER / p1g.AC_POWER).describe()
count 36827.00
mean 10.23
std 0.05
min 9.38
25% 10.20
50% 10.22
75% 10.25
max 10.47
dtype: float64
It seems that the Inverters are transforming only 10% from DC to AC, which a priori is very low.
In any case, from quality we got here and we will continue exploring this in the analysis part and comparing it with Plant 2 to see if the same thing happens.
We analyze the categorical variable, which is the identifier of the inverters.
p1g.SOURCE_KEY.nunique()
22
p1g.SOURCE_KEY.value_counts()
bvBOhCH3iADSZry 3155
1BY6WEcLGh8j5v7 3154
7JYdWkrLSPkdwr4 3133
VHMLBKoKgIrUVDU 3133
ZnxXDlPa8U1GXgE 3130
ih0vzX44oOqAx2f 3130
z9Y9gH1T5YWrNuG 3126
wCURE6d3bPkepu2 3126
uHbuxQJl8lW7ozc 3125
pkci93gMrogZuBj 3125
iCRJl6heRkivqQ3 3125
rGa61gmuvPhdLxV 3124
sjndEbLyjtCKgGv 3124
McdE0feGgRqW7Ca 3124
zVJPv84UY57bAof 3124
ZoEaEvLYb1n2sOq 3123
1IF53ai7Xc0U56Y 3119
adLQvlD726eNBSB 3119
zBIq5rxdHJRwDNY 3119
WRmjgnKYAwPKWDb 3118
3PZuoBAID5Wc2HD 3118
YxYtjZvoooNbGkE 3104
Name: SOURCE_KEY, dtype: int64
Conclusions:
* plant 1 has 22 inverters
* All have a similar number of measures although not exactly the same
* They could be stops due to maintenance, or simple data loss, but we write it down for the analysis phase
We are going to analyze the DAILY_YIELD variables, since the metadata tells us that the TOTAL_YIELD variable is the accumulated total per inverter, but in DAILY_YIELD it does not specify it, so we do not know if it is accumulated per inverter or per plant.
The hypothesis is the following: if it is per plant, there should be no differences between the data from the different inverters at the same specific moment.
And therefore if we see that there are differences then it is that the data is due to the inverter.
To verify it, it is useful to take a sample of inverters.
seleccion = list(p1g.SOURCE_KEY.unique()[:5])
temp = p1g[p1g.SOURCE_KEY.isin(seleccion)].set_index('DATE_TIME')
temp
PLANT_ID | SOURCE_KEY | DC_POWER | AC_POWER | DAILY_YIELD | TOTAL_YIELD | |
---|---|---|---|---|---|---|
DATE_TIME | ||||||
2020-05-15 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 6259559.00 |
2020-05-15 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6183645.00 |
2020-05-15 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 6987759.00 |
2020-05-15 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7602960.00 |
2020-05-15 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7158964.00 |
... | ... | ... | ... | ... | ... | ... |
2020-06-17 23:45:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 5521.00 | 6485319.00 |
2020-06-17 23:45:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 6034.00 | 6433566.00 |
2020-06-17 23:45:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 6052.00 | 7237425.00 |
2020-06-17 23:45:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 5856.00 | 7846821.00 |
2020-06-17 23:45:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 5992.00 | 7408587.00 |
15648 rows × 6 columns
In the data we already see that it is different, but we are going to check on more data so that it is not an effect of those specific records.
We are going to see it graphically, and for simplicity we are going to take only a sample of days.
Since we have the date as index we remember that we can use partial and slice indexing.
temp = temp.loc['2020-06-01':'2020-06-05']
temp
PLANT_ID | SOURCE_KEY | DC_POWER | AC_POWER | DAILY_YIELD | TOTAL_YIELD | |
---|---|---|---|---|---|---|
DATE_TIME | ||||||
2020-06-01 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 829.00 | 6377931.00 |
2020-06-01 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6311432.00 |
2020-06-01 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 7115304.00 |
2020-06-01 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 730.38 | 7727821.00 |
2020-06-01 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7286760.00 |
... | ... | ... | ... | ... | ... | ... |
2020-06-05 23:45:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 7162.00 | 6412542.00 |
2020-06-05 23:45:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6348557.00 |
2020-06-05 23:45:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 7152486.00 |
2020-06-05 23:45:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7764140.00 |
2020-06-05 23:45:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 4950.00 | 7324681.00 |
2370 rows × 6 columns
plt.figure(figsize = (16,12))
sns.lineplot(data = temp.reset_index(), x = temp.reset_index().DATE_TIME, y = 'DAILY_YIELD', hue = 'SOURCE_KEY');
Definitely different inverters have different data at the same time, so we conclude that this variable is per inverter
Finally we are going to analyze the period in which we have data and if the number of daily measurements is constant.
p1g.DATE_TIME.dt.date.value_counts().sort_index().plot.bar(figsize = (12,8));
Conclusions:
* The data period is between May 15, 2020 and June 17, 2020
* We have data for every day, there are no intermediate days missing
* But some days like 05/21 or 05/29 have fewer measurements
* So it doesn't look 100% regular
p1w.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3182 entries, 0 to 3181
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE_TIME 3182 non-null object
1 PLANT_ID 3182 non-null int64
2 SOURCE_KEY 3182 non-null object
3 AMBIENT_TEMPERATURE 3182 non-null float64
4 MODULE_TEMPERATURE 3182 non-null float64
5 IRRADIATION 3182 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 149.3+ KB
We correct the type of DATE_TIME
p1w.DATE_TIME = pd.to_datetime(p1w.DATE_TIME)
p1w.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3182 entries, 0 to 3181
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE_TIME 3182 non-null datetime64[ns]
1 PLANT_ID 3182 non-null int64
2 SOURCE_KEY 3182 non-null object
3 AMBIENT_TEMPERATURE 3182 non-null float64
4 MODULE_TEMPERATURE 3182 non-null float64
5 IRRADIATION 3182 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 149.3+ KB
p1w.head()
DATE_TIME | PLANT_ID | SOURCE_KEY | AMBIENT_TEMPERATURE | MODULE_TEMPERATURE | IRRADIATION | |
---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | 4135001 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
1 | 2020-05-15 00:15:00 | 4135001 | HmiyD2TTLFNqkNe | 25.08 | 22.76 | 0.00 |
2 | 2020-05-15 00:30:00 | 4135001 | HmiyD2TTLFNqkNe | 24.94 | 22.59 | 0.00 |
3 | 2020-05-15 00:45:00 | 4135001 | HmiyD2TTLFNqkNe | 24.85 | 22.36 | 0.00 |
4 | 2020-05-15 01:00:00 | 4135001 | HmiyD2TTLFNqkNe | 24.62 | 22.17 | 0.00 |
We replaced the name of the plant.
p1w['PLANT_ID'] = p1w.PLANT_ID.replace(4135001,'p1')
p1w
DATE_TIME | PLANT_ID | SOURCE_KEY | AMBIENT_TEMPERATURE | MODULE_TEMPERATURE | IRRADIATION | |
---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | p1 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
1 | 2020-05-15 00:15:00 | p1 | HmiyD2TTLFNqkNe | 25.08 | 22.76 | 0.00 |
2 | 2020-05-15 00:30:00 | p1 | HmiyD2TTLFNqkNe | 24.94 | 22.59 | 0.00 |
3 | 2020-05-15 00:45:00 | p1 | HmiyD2TTLFNqkNe | 24.85 | 22.36 | 0.00 |
4 | 2020-05-15 01:00:00 | p1 | HmiyD2TTLFNqkNe | 24.62 | 22.17 | 0.00 |
... | ... | ... | ... | ... | ... | ... |
3177 | 2020-06-17 22:45:00 | p1 | HmiyD2TTLFNqkNe | 22.15 | 21.48 | 0.00 |
3178 | 2020-06-17 23:00:00 | p1 | HmiyD2TTLFNqkNe | 22.13 | 21.39 | 0.00 |
3179 | 2020-06-17 23:15:00 | p1 | HmiyD2TTLFNqkNe | 22.01 | 20.71 | 0.00 |
3180 | 2020-06-17 23:30:00 | p1 | HmiyD2TTLFNqkNe | 21.97 | 20.73 | 0.00 |
3181 | 2020-06-17 23:45:00 | p1 | HmiyD2TTLFNqkNe | 21.91 | 20.43 | 0.00 |
3182 rows × 6 columns
We review the statistics.
p1w.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
AMBIENT_TEMPERATURE | 3182.00 | 25.53 | 3.35 | 20.40 | 22.71 | 24.61 | 27.92 | 35.25 |
MODULE_TEMPERATURE | 3182.00 | 31.09 | 12.26 | 18.14 | 21.09 | 24.62 | 41.31 | 65.55 |
IRRADIATION | 3182.00 | 0.23 | 0.30 | 0.00 | 0.00 | 0.02 | 0.45 | 1.22 |
We check the categorical variable, which is the sensor identifier.
p1w.SOURCE_KEY.nunique()
1
There is only one sensor for environmental variables in the plant.
We revise the date.
p1w.DATE_TIME.dt.date.value_counts().sort_index().plot.bar(figsize = (12,8));
Conclusions:
* The data period is between May 15, 2020 and June 17, 2020
* We have data for every day, there are no intermediate days missing
* But some days like 05/21 or 05/29 have fewer measurements
* So it doesn't look 100% regular
p2g.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67698 entries, 0 to 67697
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE_TIME 67698 non-null object
1 PLANT_ID 67698 non-null int64
2 SOURCE_KEY 67698 non-null object
3 DC_POWER 67698 non-null float64
4 AC_POWER 67698 non-null float64
5 DAILY_YIELD 67698 non-null float64
6 TOTAL_YIELD 67698 non-null float64
dtypes: float64(4), int64(1), object(2)
memory usage: 3.6+ MB
p2g['DATE_TIME'] = pd.to_datetime(p2g.DATE_TIME)
p2g['PLANT_ID'] = p2g.PLANT_ID.replace(4136001, 'p2')
p2g.head()
DATE_TIME | PLANT_ID | SOURCE_KEY | DC_POWER | AC_POWER | DAILY_YIELD | TOTAL_YIELD | |
---|---|---|---|---|---|---|---|
0 | 2020-05-15 | p2 | 4UPUqMRk7TRMgml | 0.00 | 0.00 | 9425.00 | 2429011.00 |
1 | 2020-05-15 | p2 | 81aHJ1q11NBPMrL | 0.00 | 0.00 | 0.00 | 1215278736.00 |
2 | 2020-05-15 | p2 | 9kRcWv60rDACzjR | 0.00 | 0.00 | 3075.33 | 2247719577.00 |
3 | 2020-05-15 | p2 | Et9kgGMDl729KT4 | 0.00 | 0.00 | 269.93 | 1704250.00 |
4 | 2020-05-15 | p2 | IQ2d7wF4YD8zU1Q | 0.00 | 0.00 | 3177.00 | 19941526.00 |
p2g.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
DC_POWER | 67698.00 | 246.70 | 370.57 | 0.00 | 0.00 | 0.00 | 446.59 | 1420.93 |
AC_POWER | 67698.00 | 241.28 | 362.11 | 0.00 | 0.00 | 0.00 | 438.22 | 1385.42 |
DAILY_YIELD | 67698.00 | 3294.89 | 2919.45 | 0.00 | 272.75 | 2911.00 | 5534.00 | 9873.00 |
TOTAL_YIELD | 67698.00 | 658944788.42 | 729667771.07 | 0.00 | 19964944.87 | 282627587.00 | 1348495113.00 | 2247916295.00 |
In this case the values of DC and AC are much closer to each other.
Let's calculate the ratio.
(p2g.DC_POWER / p2g.AC_POWER).describe()
count 32036.00
mean 1.02
std 0.01
min 0.99
25% 1.02
50% 1.02
75% 1.03
max 1.10
dtype: float64
Now the values of the ratio are very close to one.
We analyze the categorical variable, which is the identifier of the inverters.
p2g.SOURCE_KEY.nunique()
22
p2g.SOURCE_KEY.value_counts()
xoJJ8DcxJEcupym 3259
WcxssY2VbP4hApt 3259
9kRcWv60rDACzjR 3259
vOuJvMaM2sgwLmb 3259
rrq4fwE8jgrTyWY 3259
LYwnQax7tkwH5Cb 3259
LlT2YUhhzqhg5Sw 3259
q49J1IKaHRwDQnt 3259
oZZkBaNadn6DNKz 3259
PeE6FRyGXUgsRhN 3259
81aHJ1q11NBPMrL 3259
V94E5Ben1TlhnDV 3259
oZ35aAeoifZaQzV 3195
4UPUqMRk7TRMgml 3195
Qf4GUc1pJu5T6c6 3195
Mx2yZCDsyf6DPfv 3195
Et9kgGMDl729KT4 3195
Quc1TzYxW2pYoWX 3195
mqwcsP2rE7J0TFp 2355
NgDl19wMapZy17u 2355
IQ2d7wF4YD8zU1Q 2355
xMbIugepa2P7lBB 2355
Name: SOURCE_KEY, dtype: int64
Conclusiones:
* La planta 2 tiene 22 inverters
* Todos tienen un número similar de medidas aunque no exactamente igual
* A excepción de 4 que tienen unas 800 medidas menos
* Lo apuntamos para la fase de análisis
Por último vamos a analizar la fecha.
p2g.DATE_TIME.dt.date.value_counts().sort_index().plot.bar(figsize = (12,8));
Conclusiones:
* El período de datos es entre el 15 de Mayo del 2020 y el 17 de Junio de 2020
* Tenemos datos para todos los días, no falta ninguno intermedio
* Pero algunos días como el 20/05 y varios más tienen menos mediciones
* Por lo que no parece 100% regular
p2w.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3259 entries, 0 to 3258
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE_TIME 3259 non-null object
1 PLANT_ID 3259 non-null int64
2 SOURCE_KEY 3259 non-null object
3 AMBIENT_TEMPERATURE 3259 non-null float64
4 MODULE_TEMPERATURE 3259 non-null float64
5 IRRADIATION 3259 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 152.9+ KB
We correct the type of DATE_TIME
p2w.DATE_TIME = pd.to_datetime(p2w.DATE_TIME)
p2w.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3259 entries, 0 to 3258
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE_TIME 3259 non-null datetime64[ns]
1 PLANT_ID 3259 non-null int64
2 SOURCE_KEY 3259 non-null object
3 AMBIENT_TEMPERATURE 3259 non-null float64
4 MODULE_TEMPERATURE 3259 non-null float64
5 IRRADIATION 3259 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 152.9+ KB
p2w.head()
DATE_TIME | PLANT_ID | SOURCE_KEY | AMBIENT_TEMPERATURE | MODULE_TEMPERATURE | IRRADIATION | |
---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 27.00 | 25.06 | 0.00 |
1 | 2020-05-15 00:15:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.88 | 24.42 | 0.00 |
2 | 2020-05-15 00:30:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.68 | 24.43 | 0.00 |
3 | 2020-05-15 00:45:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.50 | 24.42 | 0.00 |
4 | 2020-05-15 01:00:00 | 4136001 | iq8k7ZNt4Mwm3w0 | 26.60 | 25.09 | 0.00 |
We replaced the name of the plant.
p2w['PLANT_ID'] = p2w.PLANT_ID.replace(4136001,'p2')
p2w
DATE_TIME | PLANT_ID | SOURCE_KEY | AMBIENT_TEMPERATURE | MODULE_TEMPERATURE | IRRADIATION | |
---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | p2 | iq8k7ZNt4Mwm3w0 | 27.00 | 25.06 | 0.00 |
1 | 2020-05-15 00:15:00 | p2 | iq8k7ZNt4Mwm3w0 | 26.88 | 24.42 | 0.00 |
2 | 2020-05-15 00:30:00 | p2 | iq8k7ZNt4Mwm3w0 | 26.68 | 24.43 | 0.00 |
3 | 2020-05-15 00:45:00 | p2 | iq8k7ZNt4Mwm3w0 | 26.50 | 24.42 | 0.00 |
4 | 2020-05-15 01:00:00 | p2 | iq8k7ZNt4Mwm3w0 | 26.60 | 25.09 | 0.00 |
... | ... | ... | ... | ... | ... | ... |
3254 | 2020-06-17 22:45:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.51 | 22.86 | 0.00 |
3255 | 2020-06-17 23:00:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.48 | 22.74 | 0.00 |
3256 | 2020-06-17 23:15:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.35 | 22.49 | 0.00 |
3257 | 2020-06-17 23:30:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.29 | 22.37 | 0.00 |
3258 | 2020-06-17 23:45:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
3259 rows × 6 columns
We review the statistics.
p2w.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
AMBIENT_TEMPERATURE | 3259.00 | 28.07 | 4.06 | 20.94 | 24.60 | 26.98 | 31.06 | 39.18 |
MODULE_TEMPERATURE | 3259.00 | 32.77 | 11.34 | 20.27 | 23.72 | 27.53 | 40.48 | 66.64 |
IRRADIATION | 3259.00 | 0.23 | 0.31 | 0.00 | 0.00 | 0.02 | 0.44 | 1.10 |
We analyze the categorical variable, which is the sensor identifier.
p2w.SOURCE_KEY.nunique()
1
There is only one sensor for environmental variables in the plant.
We revise the date.
p2w.DATE_TIME.dt.date.value_counts().sort_index().plot.bar(figsize = (12,8));
Conclusions:
* The data period is between May 15, 2020 and June 17, 2020
* We have data for every day, there are no intermediate days missing
* But some days like 05/15 or others have fewer measurements, although they are much less missing than in the other datasets
* But it doesn't look 100% regular
We are going to do a union by parts.
First the two generation datasets. Which will be a stack of records since the fields are the same.
After the two of environmental measures. Which will be a stack of records since the fields are the same.
And finally we will cross both partials through integration by key fields.
gener = pd.concat([p1g,p2g],axis = 'index')
gener
DATE_TIME | PLANT_ID | SOURCE_KEY | DC_POWER | AC_POWER | DAILY_YIELD | TOTAL_YIELD | |
---|---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 6259559.00 |
1 | 2020-05-15 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6183645.00 |
2 | 2020-05-15 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 6987759.00 |
3 | 2020-05-15 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7602960.00 |
4 | 2020-05-15 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7158964.00 |
... | ... | ... | ... | ... | ... | ... | ... |
67693 | 2020-06-17 23:45:00 | p2 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 4157.00 | 520758.00 |
67694 | 2020-06-17 23:45:00 | p2 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 3931.00 | 121131356.00 |
67695 | 2020-06-17 23:45:00 | p2 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 4322.00 | 2427691.00 |
67696 | 2020-06-17 23:45:00 | p2 | xMbIugepa2P7lBB | 0.00 | 0.00 | 4218.00 | 106896394.00 |
67697 | 2020-06-17 23:45:00 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 4316.00 | 209335741.00 |
136476 rows × 7 columns
Let's rename the variables now to make them more descriptive and usable.
gener.columns = ['fecha','planta','inverter_id','kw_dc','kw_ac','kw_dia','kw_total']
gener
fecha | planta | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | |
---|---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 6259559.00 |
1 | 2020-05-15 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6183645.00 |
2 | 2020-05-15 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 6987759.00 |
3 | 2020-05-15 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7602960.00 |
4 | 2020-05-15 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7158964.00 |
... | ... | ... | ... | ... | ... | ... | ... |
67693 | 2020-06-17 23:45:00 | p2 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 4157.00 | 520758.00 |
67694 | 2020-06-17 23:45:00 | p2 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 3931.00 | 121131356.00 |
67695 | 2020-06-17 23:45:00 | p2 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 4322.00 | 2427691.00 |
67696 | 2020-06-17 23:45:00 | p2 | xMbIugepa2P7lBB | 0.00 | 0.00 | 4218.00 | 106896394.00 |
67697 | 2020-06-17 23:45:00 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 4316.00 | 209335741.00 |
136476 rows × 7 columns
Now that we have the 2 plants together we are going to do what is called a consistency analysis, since according to the documentation kw_dia and kw_total are directly related to kw_dc and kw_ac.
We are going to try to replicate the data of kw_dia and kw_total.
gener2 = gener.copy()
Creamos una variable date para poder agregar por ella.
gener2['date'] = gener2.fecha.dt.date
gener2
fecha | planta | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | date | |
---|---|---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 6259559.00 | 2020-05-15 |
1 | 2020-05-15 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6183645.00 | 2020-05-15 |
2 | 2020-05-15 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 6987759.00 | 2020-05-15 |
3 | 2020-05-15 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7602960.00 | 2020-05-15 |
4 | 2020-05-15 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7158964.00 | 2020-05-15 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
67693 | 2020-06-17 23:45:00 | p2 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 4157.00 | 520758.00 | 2020-06-17 |
67694 | 2020-06-17 23:45:00 | p2 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 3931.00 | 121131356.00 | 2020-06-17 |
67695 | 2020-06-17 23:45:00 | p2 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 4322.00 | 2427691.00 | 2020-06-17 |
67696 | 2020-06-17 23:45:00 | p2 | xMbIugepa2P7lBB | 0.00 | 0.00 | 4218.00 | 106896394.00 | 2020-06-17 |
67697 | 2020-06-17 23:45:00 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 4316.00 | 209335741.00 | 2020-06-17 |
136476 rows × 8 columns
The sum per plant, date and inverter of kw_dc or kw_ac should match the maximum of kw_dia.
gener2 = gener2.groupby(['planta','date','inverter_id']).agg({'kw_dc':sum,
'kw_ac':sum,
'kw_dia':max,
'kw_total':max}).reset_index()
gener2
planta | date | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | |
---|---|---|---|---|---|---|---|
0 | p1 | 2020-05-15 | 1BY6WEcLGh8j5v7 | 235340.70 | 23046.55 | 5754.00 | 6265313.00 |
1 | p1 | 2020-05-15 | 1IF53ai7Xc0U56Y | 258911.11 | 25343.29 | 6357.00 | 6190002.00 |
2 | p1 | 2020-05-15 | 3PZuoBAID5Wc2HD | 254766.05 | 24937.70 | 6274.00 | 6994033.00 |
3 | p1 | 2020-05-15 | 7JYdWkrLSPkdwr4 | 250608.34 | 24533.91 | 6116.00 | 7609076.00 |
4 | p1 | 2020-05-15 | McdE0feGgRqW7Ca | 264030.98 | 25840.63 | 6471.00 | 7165435.00 |
... | ... | ... | ... | ... | ... | ... | ... |
1459 | p2 | 2020-06-17 | q49J1IKaHRwDQnt | 17001.51 | 16655.15 | 4157.00 | 520758.00 |
1460 | p2 | 2020-06-17 | rrq4fwE8jgrTyWY | 16073.93 | 15748.92 | 3931.00 | 121131356.00 |
1461 | p2 | 2020-06-17 | vOuJvMaM2sgwLmb | 17710.00 | 17345.44 | 4322.00 | 2427691.00 |
1462 | p2 | 2020-06-17 | xMbIugepa2P7lBB | 17211.23 | 16860.38 | 5502.00 | 106896394.00 |
1463 | p2 | 2020-06-17 | xoJJ8DcxJEcupym | 17640.42 | 17278.51 | 5327.00 | 209335741.00 |
1464 rows × 7 columns
We order to be able to analyze.
gener2 = gener2.sort_values(['planta','inverter_id','date'])
gener2
planta | date | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | |
---|---|---|---|---|---|---|---|
0 | p1 | 2020-05-15 | 1BY6WEcLGh8j5v7 | 235340.70 | 23046.55 | 5754.00 | 6265313.00 |
22 | p1 | 2020-05-16 | 1BY6WEcLGh8j5v7 | 256629.88 | 25124.49 | 6292.00 | 6271605.00 |
44 | p1 | 2020-05-17 | 1BY6WEcLGh8j5v7 | 288039.82 | 28172.85 | 7045.00 | 6278650.00 |
66 | p1 | 2020-05-18 | 1BY6WEcLGh8j5v7 | 204030.30 | 19970.51 | 4998.00 | 6283648.00 |
88 | p1 | 2020-05-19 | 1BY6WEcLGh8j5v7 | 232277.27 | 22741.18 | 6449.00 | 6290097.00 |
... | ... | ... | ... | ... | ... | ... | ... |
1375 | p2 | 2020-06-13 | xoJJ8DcxJEcupym | 27443.74 | 26840.58 | 6632.00 | 209312200.00 |
1397 | p2 | 2020-06-14 | xoJJ8DcxJEcupym | 30768.78 | 30088.97 | 7268.00 | 209319687.00 |
1419 | p2 | 2020-06-15 | xoJJ8DcxJEcupym | 25597.24 | 25049.49 | 7412.67 | 209325949.00 |
1441 | p2 | 2020-06-16 | xoJJ8DcxJEcupym | 22335.69 | 21870.63 | 6203.20 | 209331425.00 |
1463 | p2 | 2020-06-17 | xoJJ8DcxJEcupym | 17640.42 | 17278.51 | 5327.00 | 209335741.00 |
1464 rows × 7 columns
Kw_dia does not match either kw_dc or kw_ac at all.
We are going to see if it agrees with kw_total, for this we calculate the daily increase of kw_total that should coincide with the maximum of kw_day of the previous day.
gener2['lag1'] = gener2.groupby(['planta','inverter_id']).kw_total.shift(1)
gener2['incremento'] = gener2.kw_total - gener2.lag1
gener2
planta | date | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | lag1 | incremento | |
---|---|---|---|---|---|---|---|---|---|
0 | p1 | 2020-05-15 | 1BY6WEcLGh8j5v7 | 235340.70 | 23046.55 | 5754.00 | 6265313.00 | NaN | NaN |
22 | p1 | 2020-05-16 | 1BY6WEcLGh8j5v7 | 256629.88 | 25124.49 | 6292.00 | 6271605.00 | 6265313.00 | 6292.00 |
44 | p1 | 2020-05-17 | 1BY6WEcLGh8j5v7 | 288039.82 | 28172.85 | 7045.00 | 6278650.00 | 6271605.00 | 7045.00 |
66 | p1 | 2020-05-18 | 1BY6WEcLGh8j5v7 | 204030.30 | 19970.51 | 4998.00 | 6283648.00 | 6278650.00 | 4998.00 |
88 | p1 | 2020-05-19 | 1BY6WEcLGh8j5v7 | 232277.27 | 22741.18 | 6449.00 | 6290097.00 | 6283648.00 | 6449.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1375 | p2 | 2020-06-13 | xoJJ8DcxJEcupym | 27443.74 | 26840.58 | 6632.00 | 209312200.00 | 209305520.00 | 6680.00 |
1397 | p2 | 2020-06-14 | xoJJ8DcxJEcupym | 30768.78 | 30088.97 | 7268.00 | 209319687.00 | 209312200.00 | 7487.00 |
1419 | p2 | 2020-06-15 | xoJJ8DcxJEcupym | 25597.24 | 25049.49 | 7412.67 | 209325949.00 | 209319687.00 | 6262.00 |
1441 | p2 | 2020-06-16 | xoJJ8DcxJEcupym | 22335.69 | 21870.63 | 6203.20 | 209331425.00 | 209325949.00 | 5476.00 |
1463 | p2 | 2020-06-17 | xoJJ8DcxJEcupym | 17640.42 | 17278.51 | 5327.00 | 209335741.00 | 209331425.00 | 4316.00 |
1464 rows × 9 columns
We check on plant 1.
gener2[gener2.planta == 'p1'].head(50)
planta | date | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | lag1 | incremento | |
---|---|---|---|---|---|---|---|---|---|
0 | p1 | 2020-05-15 | 1BY6WEcLGh8j5v7 | 235340.70 | 23046.55 | 5754.00 | 6265313.00 | NaN | NaN |
22 | p1 | 2020-05-16 | 1BY6WEcLGh8j5v7 | 256629.88 | 25124.49 | 6292.00 | 6271605.00 | 6265313.00 | 6292.00 |
44 | p1 | 2020-05-17 | 1BY6WEcLGh8j5v7 | 288039.82 | 28172.85 | 7045.00 | 6278650.00 | 6271605.00 | 7045.00 |
66 | p1 | 2020-05-18 | 1BY6WEcLGh8j5v7 | 204030.30 | 19970.51 | 4998.00 | 6283648.00 | 6278650.00 | 4998.00 |
88 | p1 | 2020-05-19 | 1BY6WEcLGh8j5v7 | 232277.27 | 22741.18 | 6449.00 | 6290097.00 | 6283648.00 | 6449.00 |
110 | p1 | 2020-05-20 | 1BY6WEcLGh8j5v7 | 230412.62 | 22516.26 | 8249.00 | 6298346.00 | 6290097.00 | 8249.00 |
132 | p1 | 2020-05-21 | 1BY6WEcLGh8j5v7 | 288676.60 | 28223.13 | 7243.00 | 6305589.00 | 6298346.00 | 7243.00 |
154 | p1 | 2020-05-22 | 1BY6WEcLGh8j5v7 | 280809.07 | 27456.63 | 6848.00 | 6312437.00 | 6305589.00 | 6848.00 |
176 | p1 | 2020-05-23 | 1BY6WEcLGh8j5v7 | 326468.27 | 31922.93 | 7966.00 | 6320403.00 | 6312437.00 | 7966.00 |
198 | p1 | 2020-05-24 | 1BY6WEcLGh8j5v7 | 309111.73 | 30220.37 | 7537.00 | 6327940.00 | 6320403.00 | 7537.00 |
220 | p1 | 2020-05-25 | 1BY6WEcLGh8j5v7 | 339109.95 | 33144.65 | 8268.00 | 6336208.00 | 6327940.00 | 8268.00 |
242 | p1 | 2020-05-26 | 1BY6WEcLGh8j5v7 | 305515.95 | 29873.74 | 7456.43 | 6343669.00 | 6336208.00 | 7461.00 |
264 | p1 | 2020-05-27 | 1BY6WEcLGh8j5v7 | 251250.52 | 24595.61 | 6164.00 | 6349833.00 | 6343669.00 | 6164.00 |
286 | p1 | 2020-05-28 | 1BY6WEcLGh8j5v7 | 326520.42 | 31917.63 | 7977.00 | 6357810.00 | 6349833.00 | 7977.00 |
308 | p1 | 2020-05-29 | 1BY6WEcLGh8j5v7 | 310776.57 | 30387.48 | 7564.00 | 6365374.00 | 6357810.00 | 7564.00 |
330 | p1 | 2020-05-30 | 1BY6WEcLGh8j5v7 | 276466.89 | 27054.69 | 6754.00 | 6372128.00 | 6365374.00 | 6754.00 |
352 | p1 | 2020-05-31 | 1BY6WEcLGh8j5v7 | 234509.23 | 22928.22 | 5803.00 | 6377931.00 | 6372128.00 | 5803.00 |
374 | p1 | 2020-06-01 | 1BY6WEcLGh8j5v7 | 225219.77 | 22033.25 | 5508.00 | 6383439.00 | 6377931.00 | 5508.00 |
396 | p1 | 2020-06-02 | 1BY6WEcLGh8j5v7 | 286633.75 | 28043.89 | 7029.00 | 6390468.00 | 6383439.00 | 7029.00 |
418 | p1 | 2020-06-03 | 1BY6WEcLGh8j5v7 | 287069.37 | 28085.88 | 7341.00 | 6397809.00 | 6390468.00 | 7341.00 |
440 | p1 | 2020-06-04 | 1BY6WEcLGh8j5v7 | 309653.82 | 30287.02 | 7571.00 | 6405380.00 | 6397809.00 | 7571.00 |
462 | p1 | 2020-06-05 | 1BY6WEcLGh8j5v7 | 293353.25 | 28693.91 | 7162.00 | 6412542.00 | 6405380.00 | 7162.00 |
484 | p1 | 2020-06-06 | 1BY6WEcLGh8j5v7 | 243297.18 | 23793.42 | 5940.00 | 6418482.00 | 6412542.00 | 5940.00 |
506 | p1 | 2020-06-07 | 1BY6WEcLGh8j5v7 | 217217.71 | 21247.58 | 5268.00 | 6423750.00 | 6418482.00 | 5268.00 |
528 | p1 | 2020-06-08 | 1BY6WEcLGh8j5v7 | 316710.85 | 30969.72 | 7864.00 | 6431614.00 | 6423750.00 | 7864.00 |
550 | p1 | 2020-06-09 | 1BY6WEcLGh8j5v7 | 301395.46 | 29476.87 | 7456.00 | 6439070.00 | 6431614.00 | 7456.00 |
572 | p1 | 2020-06-10 | 1BY6WEcLGh8j5v7 | 242187.71 | 23707.68 | 5911.00 | 6444981.00 | 6439070.00 | 5911.00 |
594 | p1 | 2020-06-11 | 1BY6WEcLGh8j5v7 | 214404.59 | 21003.57 | 5257.00 | 6450238.00 | 6444981.00 | 5257.00 |
616 | p1 | 2020-06-12 | 1BY6WEcLGh8j5v7 | 222758.55 | 21813.00 | 5441.00 | 6455679.00 | 6450238.00 | 5441.00 |
638 | p1 | 2020-06-13 | 1BY6WEcLGh8j5v7 | 326297.09 | 31895.25 | 7984.00 | 6463663.00 | 6455679.00 | 7984.00 |
660 | p1 | 2020-06-14 | 1BY6WEcLGh8j5v7 | 164127.30 | 16085.33 | 4012.00 | 6467675.00 | 6463663.00 | 4012.00 |
682 | p1 | 2020-06-15 | 1BY6WEcLGh8j5v7 | 255298.41 | 24986.86 | 6275.00 | 6473950.00 | 6467675.00 | 6275.00 |
704 | p1 | 2020-06-16 | 1BY6WEcLGh8j5v7 | 237829.61 | 23288.02 | 5848.00 | 6479798.00 | 6473950.00 | 5848.00 |
726 | p1 | 2020-06-17 | 1BY6WEcLGh8j5v7 | 224315.74 | 21957.17 | 5521.00 | 6485319.00 | 6479798.00 | 5521.00 |
1 | p1 | 2020-05-15 | 1IF53ai7Xc0U56Y | 258911.11 | 25343.29 | 6357.00 | 6190002.00 | NaN | NaN |
23 | p1 | 2020-05-16 | 1IF53ai7Xc0U56Y | 269678.23 | 26396.93 | 6592.00 | 6196594.00 | 6190002.00 | 6592.00 |
45 | p1 | 2020-05-17 | 1IF53ai7Xc0U56Y | 316701.61 | 30962.51 | 7759.00 | 6204353.00 | 6196594.00 | 7759.00 |
67 | p1 | 2020-05-18 | 1IF53ai7Xc0U56Y | 221984.75 | 21723.95 | 5453.00 | 6209806.00 | 6204353.00 | 5453.00 |
89 | p1 | 2020-05-19 | 1IF53ai7Xc0U56Y | 237092.43 | 23202.78 | 6568.75 | 6216430.00 | 6209806.00 | 6624.00 |
111 | p1 | 2020-05-20 | 1IF53ai7Xc0U56Y | 251521.07 | 24564.96 | 8997.00 | 6225427.00 | 6216430.00 | 8997.00 |
133 | p1 | 2020-05-21 | 1IF53ai7Xc0U56Y | 310547.48 | 30352.05 | 7816.50 | 6233247.00 | 6225427.00 | 7820.00 |
155 | p1 | 2020-05-22 | 1IF53ai7Xc0U56Y | 296998.09 | 29011.33 | 7287.00 | 6240534.00 | 6233247.00 | 7287.00 |
177 | p1 | 2020-05-23 | 1IF53ai7Xc0U56Y | 355283.96 | 34726.34 | 8673.00 | 6249207.00 | 6240534.00 | 8673.00 |
199 | p1 | 2020-05-24 | 1IF53ai7Xc0U56Y | 339436.82 | 33170.41 | 8273.00 | 6257480.00 | 6249207.00 | 8273.00 |
221 | p1 | 2020-05-25 | 1IF53ai7Xc0U56Y | 371166.75 | 36249.54 | 9048.00 | 6266528.00 | 6257480.00 | 9048.00 |
243 | p1 | 2020-05-26 | 1IF53ai7Xc0U56Y | 332953.20 | 32544.54 | 8109.50 | 6274638.00 | 6266528.00 | 8110.00 |
265 | p1 | 2020-05-27 | 1IF53ai7Xc0U56Y | 279934.18 | 27390.71 | 6824.00 | 6281462.00 | 6274638.00 | 6824.00 |
287 | p1 | 2020-05-28 | 1IF53ai7Xc0U56Y | 348916.90 | 34097.52 | 8496.00 | 6289958.00 | 6281462.00 | 8496.00 |
309 | p1 | 2020-05-29 | 1IF53ai7Xc0U56Y | 338805.39 | 33116.10 | 8253.00 | 6298211.00 | 6289958.00 | 8253.00 |
331 | p1 | 2020-05-30 | 1IF53ai7Xc0U56Y | 300749.29 | 29414.52 | 7332.00 | 6305543.00 | 6298211.00 | 7332.00 |
We check on plant 2.
gener2[gener2.planta == 'p2'].head(50)
planta | date | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | lag1 | incremento | |
---|---|---|---|---|---|---|---|---|---|
748 | p2 | 2020-05-15 | 4UPUqMRk7TRMgml | 17112.65 | 16744.08 | 9425.00 | 2433212.00 | NaN | NaN |
770 | p2 | 2020-05-16 | 4UPUqMRk7TRMgml | 23305.45 | 22791.22 | 5677.00 | 2438889.00 | 2433212.00 | 5677.00 |
792 | p2 | 2020-05-17 | 4UPUqMRk7TRMgml | 25985.95 | 25430.42 | 6342.00 | 2445231.00 | 2438889.00 | 6342.00 |
814 | p2 | 2020-05-18 | 4UPUqMRk7TRMgml | 31218.58 | 30516.03 | 7641.00 | 2452872.00 | 2445231.00 | 7641.00 |
836 | p2 | 2020-05-19 | 4UPUqMRk7TRMgml | 8040.00 | 7878.39 | 7641.00 | 2454841.00 | 2452872.00 | 1969.00 |
858 | p2 | 2020-05-20 | 4UPUqMRk7TRMgml | 9852.13 | 9641.35 | 9423.00 | 2464264.00 | 2454841.00 | 9423.00 |
880 | p2 | 2020-05-21 | 4UPUqMRk7TRMgml | 37403.77 | 36540.63 | 9423.00 | 2473415.00 | 2464264.00 | 9151.00 |
898 | p2 | 2020-05-22 | 4UPUqMRk7TRMgml | 40414.35 | 39472.14 | 9863.00 | 2483278.00 | 2473415.00 | 9863.00 |
916 | p2 | 2020-05-23 | 4UPUqMRk7TRMgml | 39690.66 | 38769.69 | 9863.00 | 2492966.00 | 2483278.00 | 9688.00 |
934 | p2 | 2020-05-24 | 4UPUqMRk7TRMgml | 18963.72 | 18558.18 | 9688.00 | 2497601.00 | 2492966.00 | 4635.00 |
952 | p2 | 2020-05-25 | 4UPUqMRk7TRMgml | 37659.21 | 36789.31 | 9224.00 | 2506825.00 | 2497601.00 | 9224.00 |
970 | p2 | 2020-05-26 | 4UPUqMRk7TRMgml | 38900.22 | 37995.67 | 9501.00 | 2516326.00 | 2506825.00 | 9501.00 |
988 | p2 | 2020-05-27 | 4UPUqMRk7TRMgml | 38919.76 | 38017.83 | 9501.00 | 2525827.00 | 2516326.00 | 9501.00 |
1006 | p2 | 2020-05-28 | 4UPUqMRk7TRMgml | 15261.36 | 14937.97 | 9501.00 | 2529556.00 | 2525827.00 | 3729.00 |
1024 | p2 | 2020-05-29 | 4UPUqMRk7TRMgml | 32521.41 | 31811.48 | 8126.00 | 2537682.00 | 2529556.00 | 8126.00 |
1046 | p2 | 2020-05-30 | 4UPUqMRk7TRMgml | 33420.25 | 32663.54 | 8185.00 | 2545867.00 | 2537682.00 | 8185.00 |
1068 | p2 | 2020-05-31 | 4UPUqMRk7TRMgml | 27317.90 | 26701.78 | 8185.00 | 2552559.00 | 2545867.00 | 6692.00 |
1090 | p2 | 2020-06-01 | 4UPUqMRk7TRMgml | 17867.74 | 17462.72 | 6692.00 | 2556897.00 | 2552559.00 | 4338.00 |
1112 | p2 | 2020-06-02 | 4UPUqMRk7TRMgml | 30176.94 | 29508.22 | 7404.00 | 2564301.00 | 2556897.00 | 7404.00 |
1134 | p2 | 2020-06-03 | 4UPUqMRk7TRMgml | 23834.47 | 23319.73 | 7404.00 | 2570414.00 | 2564301.00 | 6113.00 |
1156 | p2 | 2020-06-04 | 4UPUqMRk7TRMgml | 28736.11 | 28109.01 | 6990.00 | 2577404.00 | 2570414.00 | 6990.00 |
1178 | p2 | 2020-06-05 | 4UPUqMRk7TRMgml | 31594.37 | 30884.49 | 7756.00 | 2585160.00 | 2577404.00 | 7756.00 |
1200 | p2 | 2020-06-06 | 4UPUqMRk7TRMgml | 34226.22 | 33435.09 | 8376.00 | 2593536.00 | 2585160.00 | 8376.00 |
1222 | p2 | 2020-06-07 | 4UPUqMRk7TRMgml | 32026.91 | 31285.99 | 8376.00 | 2601362.00 | 2593536.00 | 7826.00 |
1244 | p2 | 2020-06-08 | 4UPUqMRk7TRMgml | 37814.61 | 36940.31 | 9210.00 | 2610588.00 | 2601362.00 | 9226.00 |
1266 | p2 | 2020-06-09 | 4UPUqMRk7TRMgml | 33205.47 | 32465.97 | 9217.93 | 2618699.00 | 2610588.00 | 8111.00 |
1288 | p2 | 2020-06-10 | 4UPUqMRk7TRMgml | 22793.26 | 22291.98 | 8098.00 | 2624287.00 | 2618699.00 | 5588.00 |
1310 | p2 | 2020-06-11 | 4UPUqMRk7TRMgml | 16069.58 | 15745.07 | 5588.00 | 2628220.00 | 2624287.00 | 3933.00 |
1332 | p2 | 2020-06-12 | 4UPUqMRk7TRMgml | 18684.47 | 18306.11 | 4548.00 | 2632794.00 | 2628220.00 | 4574.00 |
1354 | p2 | 2020-06-13 | 4UPUqMRk7TRMgml | 25901.76 | 25335.74 | 6289.00 | 2639150.00 | 2632794.00 | 6356.00 |
1376 | p2 | 2020-06-14 | 4UPUqMRk7TRMgml | 8863.51 | 8699.19 | 6462.80 | 2641320.00 | 2639150.00 | 2170.00 |
1398 | p2 | 2020-06-15 | 4UPUqMRk7TRMgml | 10205.79 | 9999.45 | 2304.00 | 2643840.00 | 2641320.00 | 2520.00 |
1420 | p2 | 2020-06-16 | 4UPUqMRk7TRMgml | 21302.43 | 20858.99 | 5004.00 | 2649049.00 | 2643840.00 | 5209.00 |
1442 | p2 | 2020-06-17 | 4UPUqMRk7TRMgml | 18153.88 | 17780.37 | 5004.00 | 2653495.00 | 2649049.00 | 4446.00 |
749 | p2 | 2020-05-15 | 81aHJ1q11NBPMrL | 13169.45 | 12896.82 | 3214.00 | 1215281950.00 | NaN | NaN |
771 | p2 | 2020-05-16 | 81aHJ1q11NBPMrL | 19238.71 | 18812.08 | 4857.00 | 1215286807.00 | 1215281950.00 | 4857.00 |
793 | p2 | 2020-05-17 | 81aHJ1q11NBPMrL | 23577.93 | 23078.49 | 5758.00 | 1215292565.00 | 1215286807.00 | 5758.00 |
815 | p2 | 2020-05-18 | 81aHJ1q11NBPMrL | 14117.46 | 13820.30 | 3485.00 | 1215296050.00 | 1215292565.00 | 3485.00 |
837 | p2 | 2020-05-19 | 81aHJ1q11NBPMrL | 27893.17 | 27253.87 | 6771.00 | 1215302821.00 | 1215296050.00 | 6771.00 |
859 | p2 | 2020-05-20 | 81aHJ1q11NBPMrL | 14040.81 | 13748.34 | 3439.00 | 1215306260.00 | 1215302821.00 | 3439.00 |
881 | p2 | 2020-05-21 | 81aHJ1q11NBPMrL | 29804.31 | 29105.05 | 8570.00 | 1215314830.00 | 1215306260.00 | 8570.00 |
899 | p2 | 2020-05-22 | 81aHJ1q11NBPMrL | 13268.32 | 12992.02 | 3269.00 | 1215318099.00 | 1215314830.00 | 3269.00 |
917 | p2 | 2020-05-23 | 81aHJ1q11NBPMrL | 27742.32 | 27098.60 | 9333.00 | 1215327432.00 | 1215318099.00 | 9333.00 |
935 | p2 | 2020-05-24 | 81aHJ1q11NBPMrL | 15849.25 | 15512.54 | 9100.00 | 1215336532.00 | 1215327432.00 | 9100.00 |
953 | p2 | 2020-05-25 | 81aHJ1q11NBPMrL | 35834.31 | 35018.35 | 8860.00 | 1215345392.00 | 1215336532.00 | 8860.00 |
971 | p2 | 2020-05-26 | 81aHJ1q11NBPMrL | 14655.46 | 14326.96 | 3212.67 | 1215348540.93 | 1215345392.00 | 3148.93 |
989 | p2 | 2020-05-27 | 81aHJ1q11NBPMrL | 5380.56 | 5270.20 | 9138.00 | 1215363661.00 | 1215348540.93 | 15120.07 |
1007 | p2 | 2020-05-28 | 81aHJ1q11NBPMrL | 36201.81 | 35369.55 | 8947.00 | 1215372608.00 | 1215363661.00 | 8947.00 |
1025 | p2 | 2020-05-29 | 81aHJ1q11NBPMrL | 18928.49 | 18527.04 | 4812.00 | 1215377420.00 | 1215372608.00 | 4812.00 |
1047 | p2 | 2020-05-30 | 81aHJ1q11NBPMrL | 29388.91 | 28733.23 | 7156.00 | 1215384576.00 | 1215377420.00 | 7156.00 |
Conclusions: * kw_dia is consistent with kw_total * but these are not consistent with kw_dc or kw_ac * it's as if they are in different units or there is some calculation that we are not aware of * therefore we will have 2 blocks to be able to use: either kw_dc with kw_ac, or kw_dia with kw_total, but we cannot mix them together
temper = pd.concat([p1w,p2w], axis = 'index')
temper
DATE_TIME | PLANT_ID | SOURCE_KEY | AMBIENT_TEMPERATURE | MODULE_TEMPERATURE | IRRADIATION | |
---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | p1 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
1 | 2020-05-15 00:15:00 | p1 | HmiyD2TTLFNqkNe | 25.08 | 22.76 | 0.00 |
2 | 2020-05-15 00:30:00 | p1 | HmiyD2TTLFNqkNe | 24.94 | 22.59 | 0.00 |
3 | 2020-05-15 00:45:00 | p1 | HmiyD2TTLFNqkNe | 24.85 | 22.36 | 0.00 |
4 | 2020-05-15 01:00:00 | p1 | HmiyD2TTLFNqkNe | 24.62 | 22.17 | 0.00 |
... | ... | ... | ... | ... | ... | ... |
3254 | 2020-06-17 22:45:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.51 | 22.86 | 0.00 |
3255 | 2020-06-17 23:00:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.48 | 22.74 | 0.00 |
3256 | 2020-06-17 23:15:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.35 | 22.49 | 0.00 |
3257 | 2020-06-17 23:30:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.29 | 22.37 | 0.00 |
3258 | 2020-06-17 23:45:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
6441 rows × 6 columns
Let's rename the variables now to make them more descriptive and usable.
temper.columns = ['fecha','planta','sensor_id','t_ambiente','t_modulo','irradiacion']
temper
fecha | planta | sensor_id | t_ambiente | t_modulo | irradiacion | |
---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | p1 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
1 | 2020-05-15 00:15:00 | p1 | HmiyD2TTLFNqkNe | 25.08 | 22.76 | 0.00 |
2 | 2020-05-15 00:30:00 | p1 | HmiyD2TTLFNqkNe | 24.94 | 22.59 | 0.00 |
3 | 2020-05-15 00:45:00 | p1 | HmiyD2TTLFNqkNe | 24.85 | 22.36 | 0.00 |
4 | 2020-05-15 01:00:00 | p1 | HmiyD2TTLFNqkNe | 24.62 | 22.17 | 0.00 |
... | ... | ... | ... | ... | ... | ... |
3254 | 2020-06-17 22:45:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.51 | 22.86 | 0.00 |
3255 | 2020-06-17 23:00:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.48 | 22.74 | 0.00 |
3256 | 2020-06-17 23:15:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.35 | 22.49 | 0.00 |
3257 | 2020-06-17 23:30:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.29 | 22.37 | 0.00 |
3258 | 2020-06-17 23:45:00 | p2 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
6441 rows × 6 columns
In this case, the key field is made up of date and plant and commands the generation dataset, since the temperature field only provides us with additional variables.
df = pd.merge(left = gener, right = temper, how = 'left', on = ['fecha','planta'])
df
fecha | planta | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | sensor_id | t_ambiente | t_modulo | irradiacion | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 6259559.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
1 | 2020-05-15 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6183645.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
2 | 2020-05-15 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 6987759.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
3 | 2020-05-15 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7602960.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
4 | 2020-05-15 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7158964.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
136471 | 2020-06-17 23:45:00 | p2 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 4157.00 | 520758.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136472 | 2020-06-17 23:45:00 | p2 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 3931.00 | 121131356.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136473 | 2020-06-17 23:45:00 | p2 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 4322.00 | 2427691.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136474 | 2020-06-17 23:45:00 | p2 | xMbIugepa2P7lBB | 0.00 | 0.00 | 4218.00 | 106896394.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136475 | 2020-06-17 23:45:00 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 4316.00 | 209335741.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136476 rows × 11 columns
After an integration it is always convenient to check if nulls have been generated.
df.isna().sum()
fecha 0
planta 0
inverter_id 0
kw_dc 0
kw_ac 0
kw_dia 0
kw_total 0
sensor_id 4
t_ambiente 4
t_modulo 4
irradiacion 4
dtype: int64
We search if the nulls fulfill some pattern.
nulos = df[df.sensor_id.isna()]
nulos
fecha | planta | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | sensor_id | t_ambiente | t_modulo | irradiacion | |
---|---|---|---|---|---|---|---|---|---|---|---|
38544 | 2020-06-03 14:00:00 | p1 | 1IF53ai7Xc0U56Y | 7003.00 | 685.80 | 5601.00 | 6330385.00 | NaN | NaN | NaN | NaN |
38545 | 2020-06-03 14:00:00 | p1 | adLQvlD726eNBSB | 7204.00 | 705.40 | 5685.00 | 6419961.00 | NaN | NaN | NaN | NaN |
38546 | 2020-06-03 14:00:00 | p1 | wCURE6d3bPkepu2 | 7545.00 | 738.70 | 5579.00 | 6928448.00 | NaN | NaN | NaN | NaN |
38547 | 2020-06-03 14:00:00 | p1 | z9Y9gH1T5YWrNuG | 7946.00 | 777.80 | 5541.00 | 7152815.00 | NaN | NaN | NaN | NaN |
It is about June 3 at 2:00 p.m., which for some reason does not have temperature data but only for 4 inverters on floor 1.
We are going to search the temperature dataset if that datetime exists.
temper[temper.fecha.between('2020-06-03 13:30:00', '2020-06-03 14:30:00')]
fecha | planta | sensor_id | t_ambiente | t_modulo | irradiacion | |
---|---|---|---|---|---|---|
1797 | 2020-06-03 13:30:00 | p1 | HmiyD2TTLFNqkNe | 28.56 | 48.78 | 0.75 |
1798 | 2020-06-03 13:45:00 | p1 | HmiyD2TTLFNqkNe | 27.86 | 46.63 | 0.62 |
1799 | 2020-06-03 14:15:00 | p1 | HmiyD2TTLFNqkNe | 28.36 | 50.63 | 0.72 |
1800 | 2020-06-03 14:30:00 | p1 | HmiyD2TTLFNqkNe | 28.31 | 47.38 | 0.67 |
1874 | 2020-06-03 13:30:00 | p2 | iq8k7ZNt4Mwm3w0 | 28.60 | 37.68 | 0.41 |
1875 | 2020-06-03 13:45:00 | p2 | iq8k7ZNt4Mwm3w0 | 28.75 | 36.35 | 0.35 |
1876 | 2020-06-03 14:15:00 | p2 | iq8k7ZNt4Mwm3w0 | 30.45 | 45.63 | 0.76 |
1877 | 2020-06-03 14:30:00 | p2 | iq8k7ZNt4Mwm3w0 | 29.64 | 40.40 | 0.44 |
Indeed we see that this section is missing on both floors. However, there are only measurements at that time on floor 1, and only on 4 inverters.
So there would be two solutions:
Since it seems like a measurement strip of its own, only 4 inverters on floor 1, we are going to choose to eliminate them.
df.dropna(inplace = True)
df
fecha | planta | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | sensor_id | t_ambiente | t_modulo | irradiacion | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-05-15 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 6259559.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
1 | 2020-05-15 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6183645.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
2 | 2020-05-15 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 6987759.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
3 | 2020-05-15 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7602960.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
4 | 2020-05-15 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7158964.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
136471 | 2020-06-17 23:45:00 | p2 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 4157.00 | 520758.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136472 | 2020-06-17 23:45:00 | p2 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 3931.00 | 121131356.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136473 | 2020-06-17 23:45:00 | p2 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 4322.00 | 2427691.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136474 | 2020-06-17 23:45:00 | p2 | xMbIugepa2P7lBB | 0.00 | 0.00 | 4218.00 | 106896394.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136475 | 2020-06-17 23:45:00 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 4316.00 | 209335741.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136472 rows × 11 columns
Finally we are going to pass the date to the index to be able to use all the power of Pandas.
df.set_index('fecha', inplace = True)
df
planta | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | sensor_id | t_ambiente | t_modulo | irradiacion | |
---|---|---|---|---|---|---|---|---|---|---|
fecha | ||||||||||
2020-05-15 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 6259559.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
2020-05-15 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6183645.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
2020-05-15 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 6987759.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
2020-05-15 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7602960.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
2020-05-15 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7158964.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-17 23:45:00 | p2 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 4157.00 | 520758.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
2020-06-17 23:45:00 | p2 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 3931.00 | 121131356.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
2020-06-17 23:45:00 | p2 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 4322.00 | 2427691.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
2020-06-17 23:45:00 | p2 | xMbIugepa2P7lBB | 0.00 | 0.00 | 4218.00 | 106896394.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
2020-06-17 23:45:00 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 4316.00 | 209335741.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136472 rows × 10 columns
df.to_pickle('df.pickle')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import fiser_tools as fs
fs.misc.dark_theme()
#Automcompletar rápido
%config IPCompleter.greedy=True
#Formato de display
pd.options.display.float_format = '{:15.2f}'.format
df = pd.read_pickle('df.pickle')
df
planta | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | sensor_id | t_ambiente | t_modulo | irradiacion | |
---|---|---|---|---|---|---|---|---|---|---|
fecha | ||||||||||
2020-05-15 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 6259559.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
2020-05-15 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6183645.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
2020-05-15 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 6987759.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
2020-05-15 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7602960.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
2020-05-15 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7158964.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-17 23:45:00 | p2 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 4157.00 | 520758.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
2020-06-17 23:45:00 | p2 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 3931.00 | 121131356.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
2020-06-17 23:45:00 | p2 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 4322.00 | 2427691.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
2020-06-17 23:45:00 | p2 | xMbIugepa2P7lBB | 0.00 | 0.00 | 4218.00 | 106896394.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
2020-06-17 23:45:00 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 4316.00 | 209335741.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 |
136472 rows × 10 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 136472 entries, 2020-05-15 00:00:00 to 2020-06-17 23:45:00
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 planta 136472 non-null object
1 inverter_id 136472 non-null object
2 kw_dc 136472 non-null float64
3 kw_ac 136472 non-null float64
4 kw_dia 136472 non-null float64
5 kw_total 136472 non-null float64
6 sensor_id 136472 non-null object
7 t_ambiente 136472 non-null float64
8 t_modulo 136472 non-null float64
9 irradiacion 136472 non-null float64
dtypes: float64(7), object(3)
memory usage: 11.5+ MB
We start by extracting the date components and adding them as new variables.
def componentes_fecha(dataframe):
mes = dataframe.index.month
dia = dataframe.index.day
hora = dataframe.index.hour
minuto = dataframe.index.minute
return(pd.DataFrame({'mes':mes, 'dia':dia, 'hora':hora, 'minuto':minuto}))
df = pd.concat([df.reset_index(),componentes_fecha(df)], axis = 1).set_index('fecha')
df
planta | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | sensor_id | t_ambiente | t_modulo | irradiacion | mes | dia | hora | minuto | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | ||||||||||||||
2020-05-15 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 6259559.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 | 5 | 15 | 0 | 0 |
2020-05-15 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6183645.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 | 5 | 15 | 0 | 0 |
2020-05-15 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 6987759.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 | 5 | 15 | 0 | 0 |
2020-05-15 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7602960.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 | 5 | 15 | 0 | 0 |
2020-05-15 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7158964.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 | 5 | 15 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-17 23:45:00 | p2 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 4157.00 | 520758.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 | 6 | 17 | 23 | 45 |
2020-06-17 23:45:00 | p2 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 3931.00 | 121131356.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 | 6 | 17 | 23 | 45 |
2020-06-17 23:45:00 | p2 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 4322.00 | 2427691.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 | 6 | 17 | 23 | 45 |
2020-06-17 23:45:00 | p2 | xMbIugepa2P7lBB | 0.00 | 0.00 | 4218.00 | 106896394.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 | 6 | 17 | 23 | 45 |
2020-06-17 23:45:00 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 4316.00 | 209335741.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 | 6 | 17 | 23 | 45 |
136472 rows × 14 columns
We are going to create the inverter efficiency variable, which consists of the percentage of DC that successfully transforms to AC.
But we are presented with a very common difficulty in ratios, that the denominator can be zero.
If that were the case, when doing the ratio it would return a null.
In our case the denominator is DC, so if the DC generation were zero, the AC generation should also be zero.
We can correct that simply by imputing the nulls that come out with zeros.
def eficiencia_inverter(AC,DC):
temp = AC / DC * 100
return(temp.fillna(0))
df['eficiencia'] = eficiencia_inverter(df.kw_ac, df.kw_dc)
We check that it has not generated nulls.
df.eficiencia.isna().sum()
0
We visualize efficiency globally.
df.eficiencia.plot.kde();
Here is something important.
There are two clearly differentiated groups and one of them is clearly inefficient.
But for now we'll leave it written down and later we'll review which entity is having problems: plant, inverter, etc.
In this case it is very important not to start analyzing by analyzing, but to follow the plan defined in the project design, since there is a very clear order in the process: environmental factors --> kw_dc --> kw ac.
So let's rearrange the columns of the df to help us interpret in this order.
df
planta | inverter_id | kw_dc | kw_ac | kw_dia | kw_total | sensor_id | t_ambiente | t_modulo | irradiacion | mes | dia | hora | minuto | eficiencia | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||||||||
2020-05-15 00:00:00 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 6259559.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 | 5 | 15 | 0 | 0 | 0.00 |
2020-05-15 00:00:00 | p1 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 6183645.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 | 5 | 15 | 0 | 0 | 0.00 |
2020-05-15 00:00:00 | p1 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 6987759.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 | 5 | 15 | 0 | 0 | 0.00 |
2020-05-15 00:00:00 | p1 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 7602960.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 | 5 | 15 | 0 | 0 | 0.00 |
2020-05-15 00:00:00 | p1 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 7158964.00 | HmiyD2TTLFNqkNe | 25.18 | 22.86 | 0.00 | 5 | 15 | 0 | 0 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-17 23:45:00 | p2 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 4157.00 | 520758.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 | 6 | 17 | 23 | 45 | 0.00 |
2020-06-17 23:45:00 | p2 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 3931.00 | 121131356.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 | 6 | 17 | 23 | 45 | 0.00 |
2020-06-17 23:45:00 | p2 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 4322.00 | 2427691.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 | 6 | 17 | 23 | 45 | 0.00 |
2020-06-17 23:45:00 | p2 | xMbIugepa2P7lBB | 0.00 | 0.00 | 4218.00 | 106896394.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 | 6 | 17 | 23 | 45 | 0.00 |
2020-06-17 23:45:00 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 4316.00 | 209335741.00 | iq8k7ZNt4Mwm3w0 | 23.20 | 22.54 | 0.00 | 6 | 17 | 23 | 45 | 0.00 |
136472 rows × 15 columns
orden = ['planta','mes','dia','hora','minuto','sensor_id','irradiacion','t_ambiente','t_modulo','inverter_id','kw_dc','kw_ac','eficiencia','kw_dia','kw_total']
df = df[orden]
df
planta | mes | dia | hora | minuto | sensor_id | irradiacion | t_ambiente | t_modulo | inverter_id | kw_dc | kw_ac | eficiencia | kw_dia | kw_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||||||||
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 0.00 | 6259559.00 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 0.00 | 6183645.00 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 0.00 | 6987759.00 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 0.00 | 7602960.00 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 0.00 | 7158964.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 0.00 | 4157.00 | 520758.00 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 0.00 | 3931.00 | 121131356.00 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 0.00 | 4322.00 | 2427691.00 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | xMbIugepa2P7lBB | 0.00 | 0.00 | 0.00 | 4218.00 | 106896394.00 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 0.00 | 4316.00 | 209335741.00 |
136472 rows × 15 columns
The level of analysis at which we have the data is every 15 minutes, which may be too disaggregated for certain analyses.
We are going to build a version of the dataframe added to the day level.
For this we use resample to do downgrading.
We must add by plant and inverter, which are the key fields of our dataset.
Since we have variables to which different aggregation functions apply we can use the dictionary format of agg()
df.head()
planta | mes | dia | hora | minuto | sensor_id | irradiacion | t_ambiente | t_modulo | inverter_id | kw_dc | kw_ac | eficiencia | kw_dia | kw_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||||||||
2020-05-15 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 0.00 | 6259559.00 |
2020-05-15 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 0.00 | 6183645.00 |
2020-05-15 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 0.00 | 6987759.00 |
2020-05-15 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 0.00 | 7602960.00 |
2020-05-15 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 0.00 | 7158964.00 |
df_dia = df.groupby(['planta', 'inverter_id']).resample('D') \
.agg({'irradiacion': [min,np.mean,max],
't_ambiente': [min,np.mean,max],
't_modulo': [min,np.mean,max],
'kw_dc': [min,np.mean,max,sum],
'kw_ac': [min,np.mean,max,sum],
'eficiencia': [min,np.mean,max],
'kw_dia': max,
'kw_total': max})
df_dia
irradiacion | t_ambiente | t_modulo | kw_dc | kw_ac | eficiencia | kw_dia | kw_total | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
min | mean | max | min | mean | max | min | mean | max | min | mean | max | sum | min | mean | max | sum | min | mean | max | max | max | |||
planta | inverter_id | fecha | ||||||||||||||||||||||
p1 | 1BY6WEcLGh8j5v7 | 2020-05-15 | 0.00 | 0.20 | 0.89 | 22.04 | 27.43 | 34.43 | 20.29 | 32.58 | 55.03 | 0.00 | 2530.55 | 10642.75 | 235340.70 | 0.00 | 247.81 | 1039.35 | 23046.55 | 0.00 | 5.26 | 9.82 | 5754.00 | 6265313.00 |
2020-05-16 | 0.00 | 0.21 | 0.81 | 21.50 | 26.78 | 32.52 | 19.59 | 31.86 | 54.23 | 0.00 | 2916.25 | 11209.00 | 256629.88 | 0.00 | 285.51 | 1095.29 | 25124.49 | 0.00 | 5.56 | 9.83 | 6292.00 | 6271605.00 | ||
2020-05-17 | 0.00 | 0.24 | 1.00 | 21.21 | 26.69 | 35.25 | 20.38 | 32.74 | 63.15 | 0.00 | 3000.41 | 11416.43 | 288039.82 | 0.00 | 293.47 | 1114.81 | 28172.85 | 0.00 | 4.99 | 9.82 | 7045.00 | 6278650.00 | ||
2020-05-18 | 0.00 | 0.16 | 0.97 | 20.96 | 23.85 | 28.37 | 19.48 | 27.81 | 53.94 | 0.00 | 2125.32 | 12238.86 | 204030.30 | 0.00 | 208.03 | 1193.63 | 19970.51 | 0.00 | 4.99 | 9.83 | 4998.00 | 6283648.00 | ||
2020-05-19 | 0.00 | 0.19 | 0.84 | 22.39 | 25.34 | 30.37 | 20.06 | 29.73 | 51.85 | 0.00 | 2497.61 | 10854.50 | 232277.27 | 0.00 | 244.53 | 1059.80 | 22741.18 | 0.00 | 4.63 | 9.83 | 6449.00 | 6290097.00 | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
p2 | xoJJ8DcxJEcupym | 2020-06-13 | 0.00 | 0.22 | 0.93 | 22.20 | 26.12 | 31.91 | 20.88 | 30.39 | 51.01 | 0.00 | 285.87 | 1271.15 | 27443.74 | 0.00 | 279.59 | 1239.95 | 26840.58 | 0.00 | 51.91 | 98.29 | 6632.00 | 209312200.00 |
2020-06-14 | 0.00 | 0.23 | 0.92 | 23.65 | 27.02 | 32.96 | 22.12 | 31.59 | 52.99 | 0.00 | 320.51 | 1362.06 | 30768.78 | 0.00 | 313.43 | 1328.21 | 30088.97 | 0.00 | 51.94 | 99.92 | 7268.00 | 209319687.00 | ||
2020-06-15 | 0.00 | 0.19 | 0.83 | 24.00 | 26.56 | 31.61 | 23.00 | 30.00 | 48.46 | 0.00 | 266.64 | 1288.28 | 25597.24 | 0.00 | 260.93 | 1256.67 | 25049.49 | 0.00 | 52.94 | 98.27 | 7412.67 | 209325949.00 | ||
2020-06-16 | 0.00 | 0.17 | 0.77 | 23.63 | 26.37 | 30.83 | 22.56 | 29.59 | 46.36 | 0.00 | 232.66 | 1124.97 | 22335.69 | 0.00 | 227.82 | 1098.21 | 21870.63 | 0.00 | 50.90 | 98.29 | 6203.20 | 209331425.00 | ||
2020-06-17 | 0.00 | 0.12 | 0.58 | 22.55 | 24.85 | 29.04 | 21.91 | 26.67 | 42.33 | 0.00 | 183.75 | 828.77 | 17640.42 | 0.00 | 179.98 | 810.77 | 17278.51 | 0.00 | 52.90 | 98.32 | 5327.00 | 209335741.00 |
1496 rows × 22 columns
It has been generated for us with a multi-index, both in rows and in columns.
To remove the from the columns we can flatten the names with .to_flat_index().
This returns the levels in tuples, which we can then join with a list comprehension.
Let's review how to_flat_index() returns the column names
tuplas = df_dia.columns.to_flat_index()
tuplas
Index([ ('irradiacion', 'min'), ('irradiacion', 'mean'),
('irradiacion', 'max'), ('t_ambiente', 'min'),
('t_ambiente', 'mean'), ('t_ambiente', 'max'),
('t_modulo', 'min'), ('t_modulo', 'mean'),
('t_modulo', 'max'), ('kw_dc', 'min'),
('kw_dc', 'mean'), ('kw_dc', 'max'),
('kw_dc', 'sum'), ('kw_ac', 'min'),
('kw_ac', 'mean'), ('kw_ac', 'max'),
('kw_ac', 'sum'), ('eficiencia', 'min'),
('eficiencia', 'mean'), ('eficiencia', 'max'),
('kw_dia', 'max'), ('kw_total', 'max')],
dtype='object')
And we join both parts of the pair with an underscore using .join
df_dia.columns = ["_".join(par) for par in tuplas]
df_dia
irradiacion_min | irradiacion_mean | irradiacion_max | t_ambiente_min | t_ambiente_mean | t_ambiente_max | t_modulo_min | t_modulo_mean | t_modulo_max | kw_dc_min | kw_dc_mean | kw_dc_max | kw_dc_sum | kw_ac_min | kw_ac_mean | kw_ac_max | kw_ac_sum | eficiencia_min | eficiencia_mean | eficiencia_max | kw_dia_max | kw_total_max | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
planta | inverter_id | fecha | ||||||||||||||||||||||
p1 | 1BY6WEcLGh8j5v7 | 2020-05-15 | 0.00 | 0.20 | 0.89 | 22.04 | 27.43 | 34.43 | 20.29 | 32.58 | 55.03 | 0.00 | 2530.55 | 10642.75 | 235340.70 | 0.00 | 247.81 | 1039.35 | 23046.55 | 0.00 | 5.26 | 9.82 | 5754.00 | 6265313.00 |
2020-05-16 | 0.00 | 0.21 | 0.81 | 21.50 | 26.78 | 32.52 | 19.59 | 31.86 | 54.23 | 0.00 | 2916.25 | 11209.00 | 256629.88 | 0.00 | 285.51 | 1095.29 | 25124.49 | 0.00 | 5.56 | 9.83 | 6292.00 | 6271605.00 | ||
2020-05-17 | 0.00 | 0.24 | 1.00 | 21.21 | 26.69 | 35.25 | 20.38 | 32.74 | 63.15 | 0.00 | 3000.41 | 11416.43 | 288039.82 | 0.00 | 293.47 | 1114.81 | 28172.85 | 0.00 | 4.99 | 9.82 | 7045.00 | 6278650.00 | ||
2020-05-18 | 0.00 | 0.16 | 0.97 | 20.96 | 23.85 | 28.37 | 19.48 | 27.81 | 53.94 | 0.00 | 2125.32 | 12238.86 | 204030.30 | 0.00 | 208.03 | 1193.63 | 19970.51 | 0.00 | 4.99 | 9.83 | 4998.00 | 6283648.00 | ||
2020-05-19 | 0.00 | 0.19 | 0.84 | 22.39 | 25.34 | 30.37 | 20.06 | 29.73 | 51.85 | 0.00 | 2497.61 | 10854.50 | 232277.27 | 0.00 | 244.53 | 1059.80 | 22741.18 | 0.00 | 4.63 | 9.83 | 6449.00 | 6290097.00 | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
p2 | xoJJ8DcxJEcupym | 2020-06-13 | 0.00 | 0.22 | 0.93 | 22.20 | 26.12 | 31.91 | 20.88 | 30.39 | 51.01 | 0.00 | 285.87 | 1271.15 | 27443.74 | 0.00 | 279.59 | 1239.95 | 26840.58 | 0.00 | 51.91 | 98.29 | 6632.00 | 209312200.00 |
2020-06-14 | 0.00 | 0.23 | 0.92 | 23.65 | 27.02 | 32.96 | 22.12 | 31.59 | 52.99 | 0.00 | 320.51 | 1362.06 | 30768.78 | 0.00 | 313.43 | 1328.21 | 30088.97 | 0.00 | 51.94 | 99.92 | 7268.00 | 209319687.00 | ||
2020-06-15 | 0.00 | 0.19 | 0.83 | 24.00 | 26.56 | 31.61 | 23.00 | 30.00 | 48.46 | 0.00 | 266.64 | 1288.28 | 25597.24 | 0.00 | 260.93 | 1256.67 | 25049.49 | 0.00 | 52.94 | 98.27 | 7412.67 | 209325949.00 | ||
2020-06-16 | 0.00 | 0.17 | 0.77 | 23.63 | 26.37 | 30.83 | 22.56 | 29.59 | 46.36 | 0.00 | 232.66 | 1124.97 | 22335.69 | 0.00 | 227.82 | 1098.21 | 21870.63 | 0.00 | 50.90 | 98.29 | 6203.20 | 209331425.00 | ||
2020-06-17 | 0.00 | 0.12 | 0.58 | 22.55 | 24.85 | 29.04 | 21.91 | 26.67 | 42.33 | 0.00 | 183.75 | 828.77 | 17640.42 | 0.00 | 179.98 | 810.77 | 17278.51 | 0.00 | 52.90 | 98.32 | 5327.00 | 209335741.00 |
1496 rows × 22 columns
Now we need to pass plant and inverter_id to columns, and leave the date as the index.
df_dia = df_dia.reset_index().set_index('fecha')
df_dia
planta | inverter_id | irradiacion_min | irradiacion_mean | irradiacion_max | t_ambiente_min | t_ambiente_mean | t_ambiente_max | t_modulo_min | t_modulo_mean | t_modulo_max | kw_dc_min | kw_dc_mean | kw_dc_max | kw_dc_sum | kw_ac_min | kw_ac_mean | kw_ac_max | kw_ac_sum | eficiencia_min | eficiencia_mean | eficiencia_max | kw_dia_max | kw_total_max | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | ||||||||||||||||||||||||
2020-05-15 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.20 | 0.89 | 22.04 | 27.43 | 34.43 | 20.29 | 32.58 | 55.03 | 0.00 | 2530.55 | 10642.75 | 235340.70 | 0.00 | 247.81 | 1039.35 | 23046.55 | 0.00 | 5.26 | 9.82 | 5754.00 | 6265313.00 |
2020-05-16 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.21 | 0.81 | 21.50 | 26.78 | 32.52 | 19.59 | 31.86 | 54.23 | 0.00 | 2916.25 | 11209.00 | 256629.88 | 0.00 | 285.51 | 1095.29 | 25124.49 | 0.00 | 5.56 | 9.83 | 6292.00 | 6271605.00 |
2020-05-17 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.24 | 1.00 | 21.21 | 26.69 | 35.25 | 20.38 | 32.74 | 63.15 | 0.00 | 3000.41 | 11416.43 | 288039.82 | 0.00 | 293.47 | 1114.81 | 28172.85 | 0.00 | 4.99 | 9.82 | 7045.00 | 6278650.00 |
2020-05-18 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.16 | 0.97 | 20.96 | 23.85 | 28.37 | 19.48 | 27.81 | 53.94 | 0.00 | 2125.32 | 12238.86 | 204030.30 | 0.00 | 208.03 | 1193.63 | 19970.51 | 0.00 | 4.99 | 9.83 | 4998.00 | 6283648.00 |
2020-05-19 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.19 | 0.84 | 22.39 | 25.34 | 30.37 | 20.06 | 29.73 | 51.85 | 0.00 | 2497.61 | 10854.50 | 232277.27 | 0.00 | 244.53 | 1059.80 | 22741.18 | 0.00 | 4.63 | 9.83 | 6449.00 | 6290097.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-13 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.22 | 0.93 | 22.20 | 26.12 | 31.91 | 20.88 | 30.39 | 51.01 | 0.00 | 285.87 | 1271.15 | 27443.74 | 0.00 | 279.59 | 1239.95 | 26840.58 | 0.00 | 51.91 | 98.29 | 6632.00 | 209312200.00 |
2020-06-14 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.23 | 0.92 | 23.65 | 27.02 | 32.96 | 22.12 | 31.59 | 52.99 | 0.00 | 320.51 | 1362.06 | 30768.78 | 0.00 | 313.43 | 1328.21 | 30088.97 | 0.00 | 51.94 | 99.92 | 7268.00 | 209319687.00 |
2020-06-15 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.19 | 0.83 | 24.00 | 26.56 | 31.61 | 23.00 | 30.00 | 48.46 | 0.00 | 266.64 | 1288.28 | 25597.24 | 0.00 | 260.93 | 1256.67 | 25049.49 | 0.00 | 52.94 | 98.27 | 7412.67 | 209325949.00 |
2020-06-16 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.17 | 0.77 | 23.63 | 26.37 | 30.83 | 22.56 | 29.59 | 46.36 | 0.00 | 232.66 | 1124.97 | 22335.69 | 0.00 | 227.82 | 1098.21 | 21870.63 | 0.00 | 50.90 | 98.29 | 6203.20 | 209331425.00 |
2020-06-17 | p2 | xoJJ8DcxJEcupym | 0.00 | 0.12 | 0.58 | 22.55 | 24.85 | 29.04 | 21.91 | 26.67 | 42.33 | 0.00 | 183.75 | 828.77 | 17640.42 | 0.00 | 179.98 | 810.77 | 17278.51 | 0.00 | 52.90 | 98.32 | 5327.00 | 209335741.00 |
1496 rows × 24 columns
We already have our hourly and daily datasets ready.
We keep them.
df.to_pickle('df.pickle')
df_dia.to_pickle('df_dia.pickle')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#Automcompletar rápido
%config IPCompleter.greedy=True
#Formato de display
pd.options.display.float_format = '{:15.2f}'.format
#Formato de graficos
import fiser_tools as fs
fs.misc.dark_theme()
df = pd.read_pickle('df.pickle')
df.head()
planta | mes | dia | hora | minuto | sensor_id | irradiacion | t_ambiente | t_modulo | inverter_id | kw_dc | kw_ac | eficiencia | kw_dia | kw_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||||||||
2020-05-15 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 0.00 | 6259559.00 |
2020-05-15 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 0.00 | 6183645.00 |
2020-05-15 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 0.00 | 6987759.00 |
2020-05-15 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 0.00 | 7602960.00 |
2020-05-15 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 0.00 | 7158964.00 |
df_dia = pd.read_pickle('df_dia.pickle')
df_dia.head()
planta | inverter_id | irradiacion_min | irradiacion_mean | irradiacion_max | t_ambiente_min | t_ambiente_mean | t_ambiente_max | t_modulo_min | t_modulo_mean | ... | kw_dc_sum | kw_ac_min | kw_ac_mean | kw_ac_max | kw_ac_sum | eficiencia_min | eficiencia_mean | eficiencia_max | kw_dia_max | kw_total_max | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||||||||||||||
2020-05-15 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.20 | 0.89 | 22.04 | 27.43 | 34.43 | 20.29 | 32.58 | ... | 235340.70 | 0.00 | 247.81 | 1039.35 | 23046.55 | 0.00 | 5.26 | 9.82 | 5754.00 | 6265313.00 |
2020-05-16 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.21 | 0.81 | 21.50 | 26.78 | 32.52 | 19.59 | 31.86 | ... | 256629.88 | 0.00 | 285.51 | 1095.29 | 25124.49 | 0.00 | 5.56 | 9.83 | 6292.00 | 6271605.00 |
2020-05-17 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.24 | 1.00 | 21.21 | 26.69 | 35.25 | 20.38 | 32.74 | ... | 288039.82 | 0.00 | 293.47 | 1114.81 | 28172.85 | 0.00 | 4.99 | 9.82 | 7045.00 | 6278650.00 |
2020-05-18 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.16 | 0.97 | 20.96 | 23.85 | 28.37 | 19.48 | 27.81 | ... | 204030.30 | 0.00 | 208.03 | 1193.63 | 19970.51 | 0.00 | 4.99 | 9.83 | 4998.00 | 6283648.00 |
2020-05-19 | p1 | 1BY6WEcLGh8j5v7 | 0.00 | 0.19 | 0.84 | 22.39 | 25.34 | 30.37 | 20.06 | 29.73 | ... | 232277.27 | 0.00 | 244.53 | 1059.80 | 22741.18 | 0.00 | 4.63 | 9.83 | 6449.00 | 6290097.00 |
5 rows × 24 columns
The first lever is the reception of solar energy.
We have 3 kpis with which to measure this lever: incoming irradiation, ambient temperature, and module temperature.
These kpis are measured with a single sensor per plant, so the data is the same for all inverters.
We need to understand how these variables work with each other before moving on to see how they interact with the next level.
Since the inverter does not matter and we only need those 3 variables, we are going to create a smaller dataset with only one inverter from each plant to work on.
df
planta | mes | dia | hora | minuto | sensor_id | irradiacion | t_ambiente | t_modulo | inverter_id | kw_dc | kw_ac | eficiencia | kw_dia | kw_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||||||||
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 0.00 | 6259559.00 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 0.00 | 6183645.00 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 0.00 | 6987759.00 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 0.00 | 7602960.00 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 0.00 | 7158964.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 0.00 | 4157.00 | 520758.00 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 0.00 | 3931.00 | 121131356.00 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 0.00 | 4322.00 | 2427691.00 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | xMbIugepa2P7lBB | 0.00 | 0.00 | 0.00 | 4218.00 | 106896394.00 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 0.00 | 4316.00 | 209335741.00 |
136472 rows × 15 columns
recepcion = df.loc[(df.inverter_id == '1BY6WEcLGh8j5v7') | (df.inverter_id == 'q49J1IKaHRwDQnt'), 'planta':'t_modulo']
recepcion
planta | mes | dia | hora | minuto | sensor_id | irradiacion | t_ambiente | t_modulo | |
---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 |
2020-05-15 00:15:00 | p1 | 5 | 15 | 0 | 15 | HmiyD2TTLFNqkNe | 0.00 | 25.08 | 22.76 |
2020-05-15 00:30:00 | p1 | 5 | 15 | 0 | 30 | HmiyD2TTLFNqkNe | 0.00 | 24.94 | 22.59 |
2020-05-15 00:45:00 | p1 | 5 | 15 | 0 | 45 | HmiyD2TTLFNqkNe | 0.00 | 24.85 | 22.36 |
2020-05-15 01:00:00 | p1 | 5 | 15 | 1 | 0 | HmiyD2TTLFNqkNe | 0.00 | 24.62 | 22.17 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-17 22:45:00 | p2 | 6 | 17 | 22 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.51 | 22.86 |
2020-06-17 23:00:00 | p2 | 6 | 17 | 23 | 0 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.48 | 22.74 |
2020-06-17 23:15:00 | p2 | 6 | 17 | 23 | 15 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.35 | 22.49 |
2020-06-17 23:30:00 | p2 | 6 | 17 | 23 | 30 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.29 | 22.37 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 |
6413 rows × 9 columns
temp = recepcion.groupby('planta').agg({'irradiacion':sum,'t_ambiente':np.mean,'t_modulo':np.mean})
temp
irradiacion | t_ambiente | t_modulo | |
---|---|---|---|
planta | |||
p1 | 726.40 | 25.56 | 31.18 |
p2 | 758.49 | 28.07 | 32.77 |
f, ax = plt.subplots(nrows=1, ncols=3, figsize = (18,5))
ax[0].bar(temp.index, temp.irradiacion, color = ['red','blue'], alpha = 0.3)
ax[1].bar(temp.index, temp.t_ambiente, color = ['red','blue'], alpha = 0.3)
ax[2].bar(temp.index, temp.t_modulo, color = ['red','blue'], alpha = 0.3)
ax[0].set_title('Irradiación por planta')
ax[1].set_title('Temperatura ambiente por planta')
ax[2].set_title('Temperatura módulo por planta');
Conclusions:
temp = recepcion.loc[:,['planta','irradiacion','t_ambiente','t_modulo']]
temp
planta | irradiacion | t_ambiente | t_modulo | |
---|---|---|---|---|
fecha | ||||
2020-05-15 00:00:00 | p1 | 0.00 | 25.18 | 22.86 |
2020-05-15 00:15:00 | p1 | 0.00 | 25.08 | 22.76 |
2020-05-15 00:30:00 | p1 | 0.00 | 24.94 | 22.59 |
2020-05-15 00:45:00 | p1 | 0.00 | 24.85 | 22.36 |
2020-05-15 01:00:00 | p1 | 0.00 | 24.62 | 22.17 |
... | ... | ... | ... | ... |
2020-06-17 22:45:00 | p2 | 0.00 | 23.51 | 22.86 |
2020-06-17 23:00:00 | p2 | 0.00 | 23.48 | 22.74 |
2020-06-17 23:15:00 | p2 | 0.00 | 23.35 | 22.49 |
2020-06-17 23:30:00 | p2 | 0.00 | 23.29 | 22.37 |
2020-06-17 23:45:00 | p2 | 0.00 | 23.20 | 22.54 |
6413 rows × 4 columns
sns.heatmap(temp.corr(), annot=True);
sns.pairplot(temp.reset_index(), hue = 'planta', height=3, plot_kws={'alpha': 0.1});
Conclusions:
temp = pd.crosstab(recepcion.hora,recepcion.planta,values = recepcion.irradiacion,aggfunc='mean')
temp
planta | p1 | p2 |
---|---|---|
hora | ||
0 | 0.00 | 0.00 |
1 | 0.00 | 0.00 |
2 | 0.00 | 0.00 |
3 | 0.00 | 0.00 |
4 | 0.00 | 0.00 |
5 | 0.00 | 0.00 |
6 | 0.04 | 0.04 |
7 | 0.19 | 0.19 |
8 | 0.37 | 0.39 |
9 | 0.53 | 0.57 |
10 | 0.64 | 0.69 |
11 | 0.73 | 0.76 |
12 | 0.74 | 0.79 |
13 | 0.69 | 0.69 |
14 | 0.58 | 0.60 |
15 | 0.46 | 0.44 |
16 | 0.29 | 0.28 |
17 | 0.13 | 0.12 |
18 | 0.02 | 0.02 |
19 | 0.00 | 0.00 |
20 | 0.00 | 0.00 |
21 | 0.00 | 0.00 |
22 | 0.00 | 0.00 |
23 | 0.00 | 0.00 |
plt.figure(figsize=(10,10))
sns.heatmap(temp, annot=True, fmt=".2f");
temp = pd.crosstab(recepcion.hora,recepcion.planta,values = recepcion.t_ambiente,aggfunc='mean')
temp
planta | p1 | p2 |
---|---|---|
hora | ||
0 | 22.80 | 25.20 |
1 | 22.63 | 24.82 |
2 | 22.46 | 24.48 |
3 | 22.32 | 24.27 |
4 | 22.17 | 24.07 |
5 | 22.06 | 23.91 |
6 | 22.20 | 24.19 |
7 | 23.34 | 25.48 |
8 | 24.92 | 27.05 |
9 | 26.48 | 28.61 |
10 | 27.65 | 30.17 |
11 | 28.80 | 31.43 |
12 | 29.62 | 32.55 |
13 | 30.10 | 33.01 |
14 | 30.29 | 33.16 |
15 | 29.99 | 32.81 |
16 | 29.38 | 32.44 |
17 | 28.29 | 31.59 |
18 | 26.66 | 30.00 |
19 | 25.36 | 28.61 |
20 | 24.50 | 27.63 |
21 | 23.87 | 26.75 |
22 | 23.26 | 26.06 |
23 | 22.92 | 25.46 |
plt.figure(figsize=(10,10))
sns.heatmap(temp, annot=True, fmt=".1f");
Conclusions:
plt.figure(figsize = (12,8))
sns.scatterplot(data = df, x = df.irradiacion, y = df.kw_dc);
There are 2 clearly different patterns. Could it be the plants?
plt.figure(figsize = (12,8))
sns.scatterplot(data = df, x = df.irradiacion, y = df.kw_dc, hue = 'planta');
Plant number 2 produces much less kW at the same irradiation levels.
But before we had seen that the relationship between dc and ac in plant 1 was strange.
And also that the data for dc and ac did not match those of kw_dia.
There is something strange in the data.
Let's see the relationship between irradiation and kw_dia to see if it gives us light.
plt.figure(figsize = (12,10))
sns.scatterplot(data = df, x = df.irradiacion, y = df.kw_dia, hue = 'planta');
It's very strange. It seems that the relationship is that the more irradiation, the less kW generated. Which doesn't make sense.
It even seems that the kw maximums occur in hours of zero irradiation.
Can you imagine what could be happening?
BEWARE: the variable kw_dia is a CUMULATIVE. That means that it should reach its maximum when the last hour of the day arrives, for example 23:45, where obviously the irradiation is zero.
And not have data until after 7, which is when we see that there is irradiation.
Let's check it out.
df.groupby('hora')[['kw_dia']].mean().plot.bar();
Again something doesn't add up. There is generation between 00 and 06.
And also after 6:00 p.m. it begins to decline, which should not happen if it is accumulated.
Conclusion:
We don't trust these cumulative variables like kw_day and kw_total.
But the truth is that we don't trust the others much either.
In a real situation I would stop the project until I was able to see what happens with the data.
But in order to continue we are going to assume that the dc and ac data are correct.
And under this assumption we will obtain our conclusions.
INSIGHT #1
Plant 2 generates much lower levels of DC even at similar levels of irradiation
We can use the df_dia to plot the global vision of DC generation during the analysis period.
plt.figure(figsize = (10,8))
sns.lineplot(data = df_dia.reset_index(), x = df_dia.reset_index().fecha, y = 'kw_dc_sum', hue = 'planta');
We see that plant 1 has much more variability while plant 2 is much more constant.
But above all we are surprised by the low levels of DC generation on plant 2 compared to 1.
Let's examine the generation of each day to see if we see something strange.
We generate a date variable to be able to add by it.
df['date'] = df.index.date
df
planta | mes | dia | hora | minuto | sensor_id | irradiacion | t_ambiente | t_modulo | inverter_id | kw_dc | kw_ac | eficiencia | kw_dia | kw_total | date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | ||||||||||||||||
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 1BY6WEcLGh8j5v7 | 0.00 | 0.00 | 0.00 | 0.00 | 6259559.00 | 2020-05-15 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 1IF53ai7Xc0U56Y | 0.00 | 0.00 | 0.00 | 0.00 | 6183645.00 | 2020-05-15 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 3PZuoBAID5Wc2HD | 0.00 | 0.00 | 0.00 | 0.00 | 6987759.00 | 2020-05-15 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | 7JYdWkrLSPkdwr4 | 0.00 | 0.00 | 0.00 | 0.00 | 7602960.00 | 2020-05-15 |
2020-05-15 00:00:00 | p1 | 5 | 15 | 0 | 0 | HmiyD2TTLFNqkNe | 0.00 | 25.18 | 22.86 | McdE0feGgRqW7Ca | 0.00 | 0.00 | 0.00 | 0.00 | 7158964.00 | 2020-05-15 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 0.00 | 4157.00 | 520758.00 | 2020-06-17 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 0.00 | 3931.00 | 121131356.00 | 2020-06-17 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 0.00 | 4322.00 | 2427691.00 | 2020-06-17 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | xMbIugepa2P7lBB | 0.00 | 0.00 | 0.00 | 4218.00 | 106896394.00 | 2020-06-17 |
2020-06-17 23:45:00 | p2 | 6 | 17 | 23 | 45 | iq8k7ZNt4Mwm3w0 | 0.00 | 23.20 | 22.54 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 0.00 | 4316.00 | 209335741.00 | 2020-06-17 |
136472 rows × 16 columns
We create a temporal dataframe to analyze the hourly DC generation on each day at plant 1.
dc_constante_p1 = df[df.planta == 'p1'].groupby(['planta','date','hora']).kw_dc.sum()
dc_constante_p1
planta date hora
p1 2020-05-15 0 0.00
1 0.00
2 0.00
3 0.00
4 0.00
...
2020-06-17 19 0.00
20 0.00
21 0.00
22 0.00
23 0.00
Name: kw_dc, Length: 796, dtype: float64
We create a temporal dataframe to analyze the hourly DC generation on each day at plant 1.
dc_constante_p1.unstack(level = 1).plot(subplots = True, layout = (17,2), sharex=True, figsize=(20,30));
Conclusions:
We repeated the analysis on plant 2
dc_constante_p2 = df[df.planta == 'p2'].groupby(['planta','date','hora']).kw_dc.sum()
dc_constante_p2
planta date hora
p2 2020-05-15 0 0.00
1 0.00
2 0.00
3 0.00
4 0.00
...
2020-06-17 19 0.00
20 0.00
21 0.00
22 0.00
23 0.00
Name: kw_dc, Length: 816, dtype: float64
We are going to pass date to columns, to be able to represent each column (which are the dates) as a variable and therefore as an independent graph.
dc_constante_p2.unstack(level = 1).plot(subplots = True, layout = (17,2), sharex=True, figsize=(20,30));
Conclusions:
INSIGHT #2: The low levels of plant 2 are constant and have daily curves that seem normal.
sns.scatterplot(data = df, x = df.kw_dc, y = df.kw_ac, hue = df.planta);
Again the patterns are very clear: plant 2 transforms the current much more efficiently.
We are going to expand by analyzing the efficiency variable that we had created.
temp = df.groupby(['planta','hora'],as_index = False).eficiencia.mean()
temp
planta | hora | eficiencia | |
---|---|---|---|
0 | p1 | 0 | 0.00 |
1 | p1 | 1 | 0.00 |
2 | p1 | 2 | 0.00 |
3 | p1 | 3 | 0.00 |
4 | p1 | 4 | 0.00 |
5 | p1 | 5 | 0.00 |
6 | p1 | 6 | 9.20 |
7 | p1 | 7 | 9.80 |
8 | p1 | 8 | 9.81 |
9 | p1 | 9 | 9.78 |
10 | p1 | 10 | 9.77 |
11 | p1 | 11 | 9.75 |
12 | p1 | 12 | 9.71 |
13 | p1 | 13 | 9.68 |
14 | p1 | 14 | 9.76 |
15 | p1 | 15 | 9.79 |
16 | p1 | 16 | 9.81 |
17 | p1 | 17 | 9.76 |
18 | p1 | 18 | 6.53 |
19 | p1 | 19 | 0.00 |
20 | p1 | 20 | 0.00 |
21 | p1 | 21 | 0.00 |
22 | p1 | 22 | 0.00 |
23 | p1 | 23 | 0.00 |
24 | p2 | 0 | 0.00 |
25 | p2 | 1 | 0.00 |
26 | p2 | 2 | 0.00 |
27 | p2 | 3 | 0.00 |
28 | p2 | 4 | 0.00 |
29 | p2 | 5 | 0.58 |
30 | p2 | 6 | 92.93 |
31 | p2 | 7 | 97.39 |
32 | p2 | 8 | 97.42 |
33 | p2 | 9 | 91.70 |
34 | p2 | 10 | 76.20 |
35 | p2 | 11 | 68.35 |
36 | p2 | 12 | 68.16 |
37 | p2 | 13 | 73.04 |
38 | p2 | 14 | 83.43 |
39 | p2 | 15 | 95.69 |
40 | p2 | 16 | 95.85 |
41 | p2 | 17 | 95.56 |
42 | p2 | 18 | 73.52 |
43 | p2 | 19 | 0.00 |
44 | p2 | 20 | 0.00 |
45 | p2 | 21 | 0.00 |
46 | p2 | 22 | 0.00 |
47 | p2 | 23 | 0.00 |
sns.lineplot(data = temp, x = 'hora', y = 'eficiencia', hue = 'planta');
INSIGHT #3
Plant 1 has a very low capacity to transform DC to AC, which suggests problems with the inverters
Other conclusions:
We are going to start with the second, comparing the production of DC with that of AC in plant 2.
temp = df[['planta','hora','kw_dc','kw_ac']].melt(id_vars= ['planta','hora'])
temp
planta | hora | variable | value | |
---|---|---|---|---|
0 | p1 | 0 | kw_dc | 0.00 |
1 | p1 | 0 | kw_dc | 0.00 |
2 | p1 | 0 | kw_dc | 0.00 |
3 | p1 | 0 | kw_dc | 0.00 |
4 | p1 | 0 | kw_dc | 0.00 |
... | ... | ... | ... | ... |
272939 | p2 | 23 | kw_ac | 0.00 |
272940 | p2 | 23 | kw_ac | 0.00 |
272941 | p2 | 23 | kw_ac | 0.00 |
272942 | p2 | 23 | kw_ac | 0.00 |
272943 | p2 | 23 | kw_ac | 0.00 |
272944 rows × 4 columns
plt.figure(figsize = (12,8))
sns.lineplot(data = temp[temp.planta == 'p2'], x = 'hora', y = 'value', hue = 'variable', ci = False);
We see that indeed in the central hours there is a loss of efficiency. But nowhere near the level of loss that we had seen in the previous analysis.
We are going to analyze the distribution of efficiency in those hours.
temp = df.between_time('08:00:00','15:00:00')
temp = temp[temp.planta == 'p2']
temp.eficiencia.plot.density();
There is a data set with zero efficiency, and that is what causes the problem. But what is the cause of that zero efficiency?
We are going to select those cases and review them.
temp[temp.kw_dc == 0]
planta | mes | dia | hora | minuto | sensor_id | irradiacion | t_ambiente | t_modulo | inverter_id | kw_dc | kw_ac | eficiencia | kw_dia | kw_total | date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | ||||||||||||||||
2020-05-15 09:45:00 | p2 | 5 | 15 | 9 | 45 | iq8k7ZNt4Mwm3w0 | 0.80 | 31.38 | 45.72 | 81aHJ1q11NBPMrL | 0.00 | 0.00 | 0.00 | 1645.00 | 1215280381.00 | 2020-05-15 |
2020-05-15 09:45:00 | p2 | 5 | 15 | 9 | 45 | iq8k7ZNt4Mwm3w0 | 0.80 | 31.38 | 45.72 | Et9kgGMDl729KT4 | 0.00 | 0.00 | 0.00 | 1541.00 | 1705791.00 | 2020-05-15 |
2020-05-15 09:45:00 | p2 | 5 | 15 | 9 | 45 | iq8k7ZNt4Mwm3w0 | 0.80 | 31.38 | 45.72 | Quc1TzYxW2pYoWX | 0.00 | 0.00 | 0.00 | 1324.00 | 329510409.00 | 2020-05-15 |
2020-05-15 09:45:00 | p2 | 5 | 15 | 9 | 45 | iq8k7ZNt4Mwm3w0 | 0.80 | 31.38 | 45.72 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 0.00 | 1735.00 | 209145328.00 | 2020-05-15 |
2020-05-15 10:00:00 | p2 | 5 | 15 | 10 | 0 | iq8k7ZNt4Mwm3w0 | 0.83 | 31.89 | 46.13 | 81aHJ1q11NBPMrL | 0.00 | 0.00 | 0.00 | 1645.00 | 1215280381.00 | 2020-05-15 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-16 14:30:00 | p2 | 6 | 16 | 14 | 30 | iq8k7ZNt4Mwm3w0 | 0.56 | 30.83 | 44.38 | q49J1IKaHRwDQnt | 0.00 | 0.00 | 0.00 | 0.00 | 34379.33 | 2020-06-16 |
2020-06-16 14:30:00 | p2 | 6 | 16 | 14 | 30 | iq8k7ZNt4Mwm3w0 | 0.56 | 30.83 | 44.38 | rrq4fwE8jgrTyWY | 0.00 | 0.00 | 0.00 | 0.00 | 8075096.33 | 2020-06-16 |
2020-06-16 14:30:00 | p2 | 6 | 16 | 14 | 30 | iq8k7ZNt4Mwm3w0 | 0.56 | 30.83 | 44.38 | vOuJvMaM2sgwLmb | 0.00 | 0.00 | 0.00 | 0.00 | 173022.79 | 2020-06-16 |
2020-06-16 14:30:00 | p2 | 6 | 16 | 14 | 30 | iq8k7ZNt4Mwm3w0 | 0.56 | 30.83 | 44.38 | xMbIugepa2P7lBB | 0.00 | 0.00 | 0.00 | 0.00 | 7126074.93 | 2020-06-16 |
2020-06-16 14:30:00 | p2 | 6 | 16 | 14 | 30 | iq8k7ZNt4Mwm3w0 | 0.56 | 30.83 | 44.38 | xoJJ8DcxJEcupym | 0.00 | 0.00 | 0.00 | 0.00 | 13955359.33 | 2020-06-16 |
3667 rows × 16 columns
Parece que no es problema del inverter, si no de que en esos momentos no se ha generado DC.
Vamos a poner la condición de que DC > 0 y ver ahí cual es la eficiencia.
temp[temp.kw_dc > 0].eficiencia.plot.density();
Indeed when there is DC the efficiency is higher than 96%.
The question then is why is there no DC? Is there a pattern?
Let's create a DC = 0 flag so we can analyze it.
temp['kw_dc_cero'] = np.where(temp['kw_dc'] == 0, 1, 0)
temp
planta | mes | dia | hora | minuto | sensor_id | irradiacion | t_ambiente | t_modulo | inverter_id | kw_dc | kw_ac | eficiencia | kw_dia | kw_total | date | kw_dc_cero | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||||||||||
2020-05-15 08:00:00 | p2 | 5 | 15 | 8 | 0 | iq8k7ZNt4Mwm3w0 | 0.46 | 27.68 | 35.71 | 4UPUqMRk7TRMgml | 581.05 | 569.41 | 98.00 | 554.00 | 2429565.00 | 2020-05-15 | 0 |
2020-05-15 08:00:00 | p2 | 5 | 15 | 8 | 0 | iq8k7ZNt4Mwm3w0 | 0.46 | 27.68 | 35.71 | 81aHJ1q11NBPMrL | 534.67 | 524.09 | 98.02 | 516.60 | 1215279252.60 | 2020-05-15 | 0 |
2020-05-15 08:00:00 | p2 | 5 | 15 | 8 | 0 | iq8k7ZNt4Mwm3w0 | 0.46 | 27.68 | 35.71 | 9kRcWv60rDACzjR | 568.53 | 557.16 | 98.00 | 551.80 | 2247720128.80 | 2020-05-15 | 0 |
2020-05-15 08:00:00 | p2 | 5 | 15 | 8 | 0 | iq8k7ZNt4Mwm3w0 | 0.46 | 27.68 | 35.71 | Et9kgGMDl729KT4 | 526.24 | 515.83 | 98.02 | 503.00 | 1704753.00 | 2020-05-15 | 0 |
2020-05-15 08:00:00 | p2 | 5 | 15 | 8 | 0 | iq8k7ZNt4Mwm3w0 | 0.46 | 27.68 | 35.71 | IQ2d7wF4YD8zU1Q | 578.17 | 566.58 | 97.99 | 552.79 | 19942078.79 | 2020-05-15 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-17 15:00:00 | p2 | 6 | 17 | 15 | 0 | iq8k7ZNt4Mwm3w0 | 0.36 | 27.23 | 29.35 | q49J1IKaHRwDQnt | 562.34 | 551.13 | 98.01 | 3380.87 | 519981.87 | 2020-06-17 | 0 |
2020-06-17 15:00:00 | p2 | 6 | 17 | 15 | 0 | iq8k7ZNt4Mwm3w0 | 0.36 | 27.23 | 29.35 | rrq4fwE8jgrTyWY | 534.27 | 523.71 | 98.02 | 3190.40 | 121130615.40 | 2020-06-17 | 0 |
2020-06-17 15:00:00 | p2 | 6 | 17 | 15 | 0 | iq8k7ZNt4Mwm3w0 | 0.36 | 27.23 | 29.35 | vOuJvMaM2sgwLmb | 578.85 | 567.15 | 97.98 | 3534.07 | 2426903.07 | 2020-06-17 | 0 |
2020-06-17 15:00:00 | p2 | 6 | 17 | 15 | 0 | iq8k7ZNt4Mwm3w0 | 0.36 | 27.23 | 29.35 | xMbIugepa2P7lBB | 568.86 | 557.49 | 98.00 | 3433.50 | 106895609.50 | 2020-06-17 | 0 |
2020-06-17 15:00:00 | p2 | 6 | 17 | 15 | 0 | iq8k7ZNt4Mwm3w0 | 0.36 | 27.23 | 29.35 | xoJJ8DcxJEcupym | 578.49 | 566.90 | 98.00 | 3534.67 | 209334959.67 | 2020-06-17 | 0 |
20376 rows × 17 columns
We will start with the numerical variables.
temp.groupby('kw_dc_cero')[['irradiacion','t_ambiente','t_modulo']].mean()
irradiacion | t_ambiente | t_modulo | |
---|---|---|---|
kw_dc_cero | |||
0 | 0.59 | 30.48 | 44.32 |
1 | 0.80 | 32.48 | 52.03 |
At room temperature there is not much difference, but at module temperature and irradiation yes.
Could it be that if it gets too hot the module stops generating DC?
Let's see it by comparing the module temperature with the DC generation.
sns.scatterplot(data = temp, x = 't_modulo', y = 'kw_dc',hue = 'kw_dc_cero');
The previous hypothesis is not confirmed, since there are many cases of high temperatures where DC is generated, and also of kw_dc equal to zero in almost all temperature ranges.
We are now going to analyze the categorical ones, starting with the inverter.
temp.groupby('inverter_id').kw_dc_cero.mean().sort_values(ascending = False).plot.bar();
There is a big difference in the percentage of zero DC production per inverter.
From some that have less than 5% to some that exceed 30%.
INSIGHT #4:: On plant 2 there are several inverters that are not getting enough DC production, and therefore whose modules need revision.
We are going to analyze the inverters from the point of view of the average efficiency to see if there are "good and bad".
temp[temp.kw_dc > 0].groupby(['inverter_id','date'],as_index = False).eficiencia.mean().boxplot(column = 'eficiencia', by = 'inverter_id', figsize = (14,10))
plt.xticks(rotation = 90);
INSIGHT #5:: Once discounting the problem of not generating DC, the inverters of plant 2 do work well and do the job of transformation to AC well.
To finish analyzing the efficiency of the inverters, we can see their performance on each of the days to see if there may have been specific problems.
temp[temp.kw_dc > 0].groupby(['inverter_id','date']).eficiencia.mean().unstack(level = 0).plot(subplots = True, sharex=True, figsize=(20,40))
plt.xticks(rotation = 90);
To have a term of comparison we are going to repeat the analyzes with plant 1.
temp = df.between_time('08:00:00','15:00:00')
temp = temp[temp.planta == 'p1']
temp['kw_dc_cero'] = np.where(temp['kw_dc'] == 0, 1, 0)
temp
planta | mes | dia | hora | minuto | sensor_id | irradiacion | t_ambiente | t_modulo | inverter_id | kw_dc | kw_ac | eficiencia | kw_dia | kw_total | date | kw_dc_cero | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||||||||||
2020-05-15 08:00:00 | p1 | 5 | 15 | 8 | 0 | HmiyD2TTLFNqkNe | 0.20 | 25.42 | 31.41 | 1BY6WEcLGh8j5v7 | 3246.00 | 318.67 | 9.82 | 263.57 | 6259822.57 | 2020-05-15 | 0 |
2020-05-15 08:00:00 | p1 | 5 | 15 | 8 | 0 | HmiyD2TTLFNqkNe | 0.20 | 25.42 | 31.41 | 1IF53ai7Xc0U56Y | 2805.62 | 275.46 | 9.82 | 292.50 | 6183937.50 | 2020-05-15 | 0 |
2020-05-15 08:00:00 | p1 | 5 | 15 | 8 | 0 | HmiyD2TTLFNqkNe | 0.20 | 25.42 | 31.41 | 3PZuoBAID5Wc2HD | 2736.12 | 268.62 | 9.82 | 287.75 | 6988046.75 | 2020-05-15 | 0 |
2020-05-15 08:00:00 | p1 | 5 | 15 | 8 | 0 | HmiyD2TTLFNqkNe | 0.20 | 25.42 | 31.41 | 7JYdWkrLSPkdwr4 | 2741.50 | 269.15 | 9.82 | 281.00 | 7603241.00 | 2020-05-15 | 0 |
2020-05-15 08:00:00 | p1 | 5 | 15 | 8 | 0 | HmiyD2TTLFNqkNe | 0.20 | 25.42 | 31.41 | McdE0feGgRqW7Ca | 3208.88 | 315.05 | 9.82 | 291.00 | 7159255.00 | 2020-05-15 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-06-17 15:00:00 | p1 | 6 | 17 | 15 | 0 | HmiyD2TTLFNqkNe | 0.32 | 28.62 | 39.95 | uHbuxQJl8lW7ozc | 4719.62 | 462.89 | 9.81 | 5423.50 | 7286458.50 | 2020-06-17 | 0 |
2020-06-17 15:00:00 | p1 | 6 | 17 | 15 | 0 | HmiyD2TTLFNqkNe | 0.32 | 28.62 | 39.95 | wCURE6d3bPkepu2 | 5077.75 | 497.80 | 9.80 | 5343.75 | 7028061.75 | 2020-06-17 | 0 |
2020-06-17 15:00:00 | p1 | 6 | 17 | 15 | 0 | HmiyD2TTLFNqkNe | 0.32 | 28.62 | 39.95 | z9Y9gH1T5YWrNuG | 5113.00 | 501.29 | 9.80 | 5282.88 | 7250667.88 | 2020-06-17 | 0 |
2020-06-17 15:00:00 | p1 | 6 | 17 | 15 | 0 | HmiyD2TTLFNqkNe | 0.32 | 28.62 | 39.95 | zBIq5rxdHJRwDNY | 4675.38 | 458.61 | 9.81 | 5284.75 | 6582836.75 | 2020-06-17 | 0 |
2020-06-17 15:00:00 | p1 | 6 | 17 | 15 | 0 | HmiyD2TTLFNqkNe | 0.32 | 28.62 | 39.95 | zVJPv84UY57bAof | 4853.75 | 475.96 | 9.81 | 5368.75 | 7362730.75 | 2020-06-17 | 0 |
21450 rows × 17 columns
temp.eficiencia.plot.density();
We see that no, here all the inverters have a constant efficiency (although very low)
temp.groupby(['inverter_id','date'],as_index = False).eficiencia.mean().boxplot(column = 'eficiencia', by = 'inverter_id', figsize = (14,10))
plt.xticks(rotation = 90);
We see that except for specific days in some inverters, in the rest the efficiency is constant.
We are going to review the average daily efficiency for each inverter.
temp.groupby(['inverter_id','date']).eficiencia.mean().unstack(level = 0).plot(subplots = True, sharex=True, figsize=(20,40))
plt.xticks(rotation = 90);
In the inverter analysis we see again that all the data are constant.
We are going to verify that then there are no failures in the generation of DC.
temp.groupby('inverter_id').kw_dc_cero.mean().sort_values(ascending = False).plot.bar();
We see that although there are some inverters that have had failures, their magnitude is less than 2% of the measurements.
Therefore the generation of DC in plant 1 is correct, and the failure is in the transformation from DC to AC.
After an analysis of the data we can conclude that:
Recommendations: