Once More, with Data Cleaning
Posted on October 17, 2018 in articles
I completed Georgetown University's Data Science Certificate Program in 2017, and now that my Python and machine learning skills have improved, I'd like to look at the data with fresh eyes and try to build better models.
While my initial classification models had F1 scores from 0.81 to 1.0, the R² scores from my regression models were always quite low, and even negative in some cases. It was an intensive program, and I was working full-time, so I'm certain I missed things and could have done better. But to paraphrase George Eliot, it's never too late for your machine learning models to be what they might have been!
Classroom Occupancy: An Introduction¶
I worked on this Capstone project with four other students, and my primary role was cleaning the data and building the models. Our project's goal was to create a web application that incorporated supervised machine learning models to predict a room's occupancy level (or, when using regression models, the number of occupants) based on real-time sensor data. We used a Raspberry Pi 3 and other devices to capture sensor data from our classroom, including CO₂, temperature, humidity, images, sound, light, and Bluetooth devices. We also placed a sensor on the room's only door to capture each time it opened and closed. One of my team members also created a Flask app, but I'll save that for later.
To start, I'll ingest the data and clean each sensor separately, since this will make it easier to explore the data and to identify potential outliers. So let's begin!
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.dates as md
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from yellowbrick.target import ClassBalance
sns.set()
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
EDA Functions¶
def plotBoxplots(df, column, title):
    """Plot boxplots for each day of data."""
    # Create dataframe with a daily PeriodIndex
    df_period = df.to_period(freq='D')
    # Boxplots for daily sensor data
    _, ax = plt.subplots(figsize=(14,10))
    sns.boxplot(
        x=df_period.index, y=column, data=df_period,
        ax=ax, palette=sns.color_palette('RdBu', 10))
    ax.set_title(title, fontsize=22)
    ax.set_xlabel('Date', fontsize=14)
    ax.set_ylabel(title, fontsize=14)
    # Create list of dates for the x-tick labels
    labels = [
        'March 25', 'April 1', 'April 8', 'April 22', 'April 29',
        'May 5', 'May 6', 'May 12', 'May 13', 'June 3', 'June 10']
    ax.set_xticklabels(labels, size=12)  # Set x-tick labels
    plt.show()  # Show plot
def plotHistogram(df, column, title):
    """Plot a histogram of the sensor data."""
    # Calculate the number of histogram bins
    n_data = len(df[column])
    n_bins = int(np.sqrt(n_data))
    _, ax = plt.subplots()
    ax.hist(df[column], bins=n_bins, range=(df[column].min(), df[column].max()))
    ax.set_title(title, size=18)  # Set the plot title
    ax.set_ylabel('Count')  # Set the ylabel
    # Format the x-tick and y-tick labels with thousands separators
    ax.xaxis.set_major_formatter(tkr.FuncFormatter(lambda x, p: format(int(x), ',')))
    ax.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y, p: format(int(y), ',')))
    plt.show()
Data Ingestion¶
The sensor_data.csv file contains 46,275 observations with the following features: two location columns, temperature, humidity, CO₂, light, sound, bluetooth devices, and non-personal bluetooth devices.
# Read csv file in as a Pandas DataFrame with a DateTimeIndex: df
df = pd.read_csv('../data/sensor_data.csv', index_col='datetime', parse_dates=True)
# Rename the columns
df.columns = [
'loc', 'loc2', 'temperature', 'humidity', 'co2', 'light',
'sound', 'bluetooth_devices', 'nonpersonal_bluetooth']
df.info()
When the project first started, our group had debated the idea of gathering data from other classrooms, which is why the dataset contains the first two location columns. However, since it only contains data from our classroom, I can drop these columns.
I'll also drop the nonpersonal_bluetooth feature, which was not added until April 8th, so there are two days of missing data. And, as you can see below, it is highly correlated with bluetooth_devices.
# Jointplot of bluetooth devices and non-personal bluetooth devices
g = sns.jointplot(
    x='bluetooth_devices', y='nonpersonal_bluetooth',
    data=df.dropna(), kind='reg')
# Label the joint axes (plt.xlabel/ylabel would target the marginal axes)
g.set_axis_labels('Bluetooth Devices', 'Non-Personal Bluetooth Devices', fontsize=14)
plt.show()
# Delete the non-personal bluetooth and two location columns
df.drop(['loc', 'loc2', 'nonpersonal_bluetooth'], axis=1, inplace=True)
Dataset Overview¶
Grouping by day shows that we have data for 11 class periods. Each day has a different number of values, since it wasn't possible to begin and end recording at exactly the same time each day. In addition, Friday night classes were only 3 hours long, from 6:30 PM to 9:30 PM EST, while Saturday classes lasted 7 hours, from 9:00 AM to 4:00 PM EST. This is why there are significantly fewer observations on May 5th, May 12th, and June 3rd.
# Use groupby to see which days are captured in the data
df.groupby(df.index.strftime('%D')).count()
Data Wrangling¶
The initial exploratory data analysis showed that several of the sensors had generated multiple error values. Most often, these readings occurred at the beginning and end of the day, when the devices were turned on and off. However, there were also cases when a sensor would restart, either generating error values or creating gaps of missing data.
Some of the error values were easy to identify, such as a temperature reading of -999 °C. Others, however, such as a CO₂ value of 2, required further investigation, since without any domain knowledge I couldn't be certain whether 2 was a low, but accurate, reading.
Temperature Data¶
Type of Sensor: AM2303 (captures both temperature and humidity)
Sensor Range: -40 to 125°C
Reference: OSHA recommends temperature control in the range of 20-24.4°C (68-76°F).
# Create temperature dataframe with a DateTimeIndex
temperature_data = df[['temperature']].copy()
The temperature data has two -999 values. Below, we can see that the first was recorded on April 8th as the first value of the day, while the second occurred on May 6th during the first hour of class. In addition, because the AM2303 sensor captures both temperature and relative humidity, these same -999 errors also show up in the humidity data.
# Summary statistics for the temperature data
temperature_data.describe()
# Show when the first -999 value was recorded
temperature_data['2017-04-08'].head()
# Show when the second -999 temperature value was recorded
temperature_data['2017-05-06 09:22:05':'2017-05-06 09:22:20']
Since even Neptune, the coldest planet in our solar system (sorry Pluto!), has an average temperature of -214°C, the -999 values are obviously errors generated by the sensor, so I'll delete them.
# Delete two -999 values from the temperature data
temperature_data = temperature_data[temperature_data['temperature'] != -999]
# Updated summary statistics for the temperature data
temperature_data.describe()
# Plot boxplots of the daily temperature data
plotBoxplots(temperature_data, 'temperature', 'Temperature °C')
The boxplots show single outliers on certain days, such as March 25th, May 13th, and June 10th. Further investigation found that these values come from the initial sensor reading of each particular day (please see below). While the difference between the first reading and the following values may not seem significant at first, the dataframe's standard deviation is only 1.4, so it is important to address them. Since the values are relatively close, I will delete them here; they will be backfilled later when I concatenate the data back into a single DataFrame.
# Look at the first temperature value from April 22, 2017
temperature_data.loc['2017-04-22'].head()
# Look at the first temperature value from May 13, 2017
temperature_data['2017-05-13'].head()
# Look at the first temperature value from June 10, 2017
temperature_data['2017-06-10'].head()
Since the first temperature values recorded for each class period were outliers generated by the sensor, I'll delete those from the data.
# Get the first temperature reading of each day
first_temps = temperature_data.groupby(temperature_data.index.date).head(1)
# Drop the first temperature value of each day from the data
temperature_data = temperature_data.drop(first_temps.index)
# Updated boxplots of the daily temperature data
plotBoxplots(temperature_data, 'temperature', 'Temperature °C')
# Plot a histogram of the temperature data
plotHistogram(temperature_data, 'temperature', 'Temperature °C')
Resample Temperature Data¶
The temperature data was originally recorded at 5-second intervals, so I'll now resample it to a 1-minute frequency by taking the mean of the values within each minute.
# Resample temperature data
temperature = temperature_data.resample('T').mean().dropna()
# Updated temperature summary statistics
temperature.describe()
Humidity¶
Type of Sensor: AM2303
Sensor Range: 0-100% RH
Reference: OSHA recommends humidity control in the range of 20%-60%.
# Create humidity dataframe with DateTimeIndex: humidity_data
humidity_data = df[['humidity']].copy()
Our sensor captured both temperature and humidity together, which is why the humidity data also contains two -999 values.
# Summary statistics for the humidity data
humidity_data.describe()
# Delete two -999 values
humidity_data = humidity_data[humidity_data['humidity'] != -999]
# Updated humidity statistics
humidity_data.describe()
# Plot a histogram of the humidity data
plotHistogram(humidity_data, 'humidity', 'Humidity Data')
# Plot boxplots of the daily humidity data
plotBoxplots(humidity_data, 'humidity', 'Humidity Data')
# TODO: Look into why the data for April 8 differs so greatly from the other days.
# Resample the humidity data
humidity = humidity_data.resample('T').mean().dropna()
# Updated humidity summary statistics
humidity.describe()
CO₂ Data¶
Type of Sensor: COZIR Ambient GC-0010 Sensor
Sensor Range: 0-2000 parts per million (ppm)
Reference: OSHA recommends keeping indoor CO₂ levels below 1000 ppm.
# Create CO2 dataframe with DateTimeIndex: co2_data
co2_data = df[['co2']].copy()
The summary statistics report a mean of 1236.25 and a standard deviation of 178.46. However, the lowest quartile spans values from a minimum of 2 up to the 25th percentile of 1110.
While the -999 temperature and humidity values were obviously errors, without any domain knowledge, I didn't initially know if a CO₂ value of 2 was caused by a sensor error or was a low, but accurate, reading.
# Summary statistics for the CO2 data
co2_data.describe()
The sensor randomly generated thirty-three error values of 2. While these did not occur only when the sensor was turned on, several did appear at startup, as you can see below.
# Identify how many 2 values are in the data
len(co2_data[co2_data['co2'] == 2])
# Look at the first CO2 value from April 8, 2017
co2_data.loc['2017-04-08'].head()
# Look at the first CO2 value from June 10, 2017
co2_data.loc['2017-06-10'].head()
# Delete 2 values
co2_data = co2_data[co2_data['co2'] != 2]
# Updated CO2 statistics
co2_data.describe()
On April 1, 2017, the CO₂ sensor readings suddenly spiked to 2001 ppm, as you can see below, even though the sensor's stated range is only 0-2000 ppm.
# Create figure and axes
fig, ax = plt.subplots(figsize=(16,8))
# Plot spike in CO2 level on April 1, 2017
ax.plot(co2_data.loc['April 1, 2017'])
# Add title and labels
ax.set_title('Spike in CO2 Level on April 1, 2017', fontsize=18)
ax.set_ylabel('CO2 Level (ppm)', fontsize=16, weight='bold')
ax.set_xlabel('Time of Day', fontsize=16)
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M %p'))
ax.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y, p: format(int(y), ',')))
# Create inset
ax = plt.axes([.58, .55, .3, .3], facecolor='w')
# Plot inset showing the spike in CO2 values
ax.plot(co2_data['co2'].loc['2017-04-01 09:55:00':'2017-04-01 10:38:00'].index,
co2_data['co2'].loc['2017-04-01 09:55:00':'2017-04-01 10:38:00'], 'g', linewidth=2.0)
# Format inset ticks
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M'))
ax.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y, p: format(int(y), ',')))
# Show plot
plt.show()
To remove the error values caused by the spike, I decided to delete the values above 1628. I chose that cutoff by looking at the maximum values for the other days, which had no similar spikes; the highest among them was 1628, recorded on June 10th.
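For reference, here's one way to inspect the daily maxima directly (a quick sketch of that check; any equivalent inspection works):
# Check the maximum CO2 reading recorded on each day;
# every day except April 1 peaks at or below 1628 ppm
co2_data.groupby(co2_data.index.date)['co2'].max()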
# Delete error CO2 values above 1628
co2_data = co2_data[co2_data['co2'] <= 1628]
# Plot a histogram of the updated CO2 data
plotHistogram(co2_data, 'co2', 'CO2 Data')
# Plot boxplots of the daily CO2 data
plotBoxplots(co2_data, 'co2', 'CO2 Data')
# Resample CO2 data to per-minute interval
co2 = co2_data.resample('T').mean().dropna()
Sound Data¶
Type of Sensor: Electret Microphone Amplifier MAX4466
Sensor Range: 0 to 20 kHz
Reference: Human speech frequencies are in the range of 500 Hz to 4,000 Hz. A young person with normal hearing can hear frequencies between approximately 20 Hz and 20,000 Hz.
# Create noise dataframe with DateTimeIndex: sound_data
sound_data = df[['sound']].copy()
# Summary statistics for the sound data
sound_data.describe()
# Plot a histogram of the sound data
plotHistogram(sound_data, 'sound', 'Sound Level (Hz)')
# Plot Boxplots for daily sound data
plotBoxplots(sound_data, 'sound', 'Sound Data (Hz)')
# ToDo: explore sound outliers
# Resample sound data
sound = sound_data.resample('T').mean().dropna()
Light Data¶
Type of Sensor: Photoresistor GL5537
Reference: Illuminance is measured in foot-candles or, in the metric SI system, lux (lumens per square meter). GSA recommends a nominal illumination level of 300 lux for conference rooms, or 500 lux in workstation space, open and closed offices, and training rooms.
For Reference: The sensor has a light resistance of 30 to 50 kΩ at 10 lux.
# Create light dataframe with DateTimeIndex: light_data
light_data = df[['light']].copy()
# Summary statistics for the light data
light_data.describe()
The light sensor generated several large error values or 0 readings when it would restart. In addition, it also generated error values at the end of the day when it was turned off. In the following plot of light data on May 5th, the light values never went over 400 lux until the final seconds of the day.
# Identify the high error values
light_max = light_data[light_data['light'] > 4000]
light_max['light']
While a light value of 0 is possible, it's unlikely since even with the classroom lights turned off, there still would have been light from the hallway. In addition, I concluded that these 0 values were errors since they were isolated readings, as can be seen below.
# Look at 0 light reading on March 25, 2017
light_data.light['March 25, 2017 11:48:20':'March 25, 2017 11:49:00']
# Plot light data for May 5, 2017
fig, ax = plt.subplots()
ax.plot(light_data.loc['May 5, 2017 21:20:00':'May 5, 2017 22:00:00'])
ax.set_title('Light Data: May 5, 2017', fontsize=16)
ax.set_xlabel('Time of Day', fontsize=14)
ax.set_ylabel('Light, lux', fontsize=14)
ax.tick_params(labelsize=12)
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M %p'))
plt.show()
# Identify how many 0 light values are in the data
len(light_data[light_data['light'] == 0])
# Delete error 0 light values
light_data = light_data[light_data['light'] != 0]
# Delete error light values
light_data = light_data[light_data['light'] < 4000]
# Plot updated histogram of light data
plotHistogram(light_data, 'light', 'Light Level (Lux)')
# Plot updated box plots for light data
plotBoxplots(light_data, 'light', 'Light Level (Lux)')
# ToDo: look into the remaining outliers
# Resample light data
light = light_data.resample('T').mean().dropna()
Bluetooth Devices¶
The Raspberry Pi 3 Model B comes with built-in Wi-Fi and Bluetooth Low Energy (BLE), so we were able to scan for the number of nearby Bluetooth devices. Since we were not able to limit the scanned area to our classroom alone, the sensor also picked up devices from other classrooms. However, since we were in the same classroom each day, we still thought it could be a predictive variable.
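For context, a scan like this could be done with the PyBluez library. The following is only a sketch under that assumption; our actual collection script ran on the Raspberry Pi and isn't part of this notebook:
# Hypothetical sketch: count nearby Bluetooth devices with PyBluez
# (assumes the pybluez package; not the actual collection script)
import bluetooth

# Scan for roughly 8 seconds and count the discovered device addresses
nearby_devices = bluetooth.discover_devices(duration=8, lookup_names=False)
print(len(nearby_devices))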
# Create bluetooth devices dataframe with DateTimeIndex: bluetooth_data
bluetooth_data = df[['bluetooth_devices']].copy()
# Summary statistics for the bluetooth data
bluetooth_data.describe()
Looking at the summary statistics for bluetooth_devices, I was surprised that the minimum value was zero, since each of the students setting up the sensors had multiple devices emitting Bluetooth signals.
When I looked into the data further, I found that, as with several other sensors, the first value recorded each day was incorrect. Here, on May 5th, we can see that the first value is zero, then immediately jumps to 30 and continues to rise.
# Look at the first values recorded on May 5th
bluetooth_data['bluetooth_devices']['May 5, 2017'].head(10)
Unfortunately, unlike the other sensors, I found that when the bluetooth sensor was first started, or when it randomly restarted, it would take a significant amount of time for the count to build back up. This can be seen by plotting the bluetooth_devices values for April 1, 2017. All through the morning, the number of devices gradually increases, until the sensor suddenly restarts and the count drops to zero, then begins slowly rising again.
# Plot the bluetooth values for April 1, 2017
fig, ax = plt.subplots(figsize=(12,6))
bluetooth_data['bluetooth_devices']['2017-04-01'].plot()
# Add a title and labels
ax.set_title('Bluetooth Values for April 1, 2017', fontsize=20)
ax.set_xlabel('Time of Day', fontsize=14)
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M %p'))
ax.set_ylabel('No. of Bluetooth Devices', fontsize=14)
# Show the plot
plt.show()
Another serious issue I discovered was that there were long periods of time when the sensor recorded only zeros. For example, on April 29th, only zero values were recorded for almost an hour and a half.
# Plot the bluetooth values for April 29, 2017
fig, ax = plt.subplots(figsize=(12,6))
bluetooth_data['bluetooth_devices']['2017-04-29'].plot()
# Add a title and labels
ax.set_title('Bluetooth Values for April 29, 2017', fontsize=20)
ax.set_xlabel('Time of Day', fontsize=14)
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M %p'))
ax.set_ylabel('No. of Bluetooth Devices', fontsize=14)
# Show the plot
plt.show()
Unfortunately, given these issues, I won't be able to use the bluetooth data that we collected. At first I had considered different methods to fill the missing data and zero values. However, after plotting the data, I don't believe that the sensor accurately captured the number of devices.
I know from my own experience on campus that a large number of students and professors arrive in the morning, leave for an hour at lunch, and then stay in the building until the end of the day. However, the plots show a small number of devices each morning and then a gradual increase throughout the day, with no decline during lunchtime. Our building had only one small coffee shop selling a limited selection of drinks and snacks, so to get lunch you had to leave the building. For these reasons, I won't be able to use this feature to build my predictive models.
Images¶
Type of Sensor: Raspberry Pi Camera Module v2
Initially, to capture the ground truth for our dataset, our team set up an 8-megapixel camera with a wide-angle lens to take pictures of the classroom at one-minute intervals. Unfortunately, due to time constraints, we were not able to use them for that purpose. Furthermore, when we asked our fellow students at the beginning of the project for permission to take pictures, we promised them that the pictures would be deleted at the end of the semester, so they are no longer available.
However, our team was still able to capture data from the images by calculating the "closeness" based on the root-mean-square error (RMSE) between successive images. For identical images, the difference between them would be zero, so higher values indicate a greater divergence.
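To make the metric concrete, here's a minimal sketch of the RMSE calculation between two frames, assuming the images have already been loaded as equal-sized grayscale NumPy arrays (the actual image pipeline isn't part of this notebook):
# Sketch: RMSE between two grayscale frames (hypothetical arrays)
def image_rmse(frame_a, frame_b):
    """Return the root-mean-square error between two same-sized images."""
    diff = frame_a.astype(float) - frame_b.astype(float)
    return np.sqrt(np.mean(diff ** 2))

# Identical frames yield 0.0; larger values indicate greater change
frame_a = np.zeros((480, 640))
frame_b = np.full((480, 640), 10.0)
print(image_rmse(frame_a, frame_a))  # 0.0
print(image_rmse(frame_a, frame_b))  # 10.0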
# Create image dataframe with DateTimeIndex: image_data
image_data = pd.read_csv('../data/image_variations.csv', index_col='datetime', parse_dates=True)
# Rename the rolling_rms column to image_rms
image_data.rename({'rolling_rms':'image_rms'}, axis=1, inplace=True)
# Summary statistics for the image data
image_data.describe()
# View the first five rows of the image data
image_data.head()
Since we are calculating the change between successive images, the first value for each class period is zero.
# Identify the 0 values in the image data
image_data[image_data['image_rms'] == 0]
So I'll delete the zero values now and then backfill them when concatenating the image data. In addition, I can delete the control_F_rms and control_L_rms columns now, since they are no longer needed.
# Delete zero values in the image data
image_data = image_data[image_data['image_rms'] != 0]
# Delete the control_F_rms and control_L_rms columns from the image data
image_data.drop(['control_F_rms', 'control_L_rms'], axis=1, inplace=True)
# Plot histogram of daily image data
plotHistogram(image_data, 'image_rms', 'Daily Image Data')
# Plot boxplots for daily image data
plotBoxplots(image_data, 'image_rms', 'Daily Image Data')
# ToDo: investigate outliers
In the following chart, you can see that the largest changes occurred during our class breaks, lunch, and at the end of the day. Typically, our professors gave us a 5-10 minute break in the morning between 10:00 am and 10:30 am, and another in the afternoon between 2:00 pm and 2:30 pm. Our lunch break was an hour long and most often began at 12:00 pm. Class ended at approximately 4:00 pm, but some students typically stayed afterwards to talk with the professor or to meet with their teammates, so we tried to stay until 4:30 pm to capture these changes.
# Create the figure and the axes
fig, ax = plt.subplots(figsize=(16,8))
# Plot image data for May 6, 2017
ax.plot(image_data.loc['May 6, 2017'])
# Add title and labels
ax.set_title('Image Data: May 6, 2017', fontsize=22)
ax.set_xlabel('Time of Day', fontsize=16)
ax.set_ylabel('% Change in Hist', fontsize=16)
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M %p'))
# Show plot
plt.show()
Since the images were taken each minute, I don't need to resample this DataFrame.
Occupancy Data¶
Our team created a Flask web application to log the number of people in the classroom. During breaks, one team member would use it to log each person who entered or exited the room. The app also included an "Empty Room" option that set the counter to 0, which was used most often during the lunch break, when no one from our group was in the classroom to keep track of the occupants, and at the end of the day. While we had originally intended to use the images to determine the total number of students in the classroom, we ultimately had to use the data generated by the Flask app as our models' target variable.
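For readers curious about how such a counter works, here's a minimal sketch of the idea; this is a hypothetical reconstruction, not our team's actual application:
# Hypothetical sketch of an occupancy-counter Flask app (not our actual code)
from flask import Flask

app = Flask(__name__)
occupant_count = 0

@app.route('/enter')
def log_entry():
    """Increment the counter when someone enters the room."""
    global occupant_count
    occupant_count += 1
    return str(occupant_count)

@app.route('/exit')
def log_exit():
    """Decrement the counter when someone leaves the room."""
    global occupant_count
    occupant_count = max(occupant_count - 1, 0)
    return str(occupant_count)

@app.route('/empty')
def empty_room():
    """The 'Empty Room' option: reset the counter to zero."""
    global occupant_count
    occupant_count = 0
    return str(occupant_count)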
# Create occupancy count dataframe with DateTimeIndex: occupancy_data
occupancy_data = pd.read_csv('../data/occupancy_data.csv', index_col='datetime', parse_dates=True)
# Delete the unnecessary columns
occupancy_data.drop(['location', 'count_operation', 'count_change'], axis=1, inplace=True)
# Drop data for March 18, 2017
occupancy_data.drop(occupancy_data.loc['2017-03-18'].index, inplace=True)
# Occupancy summary statistics
occupancy_data.describe()
The occupancy data contains 21 zero values. When I looked into the source of these values, I found that eleven of them were the last values of the day. For example, you can see below that at the end of the class period on March 25th, the number of occupants had started to decline as students left for the day, until suddenly the count dropped from 13 to 0. The reason for this sharp drop is that, since it was the end of the day, one of our team members clicked the "Empty Room" option in the Flask app.
# Identify the zero value at the end of March 25, 2017
occupancy_data['2017-03-25'].tail()
The remaining zero values occurred when the students left for lunch. Our team tried to stay in the room until it was empty, but sometimes a few students would remain when we needed to take our break. You can see this below, as the count starts to dwindle just after noon, until the number of occupants drops from 4 to zero. Because I want to keep the last value when resampling the occupancy data, I decided to delete these zero values, since they don't accurately reflect the room's occupancy level at the time.
# Identify the zero value during lunch time on April 29, 2017
occupancy_data['2017-04-29 12:03']
# Delete the 0 values from the occupancy data
occupancy_data = occupancy_data[occupancy_data['count_total'] != 0]
Resample Occupancy Data¶
Most often, when students entered or left the classroom, many of them did so at the same time, for example, when leaving for lunch or returning from a break. For that reason, the Flask app logged changes with millisecond timestamps.
However, I did not want to resample the occupancy data by taking the per-minute mean, as I did with the other sensor values. To understand why, you can see below that in a single minute, at 8:57 am on April 29th, a total of 19 values were recorded. If we were to just take the mean of that data, it would be 9.68. However, since this happened at a time when a large number of students were coming into the room, the room's occupancy went from 1 to 17 within that minute. Therefore, a more accurate approach when resampling is to keep the last value in each time frame, which in this case would be 17.
# Show the values for April 29 at 8:57 am
occupancy_data['2017-04-29 08:57']
# Resample occupancy data per minute and drop any NaN values
occupancy = occupancy_data.resample(rule='T').last().dropna()
# Plot boxplots for daily occupancy data
plotBoxplots(occupancy, 'count_total', 'Daily Occupancy')
Concatenate Sensor Data¶
Now that I've finished cleaning and resampling the individual sensor values (with the exception of the door data, which will be explained later), I'll concatenate them into a single DataFrame, backfilling any missing values.
# Concatenate cleaned sensor data in a new dataframe: sensor_data
sensor_data = pd.concat(
[temperature, humidity, co2, light, sound, image_data, occupancy], axis=1).fillna(method='bfill').dropna()
sensor_data.info()
Door Status¶
The door sensor is an asynchronous categorical feature with two possible values: closed and opened.
# Create door data dataframe with DateTimeIndex: door_data
door_data = pd.read_csv('../data/door_data.csv', index_col='datetime', parse_dates=True)
# Drop the location column
door_data.drop('location', axis=1, inplace=True)
# View summary information about the door data
door_data.info()
# View the head of the door data
door_data.head()
# Count the number of each category value
door_data['door_status'].value_counts()
It has roughly an equal number of both values, since each time the door opens it must also close. At first, I considered using Pandas' get_dummies() function to encode the variable, but because every open is almost always paired with a close, this naturally led to two highly correlated columns. So instead, I decided to encode door_status using scikit-learn's LabelEncoder.
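To illustrate the redundancy, here's a toy example with made-up values (not our sensor data):
# Toy example: one-hot encoding a binary open/close status produces two
# perfectly anti-correlated columns, so one of them is redundant
demo = pd.DataFrame({'door_status': ['opened', 'closed', 'opened', 'closed']})
pd.get_dummies(demo['door_status']).astype(int).corr()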
# Encode the door feature
le = LabelEncoder()
door_data['door_status'] = le.fit_transform(door_data['door_status'])
Then, to resample the data, I want to sum the number of times the door opened each minute, rather than calculating the mean value. Unfortunately, Pandas' resample() function created over 100K extra rows containing zeros, since it generates a row for every minute, regardless of whether there was any data at that point in time. So after resampling the data, I deleted these extra rows.
# Resample door status data
door = door_data.resample('T').sum()
# Delete rows containing zeros
door.drop(door[door.door_status == 0].index, inplace=True)
# View the head of the door data
door.head()
# Concatenate sensor and door data in a new dataframe: sensor_data
sensor_data = pd.concat([sensor_data, door], axis=1).fillna(method='bfill')
# Rename the dataframe columns
sensor_data.columns = ['temp', 'humidity', 'co2', 'light', 'sound', 'images', 'occupancy', 'door']
sensor_data.head()
# Summary statistics for sensor data
sensor_data.describe()
Create Category Variable¶
Finally, I'll create a new variable by splitting occupancy into different levels. This will be my target variable when building classification models, while I'll use occupancy for regression models.
Unfortunately, the target variable is highly imbalanced toward high occupancy, which I expected, since the classroom was largely full throughout the day except during breaks and lunchtime. Our team did try to balance our dataset by leaving our sensors in the room overnight, but our plan was thwarted by a well-meaning cleaning lady.
# Create the target variable by binning the 'occupancy' column: occupancy_level
sensor_data['occupancy_level'] = pd.cut(
sensor_data['occupancy'], [0, 13, 25, 31], labels=['low', 'mid-level', 'high'], include_lowest=True
)
sensor_data.info()
# Show the balance of classes
visualizer = ClassBalance()
visualizer.fit(sensor_data['occupancy_level'])
visualizer.show()
Save Data¶
I'll save the updated data into two different CSV files, one for building my classification models and one for my regression models. The classification dataset won't include the occupancy variable, since it was used to create the target variable occupancy_level. Likewise, the regression dataset will keep occupancy and drop occupancy_level, since occupancy will be the target variable for my regression models.
# Rearrange columns for the classification dataset
classification_data = sensor_data[[
'temp', 'humidity', 'co2', 'light', 'sound', 'images', 'door', 'occupancy_level']
]
# Export updated classification data to a CSV file: classification_data.csv
classification_data.to_csv('../data/classification_data.csv')
# Rearrange columns for the regression dataset
regression_data = sensor_data[['temp', 'humidity', 'co2', 'light', 'sound', 'images', 'door', 'occupancy']]
# Export updated regression sensor data to a CSV file: regression_data.csv
regression_data.to_csv('../data/regression_data.csv')
So that's all for now, but join me next time as I try out one of Yellowbrick's newest visualizers, DiscriminationThreshold, and finally see if my latest data cleaning efforts result in more predictive models this time around for my Capstone project.
And if you've read this far then you don't get to make fun of me for getting excited about such things!