I completed Georgetown University's Data Science Certificate Program in 2017, and now that my Python and machine learning skills have improved, I'd like to look at the data with fresh eyes and try to build better models.

While my initial classification models had F1 scores from 0.81 to 1.0, the R² from regression models were always quite low, and even negative in some cases. It was an intensive program, and I was working full-time, so I'm certain I missed things and could have done better. But to paraphrase George Eliot, it's never too late for your machine learning models to be what they might have been!

Classroom Occupancy: An Introduction¶

I worked on this Capstone project with four other students and my primary role was cleaning the data and building the models. Our projected goal was to create a web application that could incorporate supervised machine learning models to predict a room's occupancy level (or number of occupants when using regression models) based on real-time sensor data. We used Raspberry Pi 3 and other devices to capture sensor data from our classroom including, CO2, temperature, humidity, images, sound, light, and bluetooth devices. We also placed a sensor on the room's only door to capture each time it opened and closed. One of my team members also created a Flask app, but I'll save that for later.

To start, I'll ingest the data and clean each sensor separately, since this will make it easier to explore the data and to identify potential outliers. So let's begin!

In [1]:

%matplotlib inline

In [2]:

import numpy as np
import pandas as pd

import matplotlib.dates as md
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr
import seaborn as sns

from sklearn.preprocessing import LabelEncoder

from yellowbrick.target import ClassBalance

sns.set()

In [3]:

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

EDA Functions¶

In [4]:

def plotBoxplots(df, column, title):
    """Plot boxplots for each day of data."""
    
    # Create dataframe with a daily PeriodIndex
    df_period = df.to_period(freq='D')
    
    # Boxplots for daily sensor data
    _, ax = plt.subplots(figsize=(14,10))
    
    sns.boxplot(
        x=df_period.index, y=column, data=df_period,
        ax=ax, palette=sns.color_palette('RdBu', 10))
    
    ax.set_title(title, fontsize=22)
    ax.set_xlabel('Date', fontsize=14)
    ax.set_ylabel(title, fontsize=14)
    
    # Create list of dates for the x-ticks labels
    labels = (
        ['March 25', 'April 1', 'April 8', 'April 22', 'April 29',
         'May 5', 'May 6', 'May 12', 'May 13', 'June 3', 'June 10'])
    
    ax.set_xticklabels(labels, size=12)       # Set x-ticks labels
    plt.show()                                # Show plot

In [5]:

def plotHistogram(df, column, title):
    """Plot a histogram of the sensor data."""
    
    # Calculate number of hist bins
    n_data = len(df[column])
    n_bins = int(np.sqrt(n_data))
    
    _, ax = plt.subplots()
    ax.hist(df[column], bins=n_bins, range=(df[column].min(), df[column].max()))
    ax.set_title(title, size=18)       # Set the plot title
    ax.set_ylabel('Count')             # Set the ylabel
    
    # Set the x-ticks and y-ticks parameters
    ax.xaxis.set_major_formatter(tkr.FuncFormatter(lambda x,  p: format(int(x), ',')))
    ax.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y,  p: format(int(y), ',')))
    
    plt.show()

Data Ingestion¶

The sensor_data.csv file contains 46,275 observations and contains the following features: location, location_CO2, temperature, humidity, sound, bluetooth devices, and non-personal bluetooth devices.

In [6]:

# Read csv file in as a Pandas DataFrame with a DateTimeIndex: df
df = pd.read_csv('../data/sensor_data.csv', index_col='datetime', parse_dates=True)

# Rename the columns
df.columns = [
    'loc', 'loc2', 'temperature', 'humidity', 'co2', 'light',
    'sound', 'bluetooth_devices', 'nonpersonal_bluetooth']

In [7]:

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 46275 entries, 2017-03-25 09:05:58 to 2017-06-10 16:47:05
Data columns (total 9 columns):
loc                      46275 non-null object
loc2                     46275 non-null object
temperature              46275 non-null float64
humidity                 46275 non-null float64
co2                      46275 non-null float64
light                    46275 non-null float64
sound                    46275 non-null float64
bluetooth_devices        46275 non-null int64
nonpersonal_bluetooth    38187 non-null float64
dtypes: float64(6), int64(1), object(2)
memory usage: 3.5+ MB

When the project first started, our group had debated the idea of gathering data from other classrooms, which is why the dataset contains the first two location columns. However, since it only contains data from our classroom, I can drop these columns.

I'll also drop the nonpersonal_bluetooth feature, which was not added until April 8th, so there are two days of missing data. And, as you can see below, it is highly correlated with bluetooth_devices.

In [8]:

# Jointplot of bluetooth devices and non-personal bluetooth devices
sns.jointplot(
    x='bluetooth_devices', y='nonpersonal_bluetooth',
    data=df.dropna(), kind='reg')

plt.xlabel('Bluetooth Devices', fontsize=14)
plt.ylabel('Non-Personal Bluetooth Devices', fontsize=14)
plt.show()

In [9]:

# Delete the non-personal bluetooth and two location columns
df.drop(['loc', 'loc2', 'nonpersonal_bluetooth'], axis=1, inplace=True)

Dataset Overview¶

Grouping by day shows that we have data for 11 class periods. Each day has a different number of values, since it wasn't possible to begin and end recording at the exact time each day. In addition, Friday night classes were only 3 hours long, from 6:30 PM to 9:30 PM EST, while Saturday classes were 7 hours, from 9:00 AM to 4:00 PM EST. This is why there are significantly fewer observations on May 5th, May 12th, and June 3rd.

In [10]:

# Use groupby to see which days are captured in the data
df.groupby(df.index.strftime('%D')).count()

Out[10]:

	temperature	humidity	co2	light	sound	bluetooth_devices
03/25/17	3711	3711	3711	3711	3711	3711
04/01/17	4377	4377	4377	4377	4377	4377
04/08/17	4575	4575	4575	4575	4575	4575
04/22/17	5298	5298	5298	5298	5298	5298
04/29/17	5039	5039	5039	5039	5039	5039
05/05/17	2390	2390	2390	2390	2390	2390
05/06/17	5320	5320	5320	5320	5320	5320
05/12/17	2083	2083	2083	2083	2083	2083
05/13/17	5195	5195	5195	5195	5195	5195
06/03/17	2745	2745	2745	2745	2745	2745
06/10/17	5542	5542	5542	5542	5542	5542

Data Wrangling¶

The initial exploratory data analysis showed that the several of the sensors had generated multiple error values. Most often, these readings occurred at the beginning and end of the day when the devices were turned on and off. However, there were also cases when the sensors would restart, causing either error values and creating gaps of missing data.

Some of the error values were easy to identify, such as a temperature reading of -999.99 °Celsius. However others, such as a CO₂ value of 2, required further investigation, since without any domain knowledge, I couldn’t be certain if 2 was a low, but accurate reading.

Temperature Data¶

Type of Sensor: AM2303 (captures both temperature and humidity)  
Sensor Range: -40 to 125°C

Reference: OSHA recommends temperature control in the range of 20-24.4°C (68-76°F).

In [11]:

# Create temperature dataframe with a DateTimeIndex
temperature_data = df[['temperature']].copy()

The temperature data has two -999 values. Below, we can see that on April 8th it was the first value of the day, while the second occurred on May 6th during the first hour of class. In addition, because the AM2303 sensor captures both temperature and relative humidity, these same -999 errors allow show up in the humidity data.

In [12]:

# Summary statistics for the temperature data
temperature_data.describe()

Out[12]:

	temperature
count	46275.000000
mean	23.022280
std	6.845733
min	-999.000000
25%	22.400000
50%	22.900000
75%	23.200000
max	29.400000

In [13]:

# Show when the first -999 value was recorded
temperature_data['2017-04-08'].head()

Out[13]:

	temperature
datetime
2017-04-08 08:58:39	-999.0
2017-04-08 08:58:44	21.8
2017-04-08 08:58:50	21.8
2017-04-08 08:58:55	21.8
2017-04-08 08:59:00	21.8

In [14]:

# Show when the second -999 temperature value was recorded
temperature_data['2017-05-06 09:22:05':'2017-05-06 09:22:20']

Out[14]:

	temperature
datetime
2017-05-06 09:22:06	23.2
2017-05-06 09:22:11	23.1
2017-05-06 09:22:12	-999.0
2017-05-06 09:22:16	23.1
2017-05-06 09:22:17	22.7

Since even Neptune, the coldest planet in our solar system (sorry Pluto!), has an average temperature of -214°C, the -999 values are obviously errors generated by the sensor, so I'll delete them.

In [15]:

# Delete two -999 values from the temperature data
temperature_data = temperature_data[temperature_data['temperature'] != -999]

In [16]:

# Updated summary statistics for the temperature data
temperature_data.describe()

Out[16]:

	temperature
count	46273.000000
mean	23.066453
std	1.310251
min	20.700000
25%	22.400000
50%	22.900000
75%	23.200000
max	29.400000

In [17]:

# Plot boxplots of the daily temperature data
plotBoxplots(temperature_data, 'temperature', 'Temperature °C')

The boxplots show single outliers for certain days, such as for March 25th, May 13th, and June 10th. Further investigation found that these values are from the initial sensor reading of that particular day (please see below). While the difference between the first reading and the following values may not seem as significant at first, the dataframe's standard deviation is only 1.4, so it is important to address them. Since the values are relatively close, I will delete them here and they will be backfilled later when I concatenate the data back into a single DataFrame.

In [18]:

# Look at the first value from April 22, 2017
temperature_data.loc['2017-04-22'].head()

Out[18]:

	temperature
datetime
2017-04-22 08:35:29	25.8
2017-04-22 08:35:34	22.7
2017-04-22 08:35:39	22.6
2017-04-22 08:35:44	22.6
2017-04-22 08:35:49	22.6

In [19]:

# Look at the first temperature value from May 13, 2017
temperature_data['2017-05-13'].head()

Out[19]:

	temperature
datetime
2017-05-13 08:57:13	20.7
2017-05-13 08:57:18	21.4
2017-05-13 08:57:23	21.4
2017-05-13 08:57:28	21.4
2017-05-13 08:57:33	21.5

In [20]:

# Look at the first temperature value from June 10, 2017
temperature_data['2017-06-10'].head()

Out[20]:

	temperature
datetime
2017-06-10 09:03:10	24.8
2017-06-10 09:03:15	23.6
2017-06-10 09:03:20	23.6
2017-06-10 09:03:25	23.5
2017-06-10 09:03:30	23.5

Since the first temperature values recorded for each class period were outliers generated by the sensor, I'll delete those from the data.

In [21]:

# Group the temperature data by day
temp_data = temperature_data.groupby(temperature_data.index.date).head(1)

# Drop the first temperature value for each day
temperature_data = temperature_data.drop(temp_data[temp_data == 1].index)

In [22]:

# Updated boxplots of the daily temperature data
plotBoxplots(temperature_data, 'temperature', 'Temperature °C')

In [23]:

# Plot a histogram of the temperature data
plotHistogram(temperature_data, 'temperature', 'Temperature °C')

Resample Temperature Data¶

The temperature data was original taken at 5-second intervals, so I'll resample it now 1-minute frequency by taking the mean of the surrounding values.

In [24]:

# Resample temperature data 
temperature = temperature_data.resample('T').mean().dropna()

In [25]:

# Updated temperature summary statistics
temperature.describe()

Out[25]:

	temperature
count	4155.000000
mean	23.116235
std	1.427802
min	21.000000
25%	22.300000
50%	22.900000
75%	23.200000
max	29.350000

Humidity¶

Type of Sensor: AM2303
Sensor Range: 0-100% RH

Reference: OSHA recommends humidity control in the range of 20%-60%.

In [26]:

# Create humidity dataframe with DateTimeIndex: humidity_data
humidity_data = df[['humidity']].copy()

Our sensor captured both temperature and humidity together, which is why the humidity data also contains two -999 values.

In [27]:

# Summary statistics for the humidity data
humidity_data.describe()

Out[27]:

	humidity
count	46275.000000
mean	39.769657
std	9.471097
min	-999.000000
25%	37.800000
50%	40.100000
75%	45.100000
max	52.200000

In [28]:

# Delete two -999 values
humidity_data = humidity_data[humidity_data['humidity'] != -999]

In [29]:

# Updated humidity statistics
humidity_data.describe()

Out[29]:

	humidity
count	46273.000000
mean	39.814555
std	6.562350
min	20.800000
25%	37.800000
50%	40.100000
75%	45.100000
max	52.200000

In [30]:

# Plot a histogram of the humidity data
plotHistogram(humidity_data, 'humidity', 'Humidity Data')

In [31]:

# Plot boxplots of the daily humidity data
plotBoxplots(humidity_data, 'humidity', 'Humidity Data')

In [32]:

# TO DO: Look into why the data April 8 differs so greatly from the others.

In [33]:

# Resample the humidity data
humidity = humidity_data.resample('T').mean().dropna()

In [34]:

# Updated humidity summary statistics
humidity.describe()

Out[34]:

	humidity
count	4157.000000
mean	39.581197
std	6.547206
min	21.190909
25%	37.730000
50%	39.716667
75%	44.916667
max	50.730000

CO₂ Data¶

Type of Sensor: COZIR Ambient GC-0010 Sensor
Sensor Range: 0-2000 parts per million (ppm)

Reference: OSHA recommends keeping indoor CO₂ levels below 1000 ppm.

In [35]:

# Create CO2 dataframe with DateTimeIndex: co2_data
co2_data = df[['co2']].copy()

The summary statistics report a mean of 1236.25 and a standard deviation of 178.46. However, the 25% quartile has a range of 2 to 1110.

While the -999 temperature and humidity values were obviously errors, without any domain knowledge, I didn't initially know if a CO₂ value of 2 was caused by a sensor error or was a low, but accurate, reading.

In [36]:

# Summary statistics for the CO2 data
co2_data.describe()

Out[36]:

	co2
count	46275.00000
mean	1236.24510
std	178.36047
min	2.00000
25%	1110.00000
50%	1261.00000
75%	1355.00000
max	2001.00000

The sensor randomly generated thirty-three error 2 values, and while they did not only occur when the sensor was turned on, they did also occur at that time.

In [37]:

# Identify how many 2 values are in the data
len(co2_data[co2_data['co2'] == 2])

Out[37]:

In [38]:

# Look at the first CO2 value from April 8, 2017
co2_data.loc['2017-04-08'].head()

Out[38]:

	co2
datetime
2017-04-08 08:58:39	2.0
2017-04-08 08:58:44	792.0
2017-04-08 08:58:50	776.0
2017-04-08 08:58:55	763.0
2017-04-08 08:59:00	776.0

In [39]:

# Look at the first CO2 value from June 10, 2017
co2_data.loc['2017-06-10'].head()

Out[39]:

	co2
datetime
2017-06-10 09:03:10	2.0
2017-06-10 09:03:15	1290.0
2017-06-10 09:03:20	1310.0
2017-06-10 09:03:25	1293.0
2017-06-10 09:03:30	1333.0

In [40]:

# Delete 2 values
co2_data = co2_data[co2_data['co2'] != 2]

In [41]:

# Updated CO2 statistics
co2_data.describe()

Out[41]:

	co2
count	46242.000000
mean	1237.125903
std	175.348893
min	629.000000
25%	1110.000000
50%	1261.000000
75%	1355.000000
max	2001.000000

On April 1, 2017, the CO₂ sensor readings suddenly spiked up to 2001, as you can see below. The sensor was only supposed to have a range of 0-2000 ppm.

In [42]:

# Create figure and axes
fig, ax = plt.subplots(figsize=(16,8))

# Plot spike in CO2 level on April 1, 2017
ax.plot(co2_data.loc['April 1, 2017'])

# Add title and labels
ax.set_title('Spike in CO2 Level on April 1, 2017', fontsize=18)
ax.set_ylabel('CO2 Level (ppm)', fontsize=16, weight='bold')
ax.set_xlabel('Time of Day', fontsize=16)
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M %p'))
ax.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y,  p: format(int(y), ',')))

# Create inset
ax = plt.axes([.58, .55, .3, .3], facecolor='w')

# Plot inset showing the spike in CO2 values
ax.plot(co2_data['co2'].loc['2017-04-01 09:55:00':'2017-04-01 10:38:00'].index, 
        co2_data['co2'].loc['2017-04-01 09:55:00':'2017-04-01 10:38:00'], 'g', linewidth=2.0)

# Format inset ticks
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M'))
ax.yaxis.set_major_formatter(tkr.FuncFormatter(lambda y,  p: format(int(y), ',')))

# Show plot
plt.show()

To remove the error values caused by the spike, I decided to delete the values above 1628. I choose that value by looking at the max value for the other days, since they did not have any similar spikes, and the highest value was 1628 on June 10th.

In [43]:

# Delete error CO2 values above 1628
co2_data = co2_data[co2_data['co2'] <= 1628]

In [44]:

# Plot a histogram of the updated CO2 data
plotHistogram(co2_data, 'co2', 'CO2 Data')

In [45]:

# Plot boxplots of the daily CO2 data
plotBoxplots(co2_data, 'co2', 'CO2 Data')

In [46]:

# Resample CO2 data to per-minute interval
co2 = co2_data.resample('T').mean().dropna()

Sound Data¶

Type of Sensor: Electret Microphone Amplifier MAX4466  
Sensor Range: 0 to 20K Hz

Reference: Human speech frequencies are in the range of 500 Hz to 4,000 Hz. A young person with normal hearing can hear frequencies between approximately 20 Hz and 20,000 Hz.

In [47]:

# Create noise dataframe with DateTimeIndex: sound_data
sound_data = df[['sound']].copy()

In [48]:

# Summary statistics for the sound data
sound_data.describe()

Out[48]:

	sound
count	46275.000000
mean	283.936704
std	174.322552
min	0.000000
25%	144.000000
50%	145.000000
75%	495.000000
max	1023.000000

In [49]:

# Plot a histogram of the sound data
plotHistogram(sound_data, 'sound', 'Sound Level (Hz)')

In [50]:

# Plot Boxplots for daily sound data
plotBoxplots(sound_data, 'sound', 'Sound Data (Hz)')

In [51]:

# ToDo: explore sound outliers

In [52]:

# Resample sound data
sound = sound_data.resample('T').mean().dropna()

Light Data¶

Type of Sensor: Photoresistor GL5537

Reference: Illuminance is measured in foot candles or lux (in the metric SI system). GSA recommends a nominal illumination level (Lumens/Square Meter lux) of 300 for conference rooms, or 500 Lux in work station space, open and closed offices, and in training rooms.

For Reference: The sensor has a light resistance of 10 Lux (30 to 50 kohm).

In [53]:

# Create light dataframe with DateTimeIndex: light_data
light_data = df[['light']].copy()

In [54]:

# Summary statistics for the light data
light_data.describe()

Out[54]:

	light
count	4.627500e+04
mean	2.923096e+03
std	1.494275e+05
min	0.000000e+00
25%	1.780000e+02
50%	2.070000e+02
75%	3.830000e+02
max	1.000000e+07

The light sensor generated several large error values or 0 readings when it would restart. In addition, it also generated error values at the end of the day when it was turned off. In the following plot of light data on May 5th, the light values never went over 400 lux until the final seconds of the day.

In [55]:

# Identify the high error values
light_max = light_data[light_data['light'] > 4000]
light_max['light']

Out[55]:

datetime
2017-04-08 14:02:54    9306527.000
2017-04-22 11:56:44    7718174.000
2017-04-22 12:53:51    8308485.000
2017-04-22 14:10:58    8016883.000
2017-04-29 12:23:45    9999999.999
2017-04-29 14:33:24    9999999.999
2017-04-29 15:57:55    9999999.999
2017-04-29 16:03:05       4042.000
2017-04-29 16:03:10       4052.000
2017-05-06 11:13:38    8009105.000
2017-05-12 18:26:26    8707175.000
2017-05-12 20:46:56    8725352.000
2017-05-13 09:02:19    9135322.000
2017-06-10 10:57:31    8281044.000
2017-06-10 12:30:24    9273526.000
Name: light, dtype: float64

While a light value of 0 is possible, it's unlikely since even with the classroom lights turned off, there still would have been light from the hallway. In addition, I concluded that these 0 values were errors since they were isolated readings, as can be seen below.

The light sensor generated several large error values or 0 readings when it would restart. In addition, it also generated error values at the end of the day when it was turned off. In the following plot of light data on May 5th, the light values never went over 400 lux until the final seconds of the day.

In [56]:

# Look at 0 light reading on March 25, 2017
light_data.light['March 25, 2017 11:48:20':'March 25, 2017 11:49:00']

Out[56]:

datetime
2017-03-25 11:48:22    463.0
2017-03-25 11:48:28    454.0
2017-03-25 11:48:34    465.0
2017-03-25 11:48:40      0.0
2017-03-25 11:48:46    462.0
2017-03-25 11:48:52    461.0
2017-03-25 11:48:58    445.0
Name: light, dtype: float64

In [57]:

# Plot light data for May 5, 2017
fig, ax = plt.subplots()

ax.plot(light_data.loc['May 5, 2017 21:20:00':'May 5, 2017 22:00:00'])
ax.set_title('Light Data: May 5, 2017', fontsize=16)
ax.set_xlabel('Time of Day', fontsize=14)
ax.set_ylabel('Light, lux', fontsize=14)
ax.tick_params(labelsize=12)
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M %p'))
plt.show()

In [58]:

# Identify how many 0 light values are in the data
len(light_data[light_data['light'] == 0])

Out[58]:

In [59]:

# Delete error 0 light values
light_data = light_data[light_data['light'] != 0]

In [60]:

# Delete error light values
light_data = light_data[light_data['light'] < 4000]

In [61]:

# Plot updated histogram of light data
plotHistogram(light_data, 'light', 'Light Level (Lux)')

While a light value of 0 is possible, it's unlikely since even with the classroom lights turned off, there still would have been light from the hallway. In addition, I concluded that these 0 values were errors since they were isolated readings, as can be seen below.

In [62]:

# Look at 0 light reading on March 25, 2017
light_data.light['March 25, 2017 11:48:20':'March 25, 2017 11:49:00']

Out[62]:

datetime
2017-03-25 11:48:22    463.0
2017-03-25 11:48:28    454.0
2017-03-25 11:48:34    465.0
2017-03-25 11:48:46    462.0
2017-03-25 11:48:52    461.0
2017-03-25 11:48:58    445.0
Name: light, dtype: float64

In [63]:

# Plot updated box plots for light data
plotBoxplots(light_data, 'light', 'Light Level (Lux)')

In [64]:

# ToDo: look into the remaining outliers

In [65]:

# Resample light data
light = light_data.resample('T').mean().dropna()

Bluetooth Devices¶

The Raspberry Pi 3 Model B comes with its own built in wifi and Bluetooth Low Energy (LE), so we were able to scan for the number of bluetooth devices. Since we were not able to limit the area we scanned to only our classroom, the sensor picked up devices from other classrooms. However, since we were in the same classroom each day, we still thought it could be a predictive variable.

In [66]:

# Create bluetooth devices dataframe with DateTimeIndex: bluetooth_data
bluetooth_data = df[['bluetooth_devices']].copy()

In [67]:

# Summary statistics for the bluetooth data
bluetooth_data.describe()

Out[67]:

	bluetooth_devices
count	46275.00000
mean	220.83369
std	140.24883
min	0.00000
25%	116.00000
50%	181.00000
75%	309.00000
max	635.00000

Looking at the summary statistics for bluetooth_devices, I was surprised that the minimum value was zero, since each of the students setting up the sensors had multiple devices emitting bluetooth signals.

When I looked into the data further, I found that similar to several other sensors, the first value recorded for each day was incorrect. Here, on May 5th, we can see that the first value is zero and immediately jumps to 30 and continues to rise.

In [68]:

# Look at the lowest values for May 5th
bluetooth_data['bluetooth_devices']['May 5, 2017'].head(10)

Out[68]:

datetime
2017-05-05 18:13:53     0
2017-05-05 18:13:58    30
2017-05-05 18:14:03    31
2017-05-05 18:14:08    31
2017-05-05 18:14:13    32
2017-05-05 18:14:18    34
2017-05-05 18:14:23    34
2017-05-05 18:14:28    38
2017-05-05 18:14:33    39
2017-05-05 18:14:38    39
Name: bluetooth_devices, dtype: int64

Unfortunately, unlike the other sensors, I found that when the bluetooth sensor was first started or when it randomly restarted, it would take a significant amount of time for it build back up. This can be seen by plotting the bluetooth_devices values for May 5, 2017. All through the morning, the number of devices is gradually increasing, until it suddenly restarts and drops to zero, and then begins slowing rising again.

In [69]:

# Plot the bluetooth values for May 5, 2017
fig, ax = plt.subplots(figsize=(12,6))
bluetooth_data['bluetooth_devices']['2017-04-01'].plot()

# Add a title and labels
ax.set_title('Bluetooth Values for May 5, 2017', fontsize=20)
ax.set_xlabel('Time of Day', fontsize=14)
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M %p'))
ax.set_ylabel('No. of Bluetooth Devices', fontsize=14)

# Show the plot
plt.show()

Another serious issue I discovered was that there were long periods of time when the sensor only recorded zeros. For example, on April 29th, only zero values were recorded for almost an hour of half.

In [70]:

# Identify the number of 0 values
fig, ax = plt.subplots(figsize=(12,6))
bluetooth_data['bluetooth_devices']['2017-04-29'].plot()

# Add a title and labels
ax.set_title('Bluetooth Values for April 29, 2017', fontsize=20)
ax.set_xlabel('Time of Day', fontsize=14)
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M %p'))
ax.set_ylabel('No. of Bluetooth Devices', fontsize=14)

# Show the plot
plt.show()

Unfortunately, given these issues, I won't be able to use the bluetooth data that we collected. At first I had considered different methods to fill the missing data and zero values. However, after plotting the data, I don't believe that the sensor accurately captured the number of devices.

I know from my own experience on campus that a large number of students and professors arrive in the morning, leave for an hour for lunch, and then stay in the building until the end of the day. However, the plots show a small number of devices each morning and then a gradual increase throughout the day. Furthermore, there is no decline during lunchtime. Our building only had one small coffee shop that sold a small selection of drinks and snacks, so to get lunch you had to leave the building. So for these reasons, I won't be able to use this feature to build my predictive models.

Images¶

Type of Sensor: Raspberry Pi Camera Module v2

Initially, to capture the ground truth for our dataset, our team setup an 8-megapixel camera with a wide-angle lens to take pictures of the classroom at one-minute intervals. Unfortunately, due to time restraints, we were not able to use them for that purpose. Furthermore, when we asked our fellow students for their permission to take pictures at the beginning of the project, we had promised them that the pictures would be deleted at the end of the semester, so they are no longer available.

However, our team was still able to capture data from the images by calculating the "closeness" based on the root-mean-square error (RMSE) between successive images. For identical images, the difference between them would be zero, so higher values indicate a greater divergence.

In [71]:

# Create image dataframe with DateTimeIndex: image_data
image_data = pd.read_csv('../data/image_variations.csv', index_col='datetime', parse_dates=True)

# Rename the rolling_rms column to image_rms
image_data.rename({'rolling_rms':'image_rms'}, axis=1, inplace=True)

In [72]:

# Summary statistics for the image data
image_data.describe()

Out[72]:

	control_F_rms	control_L_rms	image_rms
count	4469.000000	4469.000000	4469.000000
mean	35.647687	34.007843	13.331265
std	14.392077	10.525568	5.699915
min	0.000000	0.000000	0.000000
25%	24.656496	28.102343	10.005781
50%	28.884805	31.930907	12.042529
75%	46.719821	35.002443	15.774376
max	70.657752	69.706517	64.560408

In [73]:

# View the first five rows of the image data
image_data.head()

Out[73]:

	control_F_rms	control_L_rms	image_rms
datetime
2017-03-25 09:11:00	0.000000	68.764028	0.000000
2017-03-25 09:12:00	15.242697	69.110523	15.242697
2017-03-25 09:13:00	15.526992	69.169608	15.087697
2017-03-25 09:14:00	18.106792	69.253149	15.422978
2017-03-25 09:15:00	19.040465	69.159929	14.799398

Since we are calculating the change between successive images, the first value for each class period is zero.

In [74]:

# Identify the 0 values in the image data
image_data[image_data['image_rms'] == 0]

Out[74]:

	control_F_rms	control_L_rms	image_rms
datetime
2017-03-25 09:11:00	0.0	68.764028	0.0
2017-04-01 09:06:00	0.0	44.664898	0.0
2017-04-08 09:04:00	0.0	35.672150	0.0
2017-04-22 08:36:00	0.0	60.087447	0.0
2017-04-29 08:52:00	0.0	20.514254	0.0
2017-05-05 18:15:00	0.0	21.546584	0.0
2017-05-06 08:59:00	0.0	24.173091	0.0
2017-05-12 18:27:00	0.0	31.552215	0.0
2017-05-13 08:40:00	0.0	51.595340	0.0
2017-06-03 09:04:00	0.0	27.748093	0.0
2017-06-10 09:04:00	0.0	35.994072	0.0

So I'll delete the zero values now and then backfill them when concatenating the image data. In addition, I can delete the control_F_rms and control_L_rms columns now, since they are no longer needed.

In [75]:

# Delete zero values in the image data
image_data = image_data[image_data['image_rms'] != 0]

# Delete the control_F_rms and control_L_rms columns from the image data
image_data.drop(['control_F_rms', 'control_L_rms'], axis=1, inplace=True)

In [76]:

# Plot histogram of daily image data
plotHistogram(image_data, 'image_rms', 'Daily Image Data')

In [77]:

# Plot boxplots for daily image data
plotBoxplots(image_data, 'image_rms', 'Daily Image Data')

In [78]:

# ToDo: investigate outliers

In the following chart, you can see that the largest changes occurred during our class breaks, lunch, and at the end of the day. Typically, our professors gave us a 5-10 minute break in the morning between 10:00 am and 10:30 am and in the afternoon between 2:00 pm and 2:30 pm. Our lunch break was an hour long and most often began at 12:00 pm. Class ended approximately at 4:00 pm, but typically some students stayed afterwards to talk with the professor or to meet with the teammates, so we tried to stay until 4:30 pm to capture these changes.

In [79]:

# Create the figure and the axes
fig, ax = plt.subplots(figsize=(16,8))

# Plot image data for May 6, 2017
ax.plot(image_data.loc['May 6, 2017'])

# Add title and labels
ax.set_title('Image Data: May 6, 2017', fontsize=22)
ax.set_xlabel('Time of Day', fontsize=16)
ax.set_ylabel('% Change in Hist', fontsize=16)
ax.xaxis.set_major_formatter(md.DateFormatter('%I:%M %p'))

# Show plot
plt.show()

Since the images were taken each minute, I don't need to resample this DataFrame.

Occupancy Data¶

Our team created a Flask web application to log the number of people in the classroom. During breaks, one team member would use it to log each person who entered or exited the room. The app also included an "Empty Room" option that set the counter to 0. This was used most often used during the lunch break when there was no one from our group in the classroom to keep track of the number of occupants and at the end of the day. While we had originally intended to use the images to determine the total number of students in the classroom, we ultimately had to use the data generated by the Flask app for our model's target variable.

In [80]:

# Create occupancy count dataframe with DateTimeIndex: occupancy_data
occupancy_data = pd.read_csv('../data/occupancy_data.csv', index_col='datetime', parse_dates=True)

In [81]:

# Delete the unnecessary columns
occupancy_data.drop(['location', 'count_operation', 'count_change'], axis=1, inplace=True)

# Drop data for March 18, 2017
occupancy_data.drop(occupancy_data.loc['2017-03-18'].index, inplace=True)

In [82]:

# Occupancy summary statistics
occupancy_data.describe()

Out[82]:

	count_total
count	2298.000000
mean	18.888599
std	7.853676
min	0.000000
25%	14.000000
50%	20.000000
75%	26.000000
max	31.000000

The occupancy data contains 21 zero values. When I looked into the source of these values, I found that eleven of them were the last values of the day. For example, you can see below that at the end of the class period on March 25th, the number of occupants had started to decline as students left for the day, until suddenly the number went from 13 to 0. The reason for this sharp decline would have been, that since it was the end of the day, one of our team members would have clicked the "Empty Room" option on the Flask app.

In [83]:

# Identify the zero value at the end of March 25, 2017
occupancy_data['2017-03-25'].tail()

Out[83]:

	count_total
datetime
2017-03-25 16:19:38.807	16
2017-03-25 16:19:39.486	15
2017-03-25 16:19:40.085	14
2017-03-25 16:32:00.619	13
2017-03-25 16:35:14.861	0

The remaining zero values occurred when the students left for lunch. Our team tried to stay in the room until it was empty, but sometimes a few students would remain and we needed to take our break. You can see this below as the number starts to dwindle just after noon, until the number of occupants drops from 4 to zero. Since I want to keep the last value when resampling the occupancy data, I decided to delete these zero values since they don't accurately reflect the room's occupancy level at the time.

In [84]:

# Identify the zero value during lunch time on April 29, 2017
occupancy_data['2017-04-29 12:03']

Out[84]:

	count_total
datetime
2017-04-29 12:03:00.654	7
2017-04-29 12:03:08.546	6
2017-04-29 12:03:13.341	5
2017-04-29 12:03:17.889	4
2017-04-29 12:03:27.578	0

In [85]:

# Delete the 0 values from the occupancy data
occupancy_data = occupancy_data[occupancy_data['count_total'] != 0]

Resample Occupancy Data¶

Most often, when students were entering or leaving the classroom, a large number of them would do so at the same time. For example, when leaving for lunch or coming back from a break. Therefore, the Flask app kept track of changes in milliseconds.

However, I did not want to resample the occupancy data by taking the mean of the values per-minute, like the other sensor values. To understand why, below you can see that for one minute, at 8:57 am on April 29th, we had a total of 19 values recorded. If we were just take the mean of that data, it would be 9.68. However, since this was happening at a time when a large number of students were coming into the room, the room's occupancy went from 1 to 17 at that time. Therefore, when resampling the data, a more accurate measure would be to keep the last value during that time frame, which in this case would be 17.

In [86]:

# Show the values for April 29 at 8:57 am
occupancy_data['2017-04-29 08:57']

Out[86]:

	count_total
datetime
2017-04-29 08:57:14.854	1
2017-04-29 08:57:15.145	2
2017-04-29 08:57:15.409	3
2017-04-29 08:57:15.651	4
2017-04-29 08:57:15.904	5
2017-04-29 08:57:16.225	6
2017-04-29 08:57:16.505	7
2017-04-29 08:57:16.790	8
2017-04-29 08:57:17.063	9
2017-04-29 08:57:17.360	10
2017-04-29 08:57:17.648	11
2017-04-29 08:57:17.975	12
2017-04-29 08:57:18.266	13
2017-04-29 08:57:18.620	14
2017-04-29 08:57:18.940	15
2017-04-29 08:57:19.344	16
2017-04-29 08:57:24.687	15
2017-04-29 08:57:45.578	16
2017-04-29 08:57:54.742	17

In [87]:

# Resample occupancy data per minute and drop any NaN values
occupancy = occupancy_data.resample(rule='T').last().dropna()

In [88]:

# Plot boxplots for daily occupancy data
plotBoxplots(occupancy, 'count_total', 'Daily Occupancy')

Concatenate Sensor Data¶

Now that I've finished cleaning and resampling the individual sensor values (with the exception of the door data, which will be explained later), I'll concatenate them into a single DataFrame, backfilling any missing values.

In [89]:

# Concatenate cleaned sensor data in a new dataframe: sensor_data
sensor_data = pd.concat(
    [temperature, humidity, co2, light, sound, image_data, occupancy], axis=1).fillna(method='bfill').dropna()

sensor_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4519 entries, 2017-03-25 09:05:00 to 2017-06-10 16:42:00
Data columns (total 7 columns):
temperature    4519 non-null float64
humidity       4519 non-null float64
co2            4519 non-null float64
light          4519 non-null float64
sound          4519 non-null float64
image_rms      4519 non-null float64
count_total    4519 non-null float64
dtypes: float64(7)
memory usage: 282.4 KB

Door Status¶

The door sensor is an asynchronistic categorical feature with two possible values: closed and opened.

In [90]:

# Create door data dataframe with DateTimeIndex: door_data
door_data = pd.read_csv('../data/door_data.csv', index_col='datetime', parse_dates=True)

# Drop the location column
door_data.drop('location', axis=1, inplace=True)

# View summary information about the door data
door_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3286 entries, 2017-03-25 09:03:28 to 2017-06-10 16:42:50
Data columns (total 1 columns):
door_status    3286 non-null object
dtypes: object(1)
memory usage: 51.3+ KB

In [91]:

# View the head of the door data
door_data.head()

Out[91]:

	door_status
datetime
2017-03-25 09:03:28	opened
2017-03-25 09:03:36	closed
2017-03-25 09:04:09	opened
2017-03-25 09:04:13	closed
2017-03-25 09:07:14	opened

In [92]:

# Count the number of each category value
door_data['door_status'].value_counts()

Out[92]:

opened    1650
closed    1636
Name: door_status, dtype: int64

It has roughly an equal number of both values, since each time the door opens it must also close. At first, I considered using Pandas get_dummies() function to encode the variable, but this naturally led to two highly correlated variables, since it is almost always a single event. So first, I decided to encode door_status using scikit-learn's LabelEncoder.

In [93]:

# Encode the door feature
le = LabelEncoder()
door_data['door_status'] = le.fit_transform(door_data['door_status'])

Then to resample the data, I want to sum the number of the times the door had opened each minute, rather than calculating the mean value. Unfortunately, Pandas resample() function created over 100K extra rows containing zero values, since it sums values for every minute, regardless of whether there was any data for that point in time. So after resampling the data, I deleted these extra rows.

In [94]:

# Resample door status data
door = door_data.resample('T').sum()

# Delete rows containing zeros
door.drop(door[door.door_status == 0].index, inplace=True)

# View the head of the door data
door.head()

Out[94]:

	door_status
datetime
2017-03-25 09:03:00	1
2017-03-25 09:04:00	1
2017-03-25 09:07:00	1
2017-03-25 09:21:00	1
2017-03-25 09:34:00	1

In [95]:

# Concatenate sensor and door data in a new dataframe: sensor_data
sensor_data = pd.concat([sensor_data, door], axis=1).fillna(method='bfill')

# Rename the dataframe columns
sensor_data.columns = ['temp', 'humidity', 'co2', 'light', 'sound', 'images', 'occupancy', 'door']

sensor_data.head()

Out[95]:

	temp	humidity	co2	light	sound	images	occupancy	door
datetime
2017-03-25 09:03:00	23.80	36.90	781.0	430.0	511.0	15.242697	15.0	1.0
2017-03-25 09:04:00	23.80	36.90	781.0	430.0	511.0	15.242697	15.0	1.0
2017-03-25 09:05:00	23.80	36.90	781.0	430.0	511.0	15.242697	15.0	1.0
2017-03-25 09:06:00	23.80	38.95	765.9	426.9	502.0	15.242697	15.0	1.0
2017-03-25 09:07:00	23.85	38.91	768.3	422.4	510.4	15.242697	15.0	1.0

In [96]:

# Summary statistics for sensor data
sensor_data.describe()

Out[96]:

	temp	humidity	co2	light	sound	images	occupancy	door
count	4522.000000	4522.000000	4522.000000	4522.000000	4522.000000	4522.000000	4522.000000	4522.000000
mean	23.133796	39.518349	1233.812288	413.444793	292.246419	13.419000	24.251437	1.588678
std	1.413761	6.515284	182.715748	526.834037	175.958758	5.726220	6.704512	1.526797
min	21.000000	21.190909	653.000000	135.000000	30.600000	1.982398	1.000000	1.000000
25%	22.333333	37.734091	1101.198864	178.770833	143.750000	10.040250	22.000000	1.000000
50%	22.900000	39.633333	1261.233333	209.916667	144.333333	12.071834	27.000000	1.000000
75%	23.208333	44.706250	1365.770833	385.916667	501.444444	15.866360	29.000000	2.000000
max	29.350000	50.730000	1604.500000	2891.583333	574.000000	64.560408	31.000000	47.000000

Create Category Variable¶

Finally, I'll create a new variable by splitting occupancy into different levels. This will be my target variable when building classification models, while I'll use occupancy for regression models.

Unfortunately, the target variable is highly imbalanced toward high occupancy, which I did expect since the classroom was largely filled throughout the day, except during breaks and lunchtime. Our team did try to balance our dataset by leaving our sensors in the room overnight, but our plan was thwarted by a well-meaning cleaning lady.

In [97]:

# Create target array by slicing 'occupancy_count' column: occupancy_level
sensor_data['occupancy_level'] = pd.cut(
    sensor_data['occupancy'], [0, 13, 25, 31], labels=['low', 'mid-level', 'high'], include_lowest=True
)

sensor_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4522 entries, 2017-03-25 09:03:00 to 2017-06-10 16:42:00
Data columns (total 9 columns):
temp               4522 non-null float64
humidity           4522 non-null float64
co2                4522 non-null float64
light              4522 non-null float64
sound              4522 non-null float64
images             4522 non-null float64
occupancy          4522 non-null float64
door               4522 non-null float64
occupancy_level    4522 non-null category
dtypes: category(1), float64(8)
memory usage: 322.5 KB

In [98]:

# Show the balance of classes
visualizer = ClassBalance()
visualizer.fit(sensor_data['occupancy_level'])
visualizer.show()

Out[98]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f97642df9e8>

Save Data¶

I'll save the updated data into two different CSV files, one for building my classification models and one for my regression models. In the classification dataset, I won't include the occupancy variable since it used to create the target variable occupancy_level. Likewise, in the regression dataset I'll keep the occupancy variable and not occupancy_level, since occupancy will be my target variable when building my regression models.

In [99]:

# Rearrange columns for the classification dataset
classification_data = sensor_data[[
    'temp', 'humidity', 'co2', 'light', 'sound', 'images', 'door', 'occupancy_level']
]

In [100]:

# Export updated classification data to a CSV file: classification_data.csv
classification_data.to_csv('../data/classification_data.csv')

In [101]:

# Rearrange columns for the regression dataset
regression_data = sensor_data[['temp', 'humidity', 'co2', 'light', 'sound', 'images', 'door', 'occupancy']]

In [102]:

# Export updated regression sensor data to a CSV file: regression_data.csv
regression_data.to_csv('../data/regression_data.csv')

So that's all for now, but join me next time as I try out one of Yellowbrick's newest visualizers, DiscriminationThreshold, and finally see if my latest data cleaning efforts result in more predictive models this time around for my Capstone project.

And if you've read this far then you don't get to make fun of me for getting excited about such things!