How can BellaBeat play it smart? A Google Capstone Project Write Up

Bealfan
Nov 19, 2021

This is the optional capstone project from Google's Data Analytics program.

I've also posted the code on [GitHub].

This case study follows the six steps of data analysis as taught in Google's Data Analytics program: Ask, Prepare, Process, Analyse, Share, and Act.

Step 1: Ask:

In this stage, we clearly define the problem, the objectives of the case study, and the desired outcome.

1.1: A little background

BellaBeat is a high-tech niche startup that produces high-end smart products aimed at informing and inspiring women about their own health and habits. The company has grown quickly and has positioned itself as a pioneer in tech-driven wellness trackers aimed at women.

Co-founder and Chief Creative Officer Urška Sršen is confident that an analysis of non-BellaBeat consumer data (i.e. Fitbit fitness tracker usage data from respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016) can help guide future marketing strategies for the team.

1.2: Business Task:

Analyze Fitbit tracker data to gain insight into how customers use the device and app, and uncover trends that can inform BellaBeat's marketing strategy.

1.3 Business Objectives:

* What are the trends identified?
* How could those trends be applied to BellaBeat’s customers?
* How could these trends inform BellaBeat’s marketing strategy?

1.4 Deliverables

* Clear summary of the business tasks and findings
* A description and linking of all the data sources used in the study
* Documentation of any data cleaning or data manipulation tasks performed
* A summary of the analysis with supporting visualizations and key findings of the study.
* Recommendations of the study.

1.5: Key Stakeholders

* Urška Sršen: BellaBeat’s co-founder and Chief Creative Officer
* Sando Mur: BellaBeat’s co-founder and key member of the BellaBeat executive team
* BellaBeat marketing analytics team: A team of data analysts in charge of BellaBeat’s marketing strategy.

Step 2: Prepare:

2.1: Data Sources Used

1. Public data from [Kaggle: Fitbit Tracker Data], stored in 18 CSV files.
2. The data was generated via a survey distributed through Amazon Mechanical Turk between 4.12.16 and 5.12.16.
3. Personal data was collected from 30 consenting users, including minute-level monitoring of physical activity, heart rate, and sleep patterns.

2.2: Limitations of the Dataset

* A sample size of 30 is small and may not be representative of the broader population of Fitbit users.
* The dataset is about 5 years old and may not reflect current preferences, as customer habits are likely to have evolved.
* The data was generated by a third-party external survey, so we can't fully ascertain its integrity or accuracy.

2.3: Evaluating the Data for ROCCC

A good dataset is said to be ROCCC, short for Reliable, Original, Comprehensive, Current and Cited.
We're going to evaluate the data against these requirements and rate each one on a three-level scale of GOOD/MEDIUM/LOW.

* Reliability: LOW: The sample size of 30 is too small to represent all Fitbit users.
* Originality: LOW: The dataset was generated by a third-party survey.
* Comprehensive: MEDIUM: The measured parameters match most of the study's scope.
* Current: LOW: The data is over 5 years old.
* Cited: LOW: The data once again comes from a third party and cannot be fully trusted.

Overall, the dataset is of low quality, and generating business recommendations based solely on it is not recommended.

2.4 Data Selection

The following file is selected and copied for analysis.

`dailyActivity_merged.csv`

2.5 Tools

We are using Python for data cleaning, transformation, and visualization.

---

Step 3: Process

In this step, we process the data by cleaning it and ensuring that it is correct, relevant, complete, and free of errors and outliers. We do this by performing the following:

  • Check for missing and null values.
  • Explore and observe the data.
  • Transform and format the data, casting data types where needed.
  • Perform some preliminary statistical analysis on the dataset.

3.1 Environment

I'm importing the following Python libraries, aliased as shown below for easier reading.

```python
# import packages and alias them
import numpy as np               # data arrays
import pandas as pd              # data structures and data analysis
import matplotlib.pyplot as plt  # data visualization
import datetime as dt            # dates and times
```

3.2: Importing the Dataset

```python
# read_csv function to read the CSV file
daily_activity = pd.read_csv("../input/capstone-case-study-bellabit-fitbit-dataset/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
```

3.3 Data cleaning/Manipulation Process

* Make basic observations about the dataset.
* Check for null or missing values.
* Perform a sanity check of the data and decide whether it should be sanitized.

Inspecting the first 10 rows to familiarize ourselves with the data:

```python
daily_activity.head(10)
```

Seeking null and missing values from the dataset:

```python
# obtain the number of missing data points per column
missing_values_count = daily_activity.isnull().sum()

# number of missing points in all columns
missing_values_count[:]
```

Finding out some of the basic information that can be deduced from the dataset:
* number of rows and columns
* column names
* non-null count
* data types

`daily_activity.info()`

Since this is an externally generated dataset, we're going to check whether there are 30 unique IDs, as claimed by the survey.

```python
# count unique values of "Id"
unique_id = len(pd.unique(daily_activity["Id"]))
print("# of unique Id: " + str(unique_id))
```

From the above observations, we can note the following details about the data and the datatype:

1. There are no null or missing values, as shown by the null count.
2. The data has 15 columns and 940 rows.
3. `ActivityDate` is classified as the object data type and needs to be converted to the `datetime64` data type, and there are 33 unique IDs in the data instead of the 30 respondents claimed by the survey.

Now that we’ve identified aspects of the dataset, we can perform the following manipulations.

1. Cast `ActivityDate` to the `datetime64` dtype.
2. Convert the format of `ActivityDate` to `yyyy-mm-dd`.
3. Make a new column called `DayOfTheWeek` by extracting the day of the week from the date for further analysis.
4. Create a new column `TotalMins` as the sum of `VeryActiveMinutes`, `FairlyActiveMinutes`, `LightlyActiveMinutes` and `SedentaryMinutes`.
5. Create a new column `TotalHours` by converting the new `TotalMins` column from no. 4 to a number of hours.
6. Rearrange and rename columns as needed.

For the purposes of our analysis, we will convert `ActivityDate` from object to the `datetime64` dtype and also convert `ActivityDate` to the `yyyy-mm-dd` format.

```python
# convert "ActivityDate" to datetime64 dtype and format to yyyy-mm-dd
daily_activity["ActivityDate"] = pd.to_datetime(daily_activity["ActivityDate"], format="%m/%d/%Y")

# print information to confirm
daily_activity.info()

# print the first 5 rows of "ActivityDate" to confirm
daily_activity["ActivityDate"].head()
```

Creating a new list of rearranged column names and renaming `daily_activity` to the shorter `df_activity`.

```python
# create a new list of rearranged columns
new_cols = ['Id', 'ActivityDate', 'DayOfTheWeek', 'TotalSteps', 'TotalDistance', 'TrackerDistance',
            'LoggedActivitiesDistance', 'VeryActiveDistance', 'ModeratelyActiveDistance',
            'LightActiveDistance', 'SedentaryActiveDistance', 'VeryActiveMinutes',
            'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes',
            'TotalExerciseMinutes', 'TotalExerciseHours', 'Calories']

# reindex function to rearrange columns based on "new_cols"
df_activity = daily_activity.reindex(columns=new_cols)

# print the first 5 rows to confirm
df_activity.head(5)
```

Adding a new column by separating the date into day of the week for further analysis.

```python
# create a new column called "DayOfTheWeek"
df_activity["DayOfTheWeek"] = df_activity["ActivityDate"].dt.day_name()
df_activity["DayOfTheWeek"].head(5)
```

Rearranging and renaming columns from `XxxYyy` to `xxx_yyy`.
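The renamed columns are referenced throughout the later code blocks. Below is a minimal sketch of this step, together with the `total_mins` column from item 4 of the manipulation list; the exact snake_case mapping is an assumption inferred from the column names used further down.

```python
# rename the columns used later from CamelCase to snake_case
# (remaining columns follow the same xxx_yyy pattern)
df_activity = df_activity.rename(columns={
    "DayOfTheWeek": "day_of_the_week",
    "TotalSteps": "total_steps",
    "VeryActiveMinutes": "very_active_mins",
    "FairlyActiveMinutes": "fairly_active_mins",
    "LightlyActiveMinutes": "lightly_active_mins",
    "SedentaryMinutes": "sedentary_mins",
    "Calories": "calories",
})

# total minutes logged per day as the sum of the four activity-intensity columns
df_activity["total_mins"] = (df_activity["very_active_mins"]
                             + df_activity["fairly_active_mins"]
                             + df_activity["lightly_active_mins"]
                             + df_activity["sedentary_mins"])

# print the first 5 rows to confirm
df_activity.head(5)
```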

Now I'll create a new column by converting `total_mins` to a number of hours.

```python
# create new column "total_hours" by dividing total_mins by 60
df_activity["total_hours"] = round(df_activity["total_mins"] / 60)

# print the top 5 rows to confirm
df_activity["total_hours"].head(5)
```

Our data sanitizing and cleaning effort has concluded. Data is now ready to be analyzed.

Step 4: Analyse:

Applying some statistical functions to the data generates the following:

  • count — no. of rows
  • std (standard deviation)
  • mean (average)
  • min and max
  • percentiles of 25%, 50%, 75%

```python
# pull general statistics
df_activity.describe()
```

Interpreting the statistical findings:

  1. Users logged an average of 7,637 total steps per day, which is much lower than the number of daily steps recommended for general health benefits. Source: https://www.cdc.gov/physicalactivity/walking/index.htm
  2. Sedentary time makes up the majority of logged minutes, averaging 991 minutes (about 16.5 hours) and accounting for 81% of total average minutes.
  3. The average respondent burned about 2,303 calories, roughly the equivalent of 0.658 pounds of body fat (a quick check of this conversion follows below). This observation can't be interpreted much further, however, as the number of calories burned depends on numerous factors such as age, weight, daily activity level, hormones, and daily calorie intake and diet. Source: https://www.cdc.gov/healthyweight/calories/index.html
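As a quick sanity check on the pounds figure, assuming the commonly used approximation of roughly 3,500 calories per pound of body fat:

```python
# rough conversion of average daily calories burned into pounds of body fat,
# using the common ~3,500 kcal-per-pound approximation
avg_calories = 2303
print(round(avg_calories / 3500, 3))  # about 0.658
```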

Step 5: Share

In this step, we share our findings based on our analysis of the data through visualization.

5.1 Visualization of the findings

```python
# import matplotlib package
import matplotlib.pyplot as plt

# plotting histogram
plt.style.use("default")
plt.figure(figsize=(7, 5))  # specify size of the chart
plt.hist(df_activity.day_of_the_week, bins=7,
         width=0.6, color="lightskyblue", edgecolor="black")

# adding annotations and visuals
plt.xlabel("Day")
plt.ylabel("Frequency")
plt.title("No. of times users logged in a week")
plt.grid(True)
plt.show()
```

Interpreting app use frequency

Tracking the frequency of app usage through the week reveals the following insight.

We have discovered that more users track their activities during the weekday. Logging frequency seems to peak between Tuesday and Friday, with Tuesday being the most frequent logging day. This is likely because many respondents either remember to register their activity partway through the week or are at work during those days, as the frequency drops off noticeably before Tuesday and after Friday.
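The same frequencies behind the histogram can be checked in tabular form; a small sketch, with the explicit weekday ordering added purely for readability:

```python
# count the number of daily records logged on each weekday, in calendar order
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
print(df_activity["day_of_the_week"].value_counts().reindex(day_order))
```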

```python
# import matplotlib package
import matplotlib.pyplot as plt

# plotting scatter plot
plt.style.use("default")
plt.figure(figsize=(8, 6))  # specify size of the chart
plt.scatter(df_activity.total_steps, df_activity.calories,
            alpha=0.8, c=df_activity.calories,
            cmap="Spectral")

# add annotations and visuals
median_calories = 2303
median_steps = 7637

plt.colorbar(orientation="vertical")
plt.axvline(median_steps, color="Blue", label="Median steps")
plt.axhline(median_calories, color="Red", label="Median calories burned")
plt.xlabel("# of steps taken")
plt.ylabel("Calories burned")
plt.title("Calories burned for every step taken")
plt.grid(True)
plt.legend()
plt.show()
```

Calories burned for every step taken.

From the above scatter plot, we can make the following deductions.

  1. There's a positive correlation between the number of steps taken and calories burned (a quick numeric check of this follows below).
  2. Calorie burn intensifies over the range of 0 to roughly 15,000 steps, with the burn rate tapering off beyond 15,000 steps.
  3. We noted some outliers and anomalies: zero steps with zero to minimal calories burned, and one observation of more than 35,000 steps with fewer than 3,000 calories burned.

This could perhaps be explained by natural variation in the data, changes in users' usage and logging behaviour, or errors in data collection and instrument readings (i.e. miscalculations, data contamination, human error, or conversion errors).
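To put a rough number on the steps-versus-calories relationship noted above, a supplementary check using pandas:

```python
# Pearson correlation between steps taken and calories burned (supplementary check)
print(df_activity["total_steps"].corr(df_activity["calories"]))
```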

```python
# import matplotlib package
import matplotlib.pyplot as plt

# plotting a scatter plot
plt.style.use("default")
plt.figure(figsize=(8, 6))  # specify size of the chart
plt.scatter(df_activity.total_hours, df_activity.calories,
            alpha=0.8, c=df_activity.calories,
            cmap="Spectral")

# adding annotations and visuals
median_calories = 2303
median_hours = 20
median_sedentary = 991 / 60

plt.colorbar(orientation="vertical")
plt.axvline(median_hours, color="Blue", label="Median hours logged")
plt.axvline(median_sedentary, color="Purple", label="Median sedentary hours")
plt.axhline(median_calories, color="Red", label="Median calories burned")
plt.xlabel("Hours logged")
plt.ylabel("Calories burned")
plt.title("Calories burned for every hour logged")
plt.legend()
plt.grid(True)
plt.show()
```

According to this scatter plot, we can make the following assessments.

  • There is a weak positive correlation: an increase in hours logged does not necessarily translate into more calories burned. This is largely due to the average sedentary time (purple line), which sits in the 16-to-17-hour range.
  • Again, we can see a few outliers, including the same zero-value observations.

```python
# import packages
import matplotlib.pyplot as plt
import numpy as np

# calculating the total of each individual minutes column
very_active_mins = df_activity["very_active_mins"].sum()
fairly_active_mins = df_activity["fairly_active_mins"].sum()
lightly_active_mins = df_activity["lightly_active_mins"].sum()
sedentary_mins = df_activity["sedentary_mins"].sum()

# plotting pie chart
slices = [very_active_mins, fairly_active_mins, lightly_active_mins, sedentary_mins]
labels = ["Very active minutes", "Fairly active minutes", "Lightly active minutes", "Sedentary minutes"]
colours = ["lightcoral", "yellowgreen", "lightskyblue", "darkorange"]
explode = [0, 0, 0, 0.1]

plt.style.use("default")
plt.pie(slices, labels=labels,
        colors=colours, wedgeprops={"edgecolor": "black"},
        explode=explode, autopct="%1.1f%%")
plt.title("Percentage of Activity in Minutes")
plt.tight_layout()
plt.show()
```

As seen in this chart, sedentary minutes make up the biggest slice at 81.3%. Perhaps this means that users are using the app to log daily activities such as commuting, inactive movement (going from point A to point B), or running errands.

It doesn't look like the app is being used much to track fitness (e.g. running), given the small shares of fairly active (1.1%) and very active (1.7%) minutes. This observation is discouraging, as the app was designed in the first place to encourage fitness by tracking activity. It is worth remembering, though, that there are only 30 respondents in this data, which is not a sample size large enough to be representative of the average Fitbit user.
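A quick numeric check of the sedentary share, reusing the totals computed for the pie chart above:

```python
# sedentary minutes as a percentage of all logged activity minutes
total_activity_mins = very_active_mins + fairly_active_mins + lightly_active_mins + sedentary_mins
print(round(sedentary_mins / total_activity_mins * 100, 1))  # should match the ~81.3% slice
```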

Step 6: Act

In the final step of our case study, we deliver our insights and provide recommendations based on our analysis.

Here, we will revisit our business questions once again and share our high-level business recommendations.

6.1. Trends Identified:

  • The majority of logged time (81.3%) is sedentary, which suggests users are using the Fitbit app to track everyday movement rather than their health and fitness habits.
  • More users track their activities on weekdays than on weekends, perhaps because weekdays involve being active with work while weekends are spent indoors resting.

6.2. How could those trends apply to BellaBeat customers?

Both companies focus on product lines that aim to help women form healthy habits by encouraging fitness, so common recommendations involving activity and fitness apply well to BellaBeat's customers.

6.3. How could those trends influence BellaBeat's marketing strategy?

  • BellaBeat's marketing team can encourage users by educating them about the benefits of fitness, suggesting different types of exercise (e.g. a simple 10-minute workout on weekdays and a more intense session on weekends), and providing calorie intake and burn-rate information in the BellaBeat app.
  • On weekends, the BellaBeat app can also send push-notification reminders encouraging users to exercise and stay active, which could help improve customers' fitness and overall wellbeing.

Code is on GitHub.
