Analytics With Seaborne#

Open In Colab

purple-divider

Introduction#

This project explores the individual rides in a bike-sharing system spanning the larger San Francisco Bay area included in the ‘fordgobike-tripdata’ dataset of over 180,000 records.

purple-divider

Preliminary Wrangling#

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
#loading th dataset into a pandas dataframe

df = pd.read_csv(r'https://video.udacity-data.com/topher/2023/July/64ac0039_fordgobike-tripdata/fordgobike-tripdata.csv')
df.head()
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984.0 Male No
1 42521 2019-02-28 18:53:21.7890 2019-03-01 06:42:03.0560 23.0 The Embarcadero at Steuart St 37.791464 -122.391034 81.0 Berry St at 4th St 37.775880 -122.393170 2535 Customer NaN NaN No
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972.0 Male No
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989.0 Other No
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974.0 Male Yes
df.describe()
duration_sec start_station_id start_station_latitude start_station_longitude end_station_id end_station_latitude end_station_longitude bike_id member_birth_year
count 183412.000000 183215.000000 183412.000000 183412.000000 183215.000000 183412.000000 183412.000000 183412.000000 175147.000000
mean 726.078435 138.590427 37.771223 -122.352664 136.249123 37.771427 -122.352250 4472.906375 1984.806437
std 1794.389780 111.778864 0.099581 0.117097 111.515131 0.099490 0.116673 1664.383394 10.116689
min 61.000000 3.000000 37.317298 -122.453704 3.000000 37.317298 -122.453704 11.000000 1878.000000
25% 325.000000 47.000000 37.770083 -122.412408 44.000000 37.770407 -122.411726 3777.000000 1980.000000
50% 514.000000 104.000000 37.780760 -122.398285 100.000000 37.781010 -122.398279 4958.000000 1987.000000
75% 796.000000 239.000000 37.797280 -122.286533 235.000000 37.797320 -122.288045 5502.000000 1992.000000
max 85444.000000 398.000000 37.880222 -121.874119 398.000000 37.880222 -121.874119 6645.000000 2001.000000
print(df.shape)
print(df.dtypes)
(183412, 16)
duration_sec                 int64
start_time                  object
end_time                    object
start_station_id           float64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object
#making my copy of the original data set
bikeshare = df.copy()

#dropping rows containing null values from the data set
rows_to_drop = list(bikeshare[bikeshare.isna().any(axis=1)].index)

bikeshare = bikeshare.drop(rows_to_drop)
#dropping columns that wouldn't be relevant to my eplanatory analysis project
bikeshare = bikeshare.drop(['start_station_latitude', 'start_station_longitude', 'end_station_latitude', 'end_station_longitude', 'bike_share_for_all_trip'], axis = 1)
#converting some variables to a more appropriate data type 
def converter():
    bikeshare.bike_id = bikeshare.bike_id.astype(int)
    bikeshare.member_birth_year = bikeshare.member_birth_year.astype(int)
    bikeshare.start_station_id = bikeshare.start_station_id.astype(int)
    bikeshare.end_station_id = bikeshare.end_station_id.astype(int)
    bikeshare.member_gender = bikeshare.member_gender.astype('category')
    bikeshare.user_type = bikeshare.user_type.astype('category')
    bikeshare.start_time = pd.to_datetime(bikeshare.start_time)
    bikeshare.end_time = pd.to_datetime(bikeshare.end_time)
    
converter()
print(bikeshare.shape)
print(bikeshare.dtypes)
(174952, 11)
duration_sec                   int64
start_time            datetime64[ns]
end_time              datetime64[ns]
start_station_id               int64
start_station_name            object
end_station_id                 int64
end_station_name              object
bike_id                        int64
user_type                   category
member_birth_year              int64
member_gender               category
dtype: object
bikeshare.head()
duration_sec start_time end_time start_station_id start_station_name end_station_id end_station_name bike_id user_type member_birth_year member_gender
0 52185 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975 21 Montgomery St BART Station (Market St at 2nd St) 13 Commercial St at Montgomery St 4902 Customer 1984 Male
2 61854 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146 86 Market St at Dolores St 3 Powell St BART Station (Market St at 4th St) 5905 Customer 1972 Male
3 36490 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842 375 Grove St at Masonic Ave 70 Central Ave at Fell St 6638 Subscriber 1989 Other
4 1585 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074 7 Frank H Ogawa Plaza 222 10th Ave at E 15th St 4898 Subscriber 1974 Male
5 1793 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93 4th St at Mission Bay Blvd S 323 Broadway at Kearny 5200 Subscriber 1959 Male

purple-divider

What is the structure of your dataset?#

There are approximately 175,000 bike rides in this data set with 11 trip attributes (duration_sec, start_time, end_time, start_station_id, start_station_name, end_station_id, end_station_name, bike_id, user_type, member_birth_year, member_gender ). The member_gender and user_type features are categorical variables. Numeric data types are more dominant in the data set with 5 variables being integer values.

What is/are the main feature(s) of interest in your dataset?#

  • I’m interested in the time it takes for an average trip to be completed.

  • If any particular gender spends more time than the other on trips.

  • Another area of interest is the relationship between user_type and member_gender.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?#

  • I expect the duration_sec variable will be of great importance in understanding average time duration for trips.

  • Variables such as the user_type, member_gender,member_birth_year will also help in exploring relationships.

Univariate Exploration#

purple-divider

  • I will start by taking a look at some variables of interest.

binsize = 100
bins = np.arange(0, bikeshare['duration_sec'].max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = bikeshare, x = 'duration_sec', bins = bins);
plt.xlabel('Duration (sec)')
plt.title('Distibution of Trip Duration')
plt.xlim(0,3500);
../../_images/948f753022240e5c9caeb480fa45c086a3393abcf1f076c108a5137ac7ea36a4.png

Duration distribution comment 1: The distribution of the duration variable has a long tail and it is rightly-skewed.

Duration distribution comment 2: Finding the right bin size for this ditribution took a while. Reducing it to minutes or hours will be a good idea for me to explore.

#converting the duration variable from seconds to minutes

bikeshare['duration_min'] = np.round((bikeshare.duration_sec/ 60),2)
bikeshare = bikeshare.drop('duration_sec' , axis = 1)
bikeshare.head()
start_time end_time start_station_id start_station_name end_station_id end_station_name bike_id user_type member_birth_year member_gender duration_min
0 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975 21 Montgomery St BART Station (Market St at 2nd St) 13 Commercial St at Montgomery St 4902 Customer 1984 Male 869.75
2 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146 86 Market St at Dolores St 3 Powell St BART Station (Market St at 4th St) 5905 Customer 1972 Male 1030.90
3 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842 375 Grove St at Masonic Ave 70 Central Ave at Fell St 6638 Subscriber 1989 Other 608.17
4 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074 7 Frank H Ogawa Plaza 222 10th Ave at E 15th St 4898 Subscriber 1974 Male 26.42
5 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93 4th St at Mission Bay Blvd S 323 Broadway at Kearny 5200 Subscriber 1959 Male 29.88
#taking a look at the log distribution of duration to help determine a suitable bin size
np.log(bikeshare['duration_min']).describe()
count    174952.000000
mean          2.140649
std           0.702727
min           0.019803
25%           1.682688
50%           2.140066
75%           2.576422
max           7.250728
Name: duration_min, dtype: float64
#log transformation plot
binsize2 = 0.1
bins = 10 ** np.arange(0,np.log(bikeshare['duration_min'].max())+binsize2 , binsize2)
ticks =  [0.1, 0.3, 1 , 3, 10, 30, 100, 300, 1000]
labels = ['{}'.format(v) for v in ticks]


plt.figure(figsize=[8, 5])
plt.hist(data = bikeshare, x = 'duration_min', bins = bins)
plt.xscale('log')
plt.xticks(ticks, labels)
plt.xlabel('Duration (min)')
plt.title('Distibution of Trip Duration')
plt.xlim(0.02,1500);
../../_images/7caebecfcbd6bbaae09b51f5f7d8d7b33b23c8755e5c0f5627fe180ff542f365.png

Duration distribution comment 3: A log scale plot helps to view the distrbution more closely, there seems to be a few data points greater than where majority of the distribution lies(1- 100 minutes). An observed peak is seen at 10 minutes.

purple-divider

  • I will proceed to take a look at the user_type variable. Particulary, I will explore the distribution of user types across the data set.

base_color = sb.color_palette()[0]
plt.figure(figsize = [8, 5])

sb.countplot(data = bikeshare, x = 'user_type', color = base_color)


# percentage lable on the bar chart
n_trips = bikeshare.shape[0]
user_counts = bikeshare['user_type'].value_counts()
locs, labels = plt.xticks()  


for loc, label in zip(locs, labels):
    count = user_counts[label.get_text()]
    percentage = '{:0.1f}%'.format(100*count/n_trips)
    plt.text(loc, count-8, percentage, ha='center', color='r')
    
plt.title('Bike Trips by User category')
plt.xlabel('User Type');
../../_images/2b98826c2f24c0323ff328ebe13715d5fbe5c7c197c9f7f47f0619973cedfca1.png

Category distribution comment 1: A majority of trips were done by subscribers (90.5%) while customer category accounts for below 10% of the trips taken.

purple-divider

  • Next is to explore the start_time and end_time variables. Particularly I want to understand the start time and end time distribution of trips between hours of the day, I want to check for similarities in peak periods if any.

#extracting the starting hour and ending hour of each trip to understand trip distrbution per hour of the day
bikeshare['start_time_hr'] = bikeshare['start_time'].dt.time.astype(str)
bikeshare['end_time_hr'] = bikeshare['end_time'].dt.time.astype(str)


bikeshare['start_time_hr'] = bikeshare['start_time_hr'].str[0:2]
bikeshare['end_time_hr'] = bikeshare['end_time_hr'].str[0:2]
bikeshare.head()
start_time end_time start_station_id start_station_name end_station_id end_station_name bike_id user_type member_birth_year member_gender duration_min start_time_hr end_time_hr
0 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975 21 Montgomery St BART Station (Market St at 2nd St) 13 Commercial St at Montgomery St 4902 Customer 1984 Male 869.75 17 08
2 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146 86 Market St at Dolores St 3 Powell St BART Station (Market St at 4th St) 5905 Customer 1972 Male 1030.90 12 05
3 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842 375 Grove St at Masonic Ave 70 Central Ave at Fell St 6638 Subscriber 1989 Other 608.17 17 04
4 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074 7 Frank H Ogawa Plaza 222 10th Ave at E 15th St 4898 Subscriber 1974 Male 26.42 23 00
5 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93 4th St at Mission Bay Blvd S 323 Broadway at Kearny 5200 Subscriber 1959 Male 29.88 23 00
base_color = sb.color_palette()[0]
plt.figure(figsize = [14, 12])
plt.subplot(2, 1, 1)
sb.countplot(data = bikeshare, x = 'start_time_hr', color = base_color)

# percentage lable on the bar chart
n_trips = bikeshare.shape[0]
user_counts = bikeshare['start_time_hr'].value_counts()
locs, labels = plt.xticks()  


for loc, label in zip(locs, labels):
    count = user_counts[label.get_text()]
    percentage = '{:0.1f}%'.format(100*count/n_trips)
    plt.text(loc, count-8, percentage, ha='center', color='r')
    
plt.title('Bike Trips Start Time per hour of the day')
plt.xlabel('Hours of the day (24hr)');

plt.subplot(2, 1, 2)

sb.countplot(data = bikeshare, x = 'end_time_hr', color = base_color)

# percentage lable on the bar chart
n_trips = bikeshare.shape[0]
user_counts = bikeshare['end_time_hr'].value_counts()
locs, labels = plt.xticks()  


for loc, label in zip(locs, labels):
    count = user_counts[label.get_text()]
    percentage = '{:0.1f}%'.format(100*count/n_trips)
    plt.text(loc, count-8, percentage, ha='center', color='r')
    
plt.title('Bike Trips End time per hour of the day')
plt.xlabel('Hours of the day (24hr)');
../../_images/4abca0821c0b77b2280b22961c0988b35a1bf4b604c01778e6b2c563e46cd8e1.png

Hourly distribution comment 1: Both distributions appears bimodal whereby peak periods are observed at 8am and 9am as well as 5pm to 6pm with the most busy period recorded at 5pm for the start time. The trip end time also mimics the start time distrbution with a peak period of 5pm as well.

purple-divider

  • What gender is predominant in larger San Francisco Bay area when it comes to bike trips? Let me take a look at this.

#plotting age distribution
base_color = sb.color_palette()[0]
plt.figure(figsize = [8, 5])

sb.countplot(data = bikeshare, x = 'member_gender', color = base_color)


# percentage lable on the bar chart
n_trips = bikeshare.shape[0]
user_counts = bikeshare['member_gender'].value_counts()
locs, labels = plt.xticks()  


for loc, label in zip(locs, labels):
    count = user_counts[label.get_text()]
    percentage = '{:0.1f}%'.format(100*count/n_trips)
    plt.text(loc, count-8, percentage, ha='center', color='r')
    
plt.title('Gender Distribution Across Dataset')
plt.xlabel('Gender Type');
../../_images/e3102686bbd47d12418ad260f8d09d0a1f8362eeb2a53b78b0af9fbd06eb79c0.png

Gender distribution comment 1: As seen above, the male gender appears to take more bike trips than the female and other gender categories present in the dataset. More than half(75%) of the bike trips were actioned by men and less than 3% belong to the other gender category.

purple-divider

  • Another variable of interest is the member_birth_year. With this variable I can take a look at the age distribution if trips within the dataset.

#writing a function to help extract age from the 'member birth year' variable
from datetime import date

def extract_age(dates):
    today = date.today()
    return today.year - dates

bikeshare['member_age'] = bikeshare.member_birth_year.map(extract_age)
plt.figure(figsize=[8, 5])

binsize = 5
bins = np.arange(0, bikeshare['member_age'].max()+binsize, binsize)

plt.hist(data = bikeshare, x = 'member_age', bins = bins)


plt.xlim(20, 70)
plt.xlabel('Age')
plt.title('Member Age Distribution');
../../_images/5fd2cc9ba7291c376c16dade4d5e9f8f7f15133edbb9addcd99f5c01dba7f414.png

Member age distribution comment 1: A right-skewed distribution is observed for the age variable with a peak between the ages of 30 to 35.

purple-divider

  • Stations should experience more activity in comparison to each other. Patricularly I want to see the busiest stations in reference to start and end trips.

#setting color palette and figure size for the plots
base_color = sb.color_palette()[0]
plt.figure(figsize = [8, 10])

#Creation of plot for top five busiest bike starting stations
plt.subplot(2, 1, 1)

station_order = bikeshare['start_station_name'].value_counts().iloc[:5].index
sb.countplot(data = bikeshare, y = 'start_station_name', color = base_color, order = station_order)


    
plt.title('Top five busiest start stations')
plt.ylabel('Station Name')

#Creation of plot for top five busiest bike starting stations
plt.subplot(2, 1, 2)

station_order = bikeshare['end_station_name'].value_counts().iloc[:5].index
sb.countplot(data = bikeshare, y = 'end_station_name', color = base_color, order = station_order)


    
plt.title('Top five busiest end stations')
plt.ylabel('Station Name');
../../_images/a188cb85db7b4fb1af1b8fb70ece1f1b5b7874d4fd2f36f796a368b8867fe483.png

Busiest stations comment 1: The top two busiest end stations are ‘San Francisco Caltrain Station 2’ and ‘Market St at 10th St’ with both of them switching postions for the start station. Considering the whole dataset, San Francisco Caltrain Station 2 is busiest bike station.

purple-divider

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?#

DRURATION('duration_sec')

  • The duration variable initially had large range pof values due to it being in seconds, a conversion to minutes helped me see things a little bit more clearly.

  • The duration variable followed a long-tialed distribution so I had to perform a log tranformation to the ‘X’ variable for a better drilled-down view of the distribution.

USER CATEGORY('user_type')

  • The duration variable followed a long-tialed distribution so I had to perform a log tranformation to the ‘X’ variable for a better drilled-down view of the distribution.

TIME('start_time','end_time')

  • Both variables mimic same ditribution pattern. Peak hours are present in both the first and second half of each day with the second half of the day recording the highest percentage trip activity.

GENDER('member_gender')

  • This variable followed an expected distrbution flow with ‘males’ being the predominant gender type.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?#

DRURATION('duration_min')

  • The duration variable initially had large range of values due to it being in seconds, a conversion to minutes helped me see things a little bit more clearly. This further helped determine a better bin size to apply.

TIME('start_time','end_time')

  • I engineered the start_time_hr and end_time_hr variables to help visualise the distribution of trips by hours of the day.

purple-divider

Bivariate Exploration#

purple-divider

  • I start my bivariate explorations by exploring the relationship between trip duration and gender category as well as trip duration against user type category

# scatter plot of duration vs user type
sample = bikeshare[bikeshare['duration_min'] < 70]
base_color = sb.color_palette()[0]
plt.figure(figsize = [18, 4])

#Plot relationship between duration and user category
plt.subplot(1, 2, 1)

sb.violinplot(data= sample, x = 'user_type', y = 'duration_min', color=base_color, inner='quartile')


plt.xlabel('User Type')
plt.ylabel('Trip Duration (min)')
plt.title('Duration Against User Type')

#Plot relationship between duration and gender type
plt.subplot(1, 2, 2)

sb.violinplot(data= sample, x = 'member_gender', y = 'duration_min', color=base_color, inner='quartile')


plt.xlabel('Gender Type')
plt.ylabel('Trip Duration (min)')
plt.title('Duration Against Gender Type');
../../_images/a3100e06ab3ffb6fe4bcd4f0c540f676586088b9a056f84a1cd4a94f94bece5e.png

A majority of the subscriber category members’ travel duration fall within the range of 1 to 10 minutes with over 50% falling below 10 minutes. The customer category have slightly over 25% of them with trip durations lower than 10 minutes while the rest fall above the 10 minutes cap. A more spreadout trip duration distribution is obeserved in the customer category type.
The ‘male’, ‘female’ and ‘other’ gender types show similiar relationships with the trip duration variable regardless of their respective size. `

purple-divider

  • Next is to inspect the relationship between age distibution and trip duration

#Plotting of Trip Duration against age
plt.figure(figsize = [8, 6])
plt.scatter(data = bikeshare, x = 'member_age', y = 'duration_min', alpha = 1/12)
plt.xlabel('Age')
plt.xlim(20, 70)

plt.ylabel('Trip Duration (min)')
plt.title('Duration Against User Age');
../../_images/15245d0b391fdcd7327dd312b5725ecacdf9416b494ad7ae99dfd853d6975559.png

A general overview shows that the older age distribution seldom embark on longer trips as opposed to the younger ages. Age group 33 shows the strongest relationship with the most people of this age group embarking on longer trips.

purple-divider

  • Is there a relationship between the trip durations and the hour the day people start their trips or time of the day members end their trips? I will inspect this using the a plot of trip duration against start hour and end hour variables.

plt.figure(figsize = [18, 8])


#Plot relationship between duration and start time
plt.subplot(1, 2, 1)
plt.scatter(data = bikeshare, y = 'start_time_hr', x = 'duration_min', alpha = 1/12)
plt.ylabel('Start Hour')
plt.xlabel('Duration (min)')
plt.title('Start Hour Against Duration')

#Plot relationship between duration and end time
plt.subplot(1, 2, 2)
plt.scatter(data = bikeshare, y = 'end_time_hr', x = 'duration_min', alpha = 1/12)
plt.ylabel('End Hour')
plt.xlabel('Duration (min)')
plt.title('End Hour Against Duration');
../../_images/67f743cbc596d4990e4981f3336b79ed4b3a8c1e5936b88f64cc58714c25a8db.png

In reference to the start hour vs duration plot, relationship gets stronger between the hours leading to noon and trip duration. In other words, the closer the time is to 12pm, the more likely that bike trips would start off.
In reference to the end hour vs duration plot, a stronger relationship is observed with trips that end at later hours of the day. In other words, longer duration trips seldom end at the late hours of the day as opposed to the early hours of the day.

purple-divider

  • Is there a pattern or correlation present between categorical and numeric data types? A Matrx plot of numeric agaist categorical data can show me at a galance.

#assigning all variables to their respective data categories

numeric_vars = ['duration_min','member_age','start_time_hr','end_time_hr']
categoric_vars = ['member_gender','user_type', ]
#Creating a matrix plot to view numeric against catgorical data
def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    sb.boxplot(x=x, y=y, color=default_color)

plt.figure(figsize = [15, 10])
g = sb.PairGrid(data = sample, y_vars = [ 'member_age', 'duration_min'], x_vars = categoric_vars, size = 3, aspect = 1.5)
g.map(boxgrid);
<matplotlib.figure.Figure at 0x7fd60cd08a90>
../../_images/e64c26f05bb9ebb05a54f731e276d0618878a82bc63a97f8aa46818d104ba4e1.png

This clearly informs that the average age of all gender and customer category is less that 40 yrs albeit several outliers present in the dataset.

From the gender point of view, average trip duration is less than 10 minutes albeit several outlier present in the distribution. From the user_type point of view, average trip druation for customer category is over 10 minutes while that od the subscriber category are less than 10 minutes.

purple-divider

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?#

DRURATION('duration_min' Vs 'member_age')

  • The relationship between age and trip duration revealed that members of age 33 engaged in more trips as well as longer trip durations.

GENDER('duration_min' Vs 'member_gender')

  • Surprisingly, the male gender spend lesser time on thier bike trips in comparison to other gender categories, alibeit being the predominant gender across the dataset.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?#

START AND END TIME('end_station_hr','start_station_hr' Vs 'duration_min')

  • It is quite interesting to see that members mostly start their trip between hours close to 12 pm and another relationship between trip duration and end station time shows a larger number of people end their trips as the day progresses into late night .

purple-divider

Multivariate Exploration#

#Plotting a stacked barplot for gender category of busiest start stations by trip durations
sb.set(rc={"figure.figsize":(12, 9)})
station_order = sample['start_station_name'].value_counts().iloc[:5].index
sb.barplot(data = sample, x = 'duration_min', y = 'start_station_name', hue = 'member_gender', order = station_order)
plt.legend()

plt.ylabel('Start Station Name')
plt.xlabel('Duration (min)')
plt.title('Gender Activity Across Top 5 Busiest Start Stations');
../../_images/0bbc34816ac9ebf6c7505d13936e5a797ad59a776f21e9172272b8aab4924d94.png
#Plotting a stacked barplot for gender category of busiest end stations by trip durations
sb.set(rc={"figure.figsize":(12, 9)})
station_order = sample['end_station_name'].value_counts().iloc[:5].index
sb.barplot(data = sample, x = 'duration_min', y = 'end_station_name', hue = 'member_gender', order = station_order)
plt.legend()
plt.ylabel('End Station Name')
plt.xlabel('Duration (min)')
plt.title('Gender Activity Across Top 5 Busiest End Stations');
../../_images/454ce33347aa89cea2497951fb528b8df835ba6f6ffe0f3781e8445233f05219.png

purple-divider

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?#

TOP_START&END_STATION

  • For a majority of the busiest station, whether start or end, the male gender category seem to keep things busy the most.

Were there any interesting or surprising interactions between features?#

TOP_START_STATION

  • More male gender type in the dataset doesn’t directly interprets to a dominance in all variables, this came as a surprise. Here in the top 5 busiest start stations, it can be seen that the Other gender category keeps the ‘San Francisco Caltrain Station 2’ and ‘Powell St BART Station (Market St at 4th St) stations the busiest in terms of trip durations at start stations.

TOP_END_STATION

  • Similarly, the Other gender category kept ‘Market St at 10th St’ and ‘San Francisco Caltrain Station 2 (Townsend St at 4th St)’ stations most busy when it comes to the trip durations recorded for to five end stations as opposed to the male gender category.

purple-divider

Conclusions#

  • Average trip duration for the entire dataset hovers around 10 minutes. Most people start their trips for the day at 8am while evening trips are mostly started at 5pm. A percentage of 75 male gender type dominate the data set and majority (95%)of the bike riders are Subscriber category user types. Over 50% of the Subscriber category users travel less than a total of 10 minutes while the Customer user types record more minutes during bike trips.

  • Users of age range 32 to 35 ride bikes the most and average age across all categories is below 40yrs.

  • A surprising outcome is that the Other gender category took longer trips at some of the busiest stations, despite it being of the lowest percentage of gender type present in the dataset.

# Converting my dataframe to a csv for later use such as the presentation slide decks
bikeshare.to_csv('bike_rides.csv',index=False, encoding = 'utf-8')