Exploratory Data Analysis(EDA)

Exploratory Data Analysis(EDA)#

Introduction#

This project will be investigating the ‘No-show appointments’ dataset for hospital appointment attendance status and other characteristics for 100,000 patients in Brazil, South America.

In particular, I will be exploring trends concerning attendance rates between male and females, factors that may or may not have influenced attendance such as SMS prompts, welfare enrolments and Handicap level, average age of attendees and non-attendees.

Dataset Description#

This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row. ‘ScheduledDay’ tells us on what day the patient set up their appointment. ‘Neighborhood’ indicates the location of the hospital. ‘Scholarship’ indicates whether or not the patient is enrolled in Brasilian welfare program Bolsa Família..

Question(s) for Analysis#

What age ranges are more likely to honor appointments as well as the ages that most likely wouldn’t?
Does SMS prompts play an important role in appointment showup rates?
What percentage of males and females are enrolled in welfare programs?

Dependencies and Data#

The following code blocks contain import for dependencies used in this project including the CSV file.

purple-divider

#This cell loads in all packages used throughout the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
df = pd.read_csv('https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd2e9a_noshowappointments-kagglev2-may-2016/noshowappointments-kagglev2-may-2016.csv')

Data Properties#

The following cells contains statements used to check the various properties of the dataframe such as;

Column names
Data Types in each column
Number of Rows and Columns
Columns with missing values

A quick description command on the dataframe will also help preview trends at a glance.

purple-divider

#The number if rows and columns are given below to give an idea on how large the dataset is.
df.shape

(110527, 14)

#information about the dataframe such as data types in each column is detailed below.
df.info()

df.describe()

A quick glance at the dataset description is clearly highlighting that there is at least one occurrence of an invalid age value(-1.0) in the age column. This will be treated in the Data cleaning section.

Data Wrangling#

There will be inspection of the dataset in this section. Checks for duplicates and null values are necessary checks in order to determine the amount and type of cleaning and conversions that would be needed.

This section will also dive into data cleaning processes after proper checks.

purple-divider

#The first ten rows in the dataset is queries to give and overview on what the data looks like.
df.head(10)

A view of what the first ten rows in our dataset looks like to give an idea of what is contained within the dataset.

#this statement is used to get the total number of duplicated rows: it appears we have none 

df.duplicated().sum()

Total number of duplicate rows in our data is computed in the cell above as a rule for data inspection.

# the '.info()' gives necessary information to determine if any column has null values present.
df.info()

A quick look at the information attributes for the entire dataframe.

Data Cleaning#

purple-divider

#Dropping columns that wouldn't be of use in my analysis

df.drop(['PatientId', 'AppointmentID'], axis = 1, inplace = True)

Following the data preview done in the previous section, it appears that there are some columns that wouldn’t be needed as regards the purview of my analysis such as the ‘PatientId’ and ‘AppointmentID’ columns . Dropping them will enable me focus on the sections that are of use to me.

#statement converting all dataframe column to lowercase for easy access

df.columns= df.columns.str.lower()

A good practice is to convert all column names to lower case because it is somewhat stressfull to always check the column name every time you intend to access them during analysis.

#Renaming of columns and making the changes permanent with the 'inplace' command

df.rename(columns = {'no-show': 'no_show','hipertension':'hypertension','handcap':'handicap'}, inplace = True)

After observing the cloumn names in the previous section, column ‘no-show’ seems to have a python inbuilt operator as part of its naming. It is of great importance to rename this column so as to avoid conflicts and errors when referencing column ‘no-show’.

Column ‘hipertension’ and ‘handcap’ are spelled wrongly so I decided to make a correction to ‘hypertension’ and ‘handicap’ respectively.

#correcting data entry error by renaming 'Yes' to 'no' and renaming 'No' to 'yes'

df['no_show'].replace({'No' : 'yes'}, inplace = True)
df['no_show'].replace({'Yes' : 'no'}, inplace = True)

Documentation states there was an error in the no_show column where ‘Yes’ was recorded as ‘No’ and ‘No’ was entered as ‘Yes’. This cell renames the values of the no_show column

#Converting Date type data in the dataframe
def date_convert():
    df['scheduledday'] = pd.to_datetime(df['scheduledday'])
    df['appointmentday'] = pd.to_datetime(df['appointmentday'])
    return df

After data inspection, it was obesrved that the ‘scheduledday’ and ‘appointmentday’ columns are dates but their data types do not correspond, they appear to be strings which should not be so. Dates are to be treated as datetime formats for smooth analysis process.

#checking datafarme properties for confirmation
date_convert()

df.info()

This cell checks to confirm that the date conversions were successful.

#Querying invalid age entry from the dataframe

invalid_age = df[(df.age < 1)]
invalid_age.describe()

Taking a look at the category of data that has invalid age entry as spoted while inspecting the data properties (see section Dependencies and Data).
As seen, we have a total of 3540 rows with invalid age input of less than 1. A further inspection of this category is necessary inorder to make a decision on whether to delete the entries from the dataset or fill the wrong values.

fig, ax = plt.subplots(2,4,figsize=(20,10))
explode = (0.1, 0)
colors = 'orange'

ax[0,0].hist(invalid_age.gender, color = colors)
ax[0,0].set_title('Gender')
ax[1,0].hist(invalid_age.handicap, color = colors)
ax[1,0].set_title('Handicap')
ax[1,1].hist(invalid_age.hypertension, color = colors)
ax[1,1].set_title('Hypertension')
ax[1,2].hist(invalid_age.scholarship, color = colors)
ax[1,2].set_title('Scholarship')
ax[0,1].hist(invalid_age.alcoholism, color = colors)
ax[0,1].set_title('Alcoholism')
ax[0,2].hist(invalid_age.diabetes, color = colors)
ax[0,2].set_title('Diabetes')
ax[0,3].hist(invalid_age.sms_received, color = colors)
ax[0,3].set_title('Sms_received')
ax[1,3].hist(invalid_age.no_show, color = colors)
ax[1,3].set_title('No_Show');

This cell contains an overview plot of other characteristics of the group with ages less than 1. These plots give enough details to compare with the general characteristics of our entire dataframe. I am looking to inspect if there are any peculiarities about this age group by comparing their characteristics with that of the entire datafarme.

fig, ax = plt.subplots(2,4,figsize=(20,10))
explode = (0.1, 0)

ax[0,0].hist(df.gender)
ax[0,0].set_title('Gender')
ax[1,0].hist(df.handicap)
ax[1,0].set_title('Handicap')
ax[1,1].hist(df.hypertension)
ax[1,1].set_title('Hipertension')
ax[1,2].hist(df.scholarship)
ax[1,2].set_title('Scholarship')
ax[0,1].hist(df.alcoholism)
ax[0,1].set_title('Alcoholism')
ax[0,2].hist(df.diabetes)
ax[0,2].set_title('Diabetes')
ax[0,3].hist(df.sms_received)
ax[0,3].set_title('Sms_received')
ax[1,3].hist(df.no_show)
ax[1,3].set_title('No_Show');

Histograms plots for general characteristics of the ‘No-show appointments’ dataset.

In comparison to the histogram plots for the age group less than 1 years of age, it can be deduced that there are a few differences in trends accross all characteristics.

Deduction Summary#

Generally, more people showup for their appointments in both data categories.
Scholarship shows same type of distribution in both groups.
There are more females than males in the entire dataset as opposed to more males than females in the wrong age category.
‘alcoholism’, ‘hypertension’,’diabetes’ and ‘handicap’ columns recorded in 0 cases for all patients in the wrong-age-entry category while other patients are classified under one or more of these columns in the general dataset.

Deductions based on the comparisons have informed my decision to fill the wrong-age-entry category with the average age accross the original dataset due to the fact that the both groups largely share same characteristics

#Replacing invalid ages with the mean age value accross the entire dataset.

df['age'].replace({ -1 : df.age.mean(), 0 : df.age.mean()}, inplace = True)

All invalid ages are replaced in the cell above and making the changes inplace for the rest of my analysis.

# A good practice is to check if changes have been effected

df.describe()

#checkinng if there are still ages less that 1 remaining in the dataframe

df[(df.age < 1)]['age'].count()

Both cells above helps to inform that changes have been successfully effected accross the dataframe

Exploratory Data Analysis#

purple-divider

What age ranges are more likely to honor appointments as well as the ages that most likely wouldn’t?

Is attendance a function of age?

#querying two categories from our orginial datasets based on those that honor and did not honor appointments

showup = df.query('no_show == "yes"')
nshowup = df.query('no_show == "no"')

To understand if honoring appointments is a function of age, we need to break down our dataframe in to sections for purpose of analysis. I start by creating a query for all patients that showedup and those that did not showup being the subject of my question.

#Plotting statements using matplolib
colors = ['skyblue', 'red']
locations = [1, 2]
height = [showup['age'].mean(), nshowup['age'].mean()]
labels = ['Yes', 'No']
plt.figure(figsize=(15,7))
plt.bar(locations, height , color = colors, tick_label=labels)
plt.legend(loc="upper right")
plt.title("Average Age of Clients that Honored & Didn't Honor Appointments")
plt.xlabel('Showup Response')
plt.ylabel('Age');

Based on the chart above, it shows that slightly older people tend to honor medical appointments more. in order words, the closer and individual gets to age 40(i.e age 36,37,39 etc), the likelier the individual will honor hospital appintments.

It would be interesting to see how this varies amongst genders as I continue my analysis.

#A further breakdown of the query created earlier

showup_f = showup.query('gender == "F"')
nshowup_f = nshowup.query('gender == "F"')
showup_m = showup.query('gender == "M"')
nshowup_m = nshowup.query('gender == "M"')

I am breaking down the first set of queries(‘showup’ and ‘nshowup’) into subcategories for males and females in order to analyse the distribution of attendance based on average age for the new subcategories.

#Plotting statements using matplolib

colors = ['skyblue']
color1 = ['red']
locations = [1, 2]
mean_s = [showup_f.age.mean(), showup_m.age.mean()]
mean_ns = [nshowup_f.age.mean(), nshowup_m.age.mean()]
labels = ['Female', 'Male']
plt.figure(figsize=(15,7))
plt.bar(locations, mean_s, color = colors, tick_label=labels, label="Yes")
plt.bar(locations, mean_ns, color = color1, tick_label=labels, label="No")
plt.legend(loc="upper right")
plt.title("Average Age of Clients that Honored & Didn't Honor Appointments by Gender")
plt.xlabel('Gender')
plt.ylabel('Age');

This communicates more insight on the connection between age and attendance according to gender. Some key notes include;

The female distribution conforms closely with the general trend observed in the previous chart. Average age of females that honor appointments is between 40 and 41, while the average age of females that fail to showup is between age 35 and 36.
The male distribution however shows a general attribute of a lesser average age of attendance and non attendance.

Do SMS prompts play an important role in appointment showup rates?

# A look at the number of people that received SMS in showup and noshowup categories

sms_show = showup.query('sms_received == 1')['sms_received'].count()
sms_nshow = nshowup.query('sms_received == 1')['sms_received'].count()

sms_show, sms_nshow

Investigating the influence of SMS prompts require drilling further into the first queries created for our analysis(‘showup’ and ‘nshowup’). The cell above contains statements generating the number of people that receives SMS and showedup for their appointments and those that didn’t show despite receiving SMS prompts

# A look at the number of people that did not receive SMS in showup and noshowup categories

nsms_show = showup.query('sms_received == 0')['sms_received'].count()
nsms_nshow = nshowup.query('sms_received == 0')['sms_received'].count()

nsms_show, nsms_nshow

#This section helps to position the plots below side by side as well as the overall  size of the plot displayed
fig, ax = plt.subplots(1,2,figsize=(15,7))
explode = (0.1, 0)

#This plot(1) is for the category of people that received SMS prompts
color = ['green','pink']
locations = [1, 2]
show_status = [sms_show, sms_nshow]
labels = ['Yes', 'No']
ax[0].pie(show_status, autopct = '%1.1f%%', labels = labels, colors = color)
ax[0].legend(loc="upper left")
ax[0].set_title("Percentage of Honored &\nUnhonored Appointments\nAmongst SMS Receivers")


#This plot(2) is for the category of people that did not receive SMS prompts 
color = ['green','pink']
locations = [1, 2]
show_status = [nsms_show, nsms_nshow]
labels = ['Yes', 'No']
ax[1].pie(show_status, autopct = '%1.1f%%', labels = labels, colors = color)
ax[1].legend(loc="upper right")
ax[1].set_title("Percentage of Honored &\n Unhonored Appointments\nAmongst non-SMS Receivers");

Clearly, SMS prompts doesn’t seem to influence attendance in any way.

A large percentage(83.3%) of people that didn’t receive SMS prompts honored their appointments while 16.7% did not.
72.4% of SMS receivers honored their appointment but this is obviously lesser than the attendees that did not receive any SMS prompts whatsoever.

What percentage of males and females are enrolled in welfare programs?

# The following statement groups the dataframe into different gender category and calculates
#the sum of scholarships in each category using integer references from the resulting array.

schshp_f = df.groupby(['gender'])['scholarship'].sum()[0]
schshp_m = df.groupby(['gender'])['scholarship'].sum()[1]

schshp_m, schshp_f

Exploring the distribution of scholarship enrolments by gender involves cateorising our dataframe into the existing gender distribution and summing up the scholarship values because the data type in the ‘scholarship’ column is ‘int’ and it consists of two distinct values(0 and 1).

#All male and female count generated using the gender column and possitional selection for each gender value.

male_count = df.groupby(['gender'])['gender'].count()[1]
female_count = df.groupby(['gender'])['gender'].count()[0]

male_count, female_count

Since I intend to investigate the percentage distribution, the total count for both male and female genders in our dataset needs to be calculated as given above.

#calculating the difference between males and females enrolled in  welfare programs and those that were not.

schshp_diff_m = male_count - schshp_m
schshp_diff_f = female_count - schshp_f

schshp_diff_m, schshp_diff_f

Investigating using percentage was chosen because the number of females in the distribution far outnumbers the males so plotting aggregates like count may not communicate the exact picture of the welfare enrollment distribution amongst genders.

#This section helps to position the plots below side by side as well as the necessary size of the plot displayed
fig, ax = plt.subplots(1,2,figsize=(15,7))
explode = (0.1, 0)

#Plot() shows the percentage of schorlasrship recipents amongst males 
color = ['orange','skyblue']
locations = [1, 2]
schship_status = [schshp_m, schshp_diff_m]
labels = ['Yes', 'No']
ax[0].pie(schship_status, autopct = '%1.1f%%', labels = labels, colors = color)
ax[0].legend(loc="upper left")
ax[0].set_title("Percentage of scholarship\nrecipients amongst Males")


#Plot(2) shows the percentage of schorlasrship recipents amongst females 
color = ['orange','skyblue']
locations = [1, 2]
schship_status = [schshp_f, schshp_diff_f]
labels = ['Yes', 'No']
ax[1].pie(schship_status, autopct = '%1.1f%%', labels = labels, colors = color)
ax[1].legend(loc="upper right")
ax[1].set_title("Percentage of scholarship\nrecipients amongst Females");

Some details from the chart include;

Females have a higher percentage of enrollment in wellfare programmes than males. More than twice the percentage of enrolled men to be precise.
Only 5.2% of the male distribution are enrolled in welfare programmes.

Conclusions#

In conclusion, investigation shows through question one that according to the data provided, people who are middleaged(35 to 41) tend to honors appointments more. Although slightly younger observations were highlighted in the male category. Question two investigated the influence of SMS prompts on attendance of appiontment. Analysis shows that SMS prompts do not neccessarily affect the percentage of people that honored their appointments. Lastly, wellfare enrollment distribution amongst gender was investigated and analysis revealed that a lot lesser percentage of men are enrolled in wellfare programmes as opposed to women. However, there is a generally low enrolment rate for accross the dataset.

Limitations#

Data not capturing other gender identities. Male and females are not the only genders existing now. It would’ve helped to explore some peculiarities and differences associated with other gender identities.

References#

1. https://matplotlib.org/ 2. https://pandas.pydata.org/ 3. https://stackoverflow.com/ 4. https://datatofish.com/

from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

Exploratory Data Analysis(EDA)

Contents

Exploratory Data Analysis(EDA)#

Introduction#

Dataset Description#

Question(s) for Analysis#

Dependencies and Data#

Data Properties#

Data Wrangling#

Data Cleaning#

Deduction Summary#

Exploratory Data Analysis#

Conclusions#

Limitations#

References#