Analyze A/B Test Results#
Introduction#
A/B tests are commonly performed by data analysts and data scientists. For this project, I will be working to understand the results of an A/B test run by an e-commerce website. The goal is to work through this notebook to help the company understand whether they should:
Implement the new webpage,
Keep the old webpage, or
Perhaps run the experiment longer to make their decision.
Part I - Probability#
To get started, let’s import our libraries.
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
import requests
# Seed both random generators for reproducibility (NumPy's generator is the
# one actually used in the simulations below)
random.seed(42)
np.random.seed(42)
ToDo 1.1#
Below is a description of the data, which contains a total of 5 columns:

| Data column | Purpose | Valid values |
|---|---|---|
| user_id | Unique ID | Int64 values |
| timestamp | Time stamp when the user visited the webpage | - |
| group | In the current A/B experiment, the users are categorized into two broad groups. | ['control', 'treatment'] |
| landing_page | It denotes whether the user visited the old or new webpage. | ['old_page', 'new_page'] |
| converted | It denotes whether the user decided to pay for the company's product. Here, 1 means the user paid and 0 means they did not. | [0, 1] |
a. Read in the dataset from the ab_data.csv file and take a look at the top few rows:
url = 'https://video.udacity-data.com/topher/2017/December/5a32c9a0_analyzeabtestresults-2/analyzeabtestresults-2.zip'
# Download the file
response = requests.get(url)
with open("analyzeabtestresults-2.zip", "wb") as file:
file.write(response.content)
# unzip project file into current directory
! unzip "analyzeabtestresults-2.zip"
# read the file into a pandas dataframe
df = pd.read_csv("AnalyzeABTestResults 2/ab_data.csv")
b. Use the cell below to find the number of rows in the dataset.
df.shape
df.head()
c. The number of unique users in the dataset.
df['user_id'].nunique()
# (and, for reference, the number of unique landing pages)
df['landing_page'].nunique()
d. The proportion of users converted.
df.converted.mean()
e. The number of times the “group” is treatment but the “landing_page” is not new_page.
wrong_match1 = df.query('group == "treatment" and landing_page != "new_page"')['user_id'].count()
wrong_match2 = df.query('group == "control" and landing_page != "old_page"')['user_id'].count()
wrong_match1, wrong_match2
# inspecting the mismatched rows in our df, grouped by (group, landing_page)
df.groupby(['group','landing_page']).count()
f. Do any of the rows have missing values?
df.info()
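An explicit per-column count of missing values gives the same answer as scanning the non-null counts in df.info():

# Count missing values per column; all zeros means no missing data
df.isnull().sum()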
ToDo 1.2#
In a particular row, the group and landing_page columns should have one of the following acceptable combinations of values:

| user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|
| XXXX | XXXX | control | old_page | X |
| XXXX | XXXX | treatment | new_page | X |

That is, control group users should be matched with old_page, and treatment group users should be matched with new_page.

However, for the rows where treatment does not match with new_page, or control does not match with old_page, we cannot be sure whether such rows truly received the new or old webpage.
**a.** Create a new dataset that meets these specifications. Store the new dataframe in df2.
# Remove the inaccurate rows, and store the result in a new dataframe df2
wrong_match_index1 = df.query('group == "treatment" and landing_page != "new_page"').index
wrong_match_index2 = df.query('group == "control" and landing_page != "old_page"').index
# drop() returns a new dataframe, so df2 is a genuine copy rather than an
# alias of df
df2 = df.drop(wrong_match_index1).drop(wrong_match_index2)
# Double Check all of the incorrect rows were removed from df2 -
# Output of the statement below should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]
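As a side note, an equivalent one-step way to build df2 from the original df (a sketch) keeps only the rows where the group and landing_page columns agree:

# Keep only treatment/new_page and control/old_page rows
df2 = df[(df['group'] == 'treatment') == (df['landing_page'] == 'new_page')].copy()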
ToDo 1.3#
a. How many unique user_ids are in df2?
df2['user_id'].nunique()
b. There is one user_id repeated in df2. What is it?
# The user_id that appears more than once
df2.loc[df2['user_id'].duplicated(), 'user_id']
c. Display the rows for the duplicate user_id.
duplicate_user = df2.query('user_id == 773192')
duplicate_user
d. Remove one of the rows with a duplicate user_id, from the df2 dataframe.
# Remove one of the rows with a duplicate user_id.
# Hint: dataframe.drop_duplicates() may not work here because the rows with
# the duplicate user_id are not entirely identical.
df2.drop(1899, inplace=True)
# Check again if the row with a duplicate user_id is deleted or not
df2.query('user_id == 773192')
ToDo 1.4#
a. What is the probability of an individual converting regardless of the page they receive?
pop_convert_rate = df2.converted.mean()
pop_convert_rate
b. Given that an individual was in the control
group, what is the probability they converted?
df2.query('group == "control"')['converted'].mean()
c. Given that an individual was in the treatment
group, what is the probability they converted?
df2.query('group == "treatment"')['converted'].mean()
# Calculate the actual difference (obs_diff) between the conversion rates for the two groups.
obs_diff = df2.query('group == "treatment"')['converted'].mean() - df2.query('group == "control"')['converted'].mean()
obs_diff
d. What is the probability that an individual received the new page?
df2.query('landing_page == "new_page"')['landing_page'].count()/df2.shape[0]
e. Consider your results from parts (a) through (d) above, and explain below whether the new treatment
group users lead to more conversions.
Based on an initial assessment of the data, there is no evidence to suggest that the new treatment page leads to more conversions; in fact, the control group’s conversion rate is slightly higher than the treatment group’s.
Part II - A/B Test#
Since a timestamp is associated with each event, you could run a hypothesis test continuously as long as you observe the events.
However, then the hard questions would be:
Do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time?
How long do you run to render a decision that neither page is better than another?
These questions are the difficult parts associated with A/B tests in general.
ToDo 2.1#
For now, consider we need to make the decision just based on all the data provided.
Recall that we just calculated that the “converted” probability (or rate) for the old page is slightly higher than that of the new page (ToDo 1.4.c).
If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should be your null and alternative hypotheses (\(H_0\) and \(H_1\))?
You can state your hypothesis in terms of words or in terms of \(p_{old}\) and \(p_{new}\), which are the “converted” probability (or rate) for the old and new pages respectively.
\(H_0\) : \(p_{new}\) - \(p_{old}\) <= 0
\(H_1\) : \(p_{new}\) - \(p_{old}\) > 0
ToDo 2.2 - Null Hypothesis \(H_0\) Testing#
Under the null hypothesis \(H_0\), assume that \(p_{new}\) and \(p_{old}\) are equal. Furthermore, assume that \(p_{new}\) and \(p_{old}\) are both equal to the converted success rate in the df2 data regardless of the page. So, our assumption is:

\(p_{new} = p_{old} = p_{population}\)
In this section, I will:

- Simulate (bootstrap) sample datasets for both groups, using a sample size for each group equal to the ones in the df2 data, and compute the “converted” probability \(p\) for those samples.
- Compute the difference in the “converted” probability for the two samples above.
- Repeat the simulation 10,000 times to build the sampling distribution of the “difference in the converted probability” between the two simulated samples, and use it to calculate an estimate of the p-value.
a. What is the conversion rate for \(p_{new}\) under the null hypothesis?
p_new = pop_convert_rate
p_new
b. What is the conversion rate for \(p_{old}\) under the null hypothesis?
p_old = pop_convert_rate
p_old
c. What is \(n_{new}\), the number of individuals in the treatment group?
n_new = df2.query('group == "treatment"')['user_id'].nunique()
n_new
d. What is \(n_{old}\), the number of individuals in the control group?
n_old = df2.query('group == "control"')['user_id'].nunique()
n_old
e. Simulate Sample for the treatment Group
Simulate \(n_{new}\) transactions with a conversion rate of \(p_{new}\) under the null hypothesis.
# Simulate a Sample for the treatment Group
new_page_converted = np.random.choice([0,1], size=n_new, p=[1-p_new, p_new])
new_page_converted
f. Simulate Sample for the control Group

Simulate \(n_{old}\) transactions with a conversion rate of \(p_{old}\) under the null hypothesis. Store these \(n_{old}\) 1’s and 0’s in the old_page_converted numpy array.
# Simulate a Sample for the control Group
old_page_converted = np.random.choice([0,1], size=n_old, p=[1-p_old, p_old])
old_page_converted
g. Find the difference in the “converted” probability \((p{'}_{new} - p{'}_{old})\) for your simulated samples from parts (e) and (f) above.
p_new2 = new_page_converted.mean()
p_old2 = old_page_converted.mean()
p_new2 - p_old2
h. Sampling distribution

Re-create new_page_converted and old_page_converted and find the \((p{'}_{new} - p{'}_{old})\) value 10,000 times using the same simulation process you used in parts (a) through (g) above. Store all \((p{'}_{new} - p{'}_{old})\) values in a NumPy array called `p_diffs`.
new_page_converted = np.random.binomial(n_new, p_new, 10000)/n_new
old_page_converted = np.random.binomial(n_old, p_old, 10000)/n_old
p_diffs = new_page_converted - old_page_converted
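The vectorized binomial draws above are equivalent to repeating the single-sample simulation from parts (e) through (g) in an explicit loop; slower, but it mirrors the bootstrap logic directly:

# Loop-based equivalent of the vectorized simulation above
p_diffs = []
for _ in range(10000):
    new_sample = np.random.choice([0, 1], size=n_new, p=[1 - p_new, p_new])
    old_sample = np.random.choice([0, 1], size=n_old, p=[1 - p_old, p_old])
    p_diffs.append(new_sample.mean() - old_sample.mean())
# (the next cell converts p_diffs to a NumPy array)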
i. Histogram

Plot a histogram of the p_diffs. Does this plot look like what is expected? Also, use the plt.axvline() method to mark the actual difference observed in the df2 data (recall obs_diff) in the chart.
p_diffs = np.array(p_diffs)
plt.figure(figsize=(10,7))
plt.hist(p_diffs)
plt.title("Difference in Conversion Rates")
plt.ylabel('Frequency')
plt.xlabel('Simulated Differences');
plt.axvline(obs_diff,color='r', linewidth=2);
j. What proportion of the p_diffs are greater than the actual difference observed in the df2 data?
(p_diffs > obs_diff).mean()
k. Explaining the calculated value in j. above.

What is this value called in scientific studies? What does this value signify in terms of whether or not there is a difference between the new and old pages?

The value calculated is called the p-value: the probability of obtaining the observed statistic, or a more extreme value in favour of \(H_1\), if \(H_0\) is true.

The calculated p-value suggests that the observed statistic (obs_diff) is likely under \(H_0\). In other words, the Type I error threshold is 5%, and since our p-value is greater than that threshold, we fail to reject \(H_0\).
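To make the decision rule explicit in code, compare the simulated p-value against the stated 5% threshold:

# Compare the simulated p-value to the Type I error threshold
alpha = 0.05
p_value_sim = (p_diffs > obs_diff).mean()
p_value_sim > alpha  # True -> fail to reject H0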
l. Using Built-in Methods for Hypothesis Testing
We could also use a built-in to achieve similar results. Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance.
Fill in the statements below to calculate the:
- `convert_old`: number of conversions with the old_page
- `convert_new`: number of conversions with the new_page
- `n_old`: number of individuals who were shown the old_page
- `n_new`: number of individuals who were shown the new_page
import statsmodels.api as sm
# number of conversions with the old_page
convert_old = df2.query('group == "control"')['converted'].sum()
# number of conversions with the new_page
convert_new = df2.query('group == "treatment"')['converted'].sum()
# number of individuals who were shown the old_page
n_old = df2.query('landing_page == "old_page"')['landing_page'].count()
# number of individuals who received new_page
n_new = df2.query('landing_page == "new_page"')['landing_page'].count()
m. Now use sm.stats.proportions_ztest() to compute your test statistic and p-value. Here is a helpful link on using the built-in.
The syntax is:
proportions_ztest(count_array, nobs_array, alternative='larger')
where,

- `count_array` = the number of “converted” for each group
- `nobs_array` = the total number of observations (rows) in each group
- `alternative` = one of `['two-sided', 'smaller', 'larger']`, for a two-tailed, left-tailed, or right-tailed test respectively.
Hint:
- It’s a two-tailed test if you defined \(H_1\) as \((p_{new} \neq p_{old})\).
- It’s a left-tailed test if you defined \(H_1\) as \((p_{new} < p_{old})\).
- It’s a right-tailed test if you defined \(H_1\) as \((p_{new} > p_{old})\).
The built-in function above will return the z_score, p_value.
About the two-sample z-test#
Recall that you have plotted a distribution p_diffs
representing the
difference in the “converted” probability \((p{'}_{new}-p{'}_{old})\) for your two simulated samples 10,000 times.
Another way of comparing the means of two independent, normal distributions is the two-sample z-test. You can perform the z-test to calculate the \(Z_{score}\), as shown in the equation below:

$$Z_{score} = \frac{(p{'}_{new} - p{'}_{old}) - (p_{new} - p_{old})}{\sqrt{\frac{\sigma^{2}_{new}}{n_{new}} + \frac{\sigma^{2}_{old}}{n_{old}}}}$$

where,

- \(p{'}\) is the “converted” success rate in the sample
- \(p_{new}\) and \(p_{old}\) are the “converted” success rates for the two groups in the population.
- \(\sigma_{new}\) and \(\sigma_{old}\) are the standard deviations for the two groups in the population.
- \(n_{new}\) and \(n_{old}\) represent the sizes of the two groups or samples (roughly equal in our case).

A z-test is performed when the sample size is large and the population variance is known. The z-score represents the distance between the two “converted” success rates in terms of the standard error.
The next step is to make a decision to reject or fail to reject the null hypothesis based on a comparison of these two values:

- \(Z_{score}\)
- \(Z_{\alpha}\) or \(Z_{0.05}\), also known as the critical value at a 95% confidence level. \(Z_{0.05}\) is 1.645 for one-tailed tests and 1.960 for two-tailed tests. You can determine \(Z_{\alpha}\) from a z-table manually.

Decide whether your hypothesis is a two-tailed, left-tailed, or right-tailed test. Accordingly, reject or fail to reject the null based on the comparison between \(Z_{score}\) and \(Z_{\alpha}\).
Hint:

- For a right-tailed test, reject the null if \(Z_{score}\) > \(Z_{\alpha}\).
- For a left-tailed test, reject the null if \(Z_{score}\) < \(-Z_{\alpha}\).
In other words, we determine whether or not the \(Z_{score}\) lies in the “rejection region” in the distribution. A “rejection region” is an interval where the null hypothesis is rejected iff the \(Z_{score}\) lies in that region.
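Rather than reading \(Z_{\alpha}\) off a z-table, it can also be computed with scipy (a quick check, assuming scipy is installed):

from scipy.stats import norm
# Critical values at alpha = 0.05: ~1.645 (one-tailed), ~1.960 (two-tailed)
norm.ppf(1 - 0.05), norm.ppf(1 - 0.05 / 2)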
Reference:
Example 9.1.2 on this page, courtesy www.stats.libretexts.org
import statsmodels.api as sm
# ToDo: Complete the sm.stats.proportions_ztest() method arguments
count_array = np.array([convert_new, convert_old])
nobs_array = np.array([n_new, n_old])
z_score, p_value = sm.stats.proportions_ztest(count_array, nobs_array, alternative='larger')
print(z_score, p_value)
n. What do the z-score and p-value you computed in the previous question mean for the conversion rates of the old and new pages? Do they agree with the findings in parts j. and k.?
The p-value calculated using the built-in method agrees with the previous approach: it is above the Type I error threshold (\(\alpha\)), so there isn’t significant evidence to reject \(H_0\).

The z-score agrees that we fail to reject \(H_0\), since \(Z_{score}\) < \(Z_{\alpha}\).
Part III - A regression approach#
ToDo 3.1#
In this final part, you will see that the result achieved in the A/B test in Part II above can also be achieved by performing regression.
a. Since each row in the df2
data is either a conversion or no conversion, what type of regression should you be performing in this case?
Logistic Regression.
b. The goal is to use statsmodels library to fit the regression model you specified in part a. above to see if there is a significant difference in conversion based on the page-type a customer receives. However, you first need to create the following two columns in the df2
dataframe:
- `intercept` - It should be `1` across the entire column.
- `ab_page` - A dummy variable column, having the value `1` when an individual receives the treatment, otherwise `0`.
df2['intercept'] = 1
# get_dummies returns columns in alphabetical order ('control', 'treatment'),
# so the second column, 'ab_page', captures the treatment group
df2[['other_page', 'ab_page']] = pd.get_dummies(df2['group'])
df2.head()
c. Use statsmodels to instantiate the regression model on the two columns you created in part (b) above, then fit the model to predict whether or not an individual converts.
logit_mod = sm.Logit(df2['converted'], df2[['intercept', 'ab_page']])
results = logit_mod.fit()
d. Provide the summary of your model below, and use it as necessary to answer the following questions.
results.summary2()
# Interpret the ab_page coefficient (-0.0150) by exponentiating; taking the
# reciprocal gives the odds ratio in favour of the old page (~1.015)
1/np.exp(-0.0150)
e. What is the p-value associated with ab_page? Why does it differ from the value you found in Part II?
The p-value associated with ab_page is 0.1899, which again supports not rejecting \(H_0\).

The \(H_0\) of our logistic regression model (LRM) states that none of the explanatory variables have a statistically significant relationship with the response variable (the “conversion” rate). The \(H_1\) of our LRM states that there is a statistically significant relationship between conversion rates and the ab_page column. In other words, \(H_0: \beta_1 = 0\) and \(H_1: \beta_1 \neq 0\). This makes the regression a two-tailed test, whereas Part II used a one-tailed (right-tailed) test, which is why the p-values differ.

Part II investigated the same outcome as this section, deriving insight from two competing hypotheses based on the difference in the average conversion rates of the two independent groups. In both cases, we fail to reject \(H_0\).
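The two p-values are also numerically consistent: the regression’s two-tailed p-value can be recovered from the Part II z-score (a sketch; assumes z_score is still in scope from the proportions_ztest call above):

from scipy.stats import norm
# Two-tailed p-value implied by the Part II z-score; this should land close
# to the ~0.19 reported for ab_page in the regression summary
2 * (1 - norm.cdf(abs(z_score)))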
f. Now, you are considering other things that might influence whether or not an individual converts. Discuss why it is a good idea to consider other factors to add into your regression model. Are there any disadvantages to adding additional terms into your regression model?
An advantage of using multiple explanatory variables is the ability to determine the relative influence of each variable on the response.

We shouldn’t rely entirely on a single categorical variable, because class-distribution bias (a significant difference in share per category) can distort the result. Considering other explanatory variables, or a combination of them, can help. However, it is important to watch for multicollinearity among the predictor variables, and each additional term makes the model harder to interpret.
g. Adding countries

Now, along with testing whether the conversion rate changes for different pages, also add an effect based on which country a user lives in.

- You will need to read in the countries.csv dataset and merge it with your df2 dataframe on the appropriate rows. Call the resulting dataframe df_merged. Here are the docs for joining tables.
- Does it appear that country had an impact on conversion? To answer this question, consider the three unique values, ['UK', 'US', 'CA'], in the country column. Create dummy variables for these country columns.
# Read the countries.csv
df3 = pd.read_csv('AnalyzeABTestResults 2/countries.csv')
df3.head()
# Join with the df2 dataframe
df_merged = df2.merge(df3, on='user_id')
df_merged.head()
# Create the necessary dummy variables (get_dummies returns the columns
# in alphabetical order: 'CA', 'UK', 'US')
df_merged[['CA', 'UK', 'US']] = pd.get_dummies(df_merged['country'])
df_merged.head()
# CA is the baseline (omitted) category
logit_mod = sm.Logit(df_merged['converted'], df_merged[['intercept', 'US', 'UK']])
results = logit_mod.fit()
results.summary2()
# Make the summary coefficients interpretable by exponentiating
np.exp(0.0408), np.exp(0.0507)

Consequently, holding all other variables constant, users living in the US are about 1.042 times as likely to convert as users living in Canada (the baseline), and users living in the UK are about 1.052 times as likely to convert as users in Canada. These differences are small and, per the p-values in the summary, not statistically significant.
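As flagged in part f., multicollinearity is worth screening for. One quick check on the country model uses variance inflation factors (a sketch using statsmodels’ VIF helper; values near 1 indicate little collinearity):

from statsmodels.stats.outliers_influence import variance_inflation_factor
# Compute VIF for the non-intercept predictors (the intercept stays in the
# design matrix so each auxiliary regression includes a constant)
X = df_merged[['intercept', 'US', 'UK']].astype(float)
[variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]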
h. Fit your model and obtain the results
Though we have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there are significant effects on conversion. Create the necessary additional columns, and fit the new model.
Provide the summary results (statistical output), and your conclusions (written response) based on the results.
# Fit your model, and summarize the results
# Interaction terms: the page effect within each country (CA is the baseline)
df_merged['UK_new'] = df_merged['UK'] * df_merged['ab_page']
df_merged['US_new'] = df_merged['US'] * df_merged['ab_page']
logit_mod = sm.Logit(df_merged['converted'], df_merged[['intercept', 'ab_page', 'US', 'UK', 'UK_new', 'US_new']])
results = logit_mod.fit()
results.summary2()
The introduction of the country explanatory variables supports the previous finding: there isn’t enough statistical significance to reject our null hypothesis.

All p-values in the summary indicate no significance, as they are all higher than the 5% Type I error threshold.
Recommendation#
All evidence suggests that keeping the old webpage is the most profitable decision. The new webpage simply doesn’t outperform the existing one.
from subprocess import call
# Export this notebook (nbconvert's default output format is HTML)
call(['python', '-m', 'nbconvert', 'Analyze_ab_test_results_notebook.ipynb'])