Wrangle and Analyze Data#
Data Gathering#
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import requests
import os
from PIL import Image
from io import BytesIO
import json
weratedogs_archive = pd.read_csv('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv')
weratedogs_archive
Use the Requests library to download the tweet image predictions file (image-predictions.tsv)
folder_name = 'image_prediction'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
image_prediction = pd.read_csv('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv',sep='\t')
image_prediction
Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)
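For reference, querying the API directly would look roughly like the sketch below (a minimal Tweepy v3-style sketch; the placeholder credentials are illustrative, and error handling is reduced to skipping unavailable tweets). In this notebook I read the pre-collected tweet-json.txt file instead.
import tweepy

# Placeholder credentials -- replace with real API keys
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Write one JSON object per line, mirroring the structure of tweet-json.txt
with open('tweet-json.txt', 'w') as outfile:
    for tweet_id in weratedogs_archive.tweet_id:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(status._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError:
            pass  # skip deleted or protected tweets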
# Read the line-delimited JSON file of tweet data collected from the API
with open('tweet-json.txt', 'r') as file:
    tweet_json = file.readlines()
tweets_converted = []
for tweet in tweet_json:
    tweets_converted.append(json.loads(tweet))
# Extract the tweet ID, retweet count, and favorite count from each tweet
df_list = []
for tweet in tweets_converted:
    tweet_id = int(tweet['id_str'])
    retweets = tweet['retweet_count']
    number_of_likes = tweet['favorite_count']
    # Append to list of dictionaries
    df_list.append({'tweet_id': tweet_id,
                    'retweets': retweets,
                    'number_of_likes': number_of_likes})
# Create DataFrame from list of dictionaries
tweets = pd.DataFrame(df_list, columns = ['tweet_id', 'retweets', 'number_of_likes'])
tweets
Assessing Data#
#Programmatic Assessment
weratedogs_archive.info()
#Programmatic Assessment
weratedogs_archive.duplicated().sum()
#Programmatic Assessment
image_prediction.info()
#Programmatic Assessment
image_prediction.duplicated().sum()
#Programmatic Assessment
tweets.info()
tweets.duplicated().sum()
Quality issues#
weratedogs_archive
columns:
Visual Assessment
name: this column contains unrealistic values such as lowercase articles ('a', 'the'), missing names, and duplicated names, with inconsistent casing (some proper case, some all lowercase), as observed via visual assessment.
retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp: most entries in these columns are missing.
doggo | floofer | pupper | puppo: many entries in these columns are missing.
in_reply_to_status_id | in_reply_to_user_id: these columns also have missing values.
floofer: column name error. According to the Dogtionary, the stage is 'floof', not 'floofer'.
Programmatic Assessment
expanded_urls: programmatic assessment shows some values are missing from this column.
timestamp: this column has the wrong data type; it should be a datetime type.
tweet_id: this column's data type is not preferred. Since the values will not be used in any calculations, best practice is to convert them to strings (see the checks after this list).
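A few quick checks that surface these issues (a sketch; it assumes the name column has no true NaNs, since the Udacity-provided archive stores missing names as the string 'None'):
# Names that are all lowercase are usually misparsed words like 'a' or 'the'
weratedogs_archive[weratedogs_archive.name.str.islower()].name.value_counts().head()
# Confirm the timestamp and tweet_id dtypes flagged above
weratedogs_archive[['timestamp', 'tweet_id']].dtypes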
Tidiness issues#
weratedogs_archive
columns:
doggo | floofer | pupper | puppo: these columns violate the first rule of tidy data: each variable forms a column. They all belong under a single variable: stage (see the quick check after this list).
retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp: these columns are not needed for the current analysis, as the majority of their values are missing.
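A quick way to see the structural problem (a sketch, assuming the four stage columns store the string 'None' when no stage applies, as in the provided archive):
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
# Count how many stage flags are set per tweet; any value above 1 confirms
# that one conceptual variable is spread across four columns
(weratedogs_archive[stage_cols] != 'None').sum(axis=1).value_counts()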
Cleaning Data#
# Make copies of original pieces of data
weratedogs_archive_clean = weratedogs_archive.copy()
tweets_clean = tweets.copy()
image_prediction_clean = image_prediction.copy()
Tidiness:#
Define: I will be getting rid of the unwanted columns using the .drop() method.
Code
#This code drops the unwanted columns from the `weratedogs_archive_clean` dataframe
weratedogs_archive_clean = weratedogs_archive_clean.drop(['retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp',
'expanded_urls','floofer','in_reply_to_status_id','in_reply_to_user_id' ], axis=1)
Test
#checking if all unwanted columns have been successfully removed from the 'weratedogs_archive_clean' dataframe
weratedogs_archive_clean.info()
Re-Gathering:#
Define: I will be re-gathering the dog stages, rating_numerator, rating_denominator and name columns from the tweet text using the .extract() method.
Code
# Single capture groups with expand=False so .extract() returns a Series,
# not a DataFrame (assigning a multi-column DataFrame to one column fails)
weratedogs_archive_clean['doggo'] = weratedogs_archive_clean.text.str.extract(r'(doggo)', expand=False)
weratedogs_archive_clean['pupper'] = weratedogs_archive_clean.text.str.extract(r'(pupper)', expand=False)
weratedogs_archive_clean['puppo'] = weratedogs_archive_clean.text.str.extract(r'(puppo)', expand=False)
weratedogs_archive_clean['blep'] = weratedogs_archive_clean.text.str.extract(r'(blep)', expand=False)
weratedogs_archive_clean['floof'] = weratedogs_archive_clean.text.str.extract(r'(floof)', expand=False)
weratedogs_archive_clean['snoot'] = weratedogs_archive_clean.text.str.extract(r'(snoot)', expand=False)
# Allow single-digit and decimal numerators such as 9/10 or 9.75/10
weratedogs_archive_clean['rating_numerator'] = weratedogs_archive_clean.text.str.extract(r'(\d+\.?\d*)/\d+', expand=False)
weratedogs_archive_clean['rating_denominator'] = weratedogs_archive_clean.text.str.extract(r'(/\d+)', expand=False)
weratedogs_archive_clean['name'] = weratedogs_archive_clean.text.str.extract(r'(is\s[A-Z][a-z]+\.)', expand=False)
Test
weratedogs_archive_clean
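A quick count of how many values each extraction recovered (a sketch to sanity-check the regexes before cleaning further):
# Non-null counts per re-gathered column
weratedogs_archive_clean[['doggo', 'pupper', 'puppo', 'blep', 'floof', 'snoot',
                          'rating_numerator', 'rating_denominator', 'name']].notna().sum()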
Quality:#
Define: Cleaning the data re-gathered for the name and rating_denominator columns.
Code
#strip the leading 'is ' and trailing '.' from name, and the leading '/' from rating_denominator
weratedogs_archive_clean.name = weratedogs_archive_clean.name.str[3:-1]
weratedogs_archive_clean.rating_denominator = weratedogs_archive_clean.rating_denominator.str[1:]
Test
weratedogs_archive_clean
Quality:#
Define: Merging all dataframes into one using the .merge() method on the tweet_id column.
Code
weratedogs_archive_clean = pd.merge(weratedogs_archive_clean,tweets_clean, on = 'tweet_id', how = 'left')
weratedogs_archive_clean = pd.merge(weratedogs_archive_clean,image_prediction_clean, on = 'tweet_id', how = 'left')
Test
#A check to see if the merge was successful
weratedogs_archive_clean.info()
Tidiness:#
Define: I will be collapsing the doggo, floof, pupper, puppo, blep and snoot columns into a single stage column using the .melt() method.
Code
#collapsing the stage variable columns to effect proper structure
weratedogs_archive_clean = pd.melt(weratedogs_archive_clean, id_vars=['tweet_id','timestamp','source','rating_numerator','rating_denominator',
'name','retweets','number_of_likes','text',
'jpg_url','img_num','p1','p1_conf','p1_dog','p2','p2_conf','p2_dog','p3','p3_conf','p3_dog'],
                                     var_name='header', value_name='stage')
weratedogs_archive_clean = weratedogs_archive_clean.drop('header', axis=1)
Test
#checking if code implementation was a success
weratedogs_archive_clean
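Note that .melt() repeats every tweet once per stage column, so the frame temporarily grows to six rows per tweet. The toy example below illustrates the effect (illustrative data only):
# Toy illustration of what .melt() does to the stage columns
toy = pd.DataFrame({'tweet_id': [1, 2], 'doggo': ['doggo', None], 'pupper': [None, 'pupper']})
pd.melt(toy, id_vars=['tweet_id'], var_name='header', value_name='stage')
# tweet 1 and tweet 2 each now appear twice: once per original stage column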
Quality:#
Define: I will be tackling the duplicated rows created by the .melt() method in the previous cell, using the .drop_duplicates() and .drop() methods.
Code
weratedogs_archive_clean.drop_duplicates(inplace = True)
#among duplicated tweet_ids, drop the rows with null stage values and keep those with valid entries
mask_null_stage = weratedogs_archive_clean[weratedogs_archive_clean.tweet_id.duplicated(keep = False)]
rows_to_drop = list(mask_null_stage[mask_null_stage.stage.isna()].index)
weratedogs_archive_clean = weratedogs_archive_clean.drop(rows_to_drop)
Test
#this test shows that we still have some form of duplicates hiding in our dataset
weratedogs_archive_clean.info()
Quality:#
Define: I will be tackling the duplicated tweet_id values, which expose double entries in the stage column, using the .duplicated() and .drop() methods.
Code
#checking if we still have duplicated tweet_ids
weratedogs_archive_clean[weratedogs_archive_clean.tweet_id.duplicated(keep= False)]
#querying and deleting tweet_ids that have multiple stage entries as found in the cell above
double_stage_entry = list(weratedogs_archive_clean[weratedogs_archive_clean.tweet_id.duplicated(keep= False)].index)
weratedogs_archive_clean = weratedogs_archive_clean.drop(double_stage_entry)
Test
weratedogs_archive_clean[weratedogs_archive_clean.tweet_id.duplicated(keep= False)]
Quality:#
Define: Finding and removing rows with missing image URLs from the weratedogs_archive_clean dataframe.
Code
missing_image_urls = list(weratedogs_archive_clean[weratedogs_archive_clean.jpg_url.isna()].index)
weratedogs_archive_clean = weratedogs_archive_clean.drop(missing_image_urls)
Test
weratedogs_archive_clean[weratedogs_archive_clean.jpg_url.isna()]
Quality:#
Define: Filling in the null values present in the retweets and number_of_likes columns using the .fillna() method.
Code
#checking the rows with null values in the retweets and number_of_likes columns
weratedogs_archive_clean[weratedogs_archive_clean.number_of_likes.isna()]
#filling the missing values with the average number of retweets and likes in the dataset
weratedogs_archive_clean['retweets'] = weratedogs_archive_clean['retweets'].fillna((weratedogs_archive_clean['retweets'].mean()))
weratedogs_archive_clean['number_of_likes'] = weratedogs_archive_clean['number_of_likes'].fillna((weratedogs_archive_clean['number_of_likes'].mean()))
Test
#checking if there are any null values left in the 'retweets' and 'number_of_likes' columns
weratedogs_archive_clean[weratedogs_archive_clean.number_of_likes.isna()].size, weratedogs_archive_clean[weratedogs_archive_clean.retweets.isna()].size
Quality:#
Define: Converting all inappropriate data types to preferred data types.
Code
weratedogs_archive_clean.name = weratedogs_archive_clean.name.str.title()
weratedogs_archive_clean.p1_dog = weratedogs_archive_clean.p1_dog.astype(bool)
weratedogs_archive_clean.p2_dog = weratedogs_archive_clean.p2_dog.astype(bool)
weratedogs_archive_clean.p3_dog = weratedogs_archive_clean.p3_dog.astype(bool)
weratedogs_archive_clean.retweets = weratedogs_archive_clean.retweets.astype(int)
weratedogs_archive_clean.number_of_likes = weratedogs_archive_clean.number_of_likes.astype(int)
weratedogs_archive_clean.img_num = weratedogs_archive_clean.img_num.astype(int)
weratedogs_archive_clean.tweet_id = weratedogs_archive_clean.tweet_id.astype(str)
weratedogs_archive_clean.p1 = weratedogs_archive_clean.p1.str.title()
weratedogs_archive_clean.p2 = weratedogs_archive_clean.p2.str.title()
weratedogs_archive_clean.p3 = weratedogs_archive_clean.p3.str.title()
weratedogs_archive_clean.rating_numerator = weratedogs_archive_clean.rating_numerator.astype(float)
weratedogs_archive_clean.rating_denominator = weratedogs_archive_clean.rating_denominator.astype(float)
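The assessment also flagged timestamp as needing a datetime data type; one way to address it here (a one-line sketch, assuming the column's string format is parseable by pandas, which it is for the provided archive):
# Convert the timestamp strings to proper datetime objects
weratedogs_archive_clean.timestamp = pd.to_datetime(weratedogs_archive_clean.timestamp)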
Test
weratedogs_archive_clean.info()
Quality:#
Define: Filling in the null values present in the rating_numerator and rating_denominator columns using the .fillna() method.
Code
#checking the rows with null values in the rating_numerator column
weratedogs_archive_clean[weratedogs_archive_clean.rating_numerator.isna()]
#checking the rows with null values in the rating_denominator column
weratedogs_archive_clean[weratedogs_archive_clean.rating_denominator.isna()]
#filling the missing values with the average rating_numerator and rating_denominator respectively
weratedogs_archive_clean['rating_numerator'] = weratedogs_archive_clean['rating_numerator'].fillna(weratedogs_archive_clean['rating_numerator'].mean())
weratedogs_archive_clean['rating_denominator'] = weratedogs_archive_clean['rating_denominator'].fillna(weratedogs_archive_clean['rating_denominator'].mean())
Test
weratedogs_archive_clean[weratedogs_archive_clean.rating_numerator.isna()].size,weratedogs_archive_clean[weratedogs_archive_clean.rating_denominator.isna()].size
Tidiness:#
Define: Removing all rows with 'False' dog predictions by the neural network.
Code
false_prediction_rows = list(weratedogs_archive_clean.query("p1_dog == False").index)
weratedogs_archive_clean = weratedogs_archive_clean.drop(false_prediction_rows, axis=0)
Test
weratedogs_archive_clean.info()
Storing Data#
Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
weratedogs_archive_clean.to_csv(r'twitter_archive_master.csv', index=False)
Analyzing and Visualizing Data#
In this section, I analyze and visualize the wrangled data, producing at least three (3) insights and one (1) visualization.
df = pd.read_csv('twitter_archive_master.csv')
#The tweets with the highest retweet counts in the dataframe
df.sort_values(by=['retweets'], ascending = False).head(12)
# Top 10 Tweets with the highest likes
df.sort_values(by=['number_of_likes'], ascending = False).head(10)
#The top five most frequent dog breeds predicted by the neural network
df.p1.value_counts().head()
Insights:#
The top ten dog-rating tweets with the highest retweets (reach) in the dataset. At the top of the chart is a dog in its 'doggo' stage.
The top ten most-liked dog-rating tweets. At the top of the chart is a dog in its 'puppo' stage.
The top five most popular dog breeds in the neural network's predictions.
Visualization#
stages = df[~(df.stage.isna())]
plt.figure(figsize=(7,7))
plt.hist(stages.stage)
plt.title("Most Popular Dog Stage")
plt.ylabel('Frequency')
plt.xlabel('Stages');
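plt.hist works on a categorical column, but it does not order the bars; a frequency-sorted bar chart is an alternative (a sketch using the same stages frame defined above):
# Bar chart ordered by frequency, as an alternative to the histogram above
stage_counts = stages.stage.value_counts()
plt.figure(figsize=(7, 7))
stage_counts.plot(kind='bar')
plt.title("Most Popular Dog Stage")
plt.ylabel('Frequency')
plt.xlabel('Stages');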