Wrangling and Analyzing Data#


Data Gathering#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import requests
import os
from PIL import Image
from io import BytesIO
import json

weratedogs_archive = pd.read_csv('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv')
weratedogs_archive
  2. Use the Requests library to download the tweet image predictions file (image-predictions.tsv)

folder_name = 'image_prediction'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
file_path = os.path.join(folder_name, url.split('/')[-1])
with open(file_path, mode='wb') as file:
    file.write(response.content)

# read the downloaded file instead of requesting the URL a second time
image_prediction = pd.read_csv(file_path, sep='\t')
image_prediction
  3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)
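The tweet-json.txt file used below was provided with the project, so the API query itself is not re-run here. As a rough sketch, assuming Tweepy v3 and valid API credentials (the placeholder keys are hypothetical), the file could have been gathered like this:

import tweepy

# hypothetical credentials: replace with real Twitter API keys
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# query each tweet in the archive and store its JSON, one tweet per line
with open('tweet-json.txt', 'w') as file:
    for tweet_id in weratedogs_archive.tweet_id:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
            file.write(json.dumps(status._json) + '\n')
        except tweepy.TweepError:
            pass  # tweet deleted or otherwise inaccessible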

# read the line-delimited JSON file of tweet data, one tweet per line
with open('tweet-json.txt', 'r') as file1:
    tweet_json = file1.readlines()

tweets_converted = []
for tweet in tweet_json:
    tweets_converted.append(json.loads(tweet))
# keep only the tweet ID, retweet count, and favorite count
df_list = []
for tweet in tweets_converted:
    tweet_id = int(tweet['id_str'])
    retweets = tweet['retweet_count']
    number_of_likes = tweet['favorite_count']
    # append to the list of dictionaries
    df_list.append({'tweet_id': tweet_id,
                    'retweets': retweets,
                    'number_of_likes': number_of_likes})
# create a DataFrame from the list of dictionaries
tweets = pd.DataFrame(df_list, columns=['tweet_id', 'retweets', 'number_of_likes'])
tweets

Assessing Data#

# programmatic assessment of the archive's structure and data types
weratedogs_archive.info()
# check for fully duplicated rows
weratedogs_archive.duplicated().sum()
# programmatic assessment of the image predictions table
image_prediction.info()
image_prediction.duplicated().sum()
# programmatic assessment of the API data
tweets.info()
tweets.duplicated().sum()
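Beyond .info() and .duplicated(), a couple of targeted spot checks help confirm the name issue listed below. This is a minimal sketch; the exact queries are illustrative additions, not part of the original assessment:

# inspect the most frequent extracted "names" for suspicious values
weratedogs_archive.name.value_counts().head(10)
# all-lowercase "names" are usually misparsed words rather than real dog names
# (the archive stores missing names as the string 'None', so .str.islower() is safe here)
weratedogs_archive[weratedogs_archive.name.str.islower()].name.unique()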

Quality issues#

weratedogs_archive columns:

  • Visual Assessment

  1. name: this column contains unrealistic names such as misparsed words ('a', 'the'), as well as missing and duplicated names; some names are proper case while others are all lowercase.


  2. retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp: these columns have mostly missing entries.


  3. doggo | floofer | pupper | puppo: these columns have many missing entries.


  4. in_reply_to_status_id | in_reply_to_user_id: these columns contain many missing values.


  5. floofer: incorrect column name; 'floofer' is not a dog stage according to the Dogtionary.

  • Programmatic Assessment

  1. expanded_urls: some values are missing from this column.


  2. timestamp: stored with the wrong data type; it should be a datetime data type.


  3. tweet_id: stored as an integer, which is not ideal; since these values are identifiers and are never used in calculations, it is best practice to convert them to strings.


Tidiness issues#

weratedogs_archive columns:

  1. doggo | floofer | pupper | puppo: these columns violate the first rule of tidiness, that each variable forms a column. They all belong under one variable: stage.


  2. retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp: these columns are not needed for the current analysis, as the majority of their values are missing (see the check below).
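To make the second point concrete, a quick count of non-null entries in the retweet columns (an illustrative check, not part of the original notebook):

# count how many archive rows actually carry retweet metadata
weratedogs_archive[['retweeted_status_id', 'retweeted_status_user_id',
                    'retweeted_status_timestamp']].notna().sum()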

Cleaning Data#

# Make copies of original pieces of data
weratedogs_archive_clean = weratedogs_archive.copy()
tweets_clean = tweets.copy()
image_prediction_clean = image_prediction.copy()


Tidiness:#

Define: I will be getting rid of the unwanted columns by using the .drop() method.

Code

# drop the unwanted columns from the `weratedogs_archive_clean` dataframe

weratedogs_archive_clean = weratedogs_archive_clean.drop(['retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp',
                                                          'expanded_urls','floofer','in_reply_to_status_id','in_reply_to_user_id' ], axis=1)

Test

#checking if all unwanted columns have been successfully removed from the 'weratedogs_archive_clean' dataframe
weratedogs_archive_clean.info()


Re-Gathering:#

Define: I will be re-gathering the dog stage, rating_numerator, rating_denominator, and name columns from the tweet text using the .str.extract() method.

Code

# re-extract the dog stage keywords directly from the tweet text
weratedogs_archive_clean['doggo'] = weratedogs_archive_clean.text.str.extract(r'(doggo)', expand=False)
weratedogs_archive_clean['pupper'] = weratedogs_archive_clean.text.str.extract(r'(pupper)', expand=False)
weratedogs_archive_clean['puppo'] = weratedogs_archive_clean.text.str.extract(r'(puppo)', expand=False)
weratedogs_archive_clean['blep'] = weratedogs_archive_clean.text.str.extract(r'(blep)', expand=False)
weratedogs_archive_clean['floof'] = weratedogs_archive_clean.text.str.extract(r'(floof)', expand=False)
weratedogs_archive_clean['snoot'] = weratedogs_archive_clean.text.str.extract(r'(snoot)', expand=False)
# a rating looks like "13/10" or "9.75/10": capture the numerator and denominator
# separately, each with a single capture group so that a Series comes back
weratedogs_archive_clean['rating_numerator'] = weratedogs_archive_clean.text.str.extract(r'(\d+\.?\d*)/\d+', expand=False)
weratedogs_archive_clean['rating_denominator'] = weratedogs_archive_clean.text.str.extract(r'\d+\.?\d*(/\d+)', expand=False)

# names are introduced as "This is Name."; capture the "is Name." fragment for now
weratedogs_archive_clean['name'] = weratedogs_archive_clean.text.str.extract(r'(is\s[A-Z][a-z]+\.)', expand=False)

Test

weratedogs_archive_clean
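As an aside, the two rating patterns can be sanity-checked on a couple of hypothetical texts (these samples are illustrative, not from the dataset):

# hypothetical sample texts to sanity-check the rating patterns
samples = pd.Series(["This is Bella. 13.5/10 would pet",
                     "This is Sam. 11/10 good boy"])
samples.str.extract(r'(\d+\.?\d*)/\d+', expand=False)   # numerators: 13.5, 11
samples.str.extract(r'\d+\.?\d*(/\d+)', expand=False)   # denominators: /10, /10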


Quality:#

Define: Cleaning the re-gathered name and rating_denominator columns by stripping their surrounding characters.

Code

# strip the leading "is " and trailing "." from names, and the leading "/" from denominators
weratedogs_archive_clean.name = weratedogs_archive_clean.name.str[3:-1]
weratedogs_archive_clean.rating_denominator = weratedogs_archive_clean.rating_denominator.str[1:]

Test

weratedogs_archive_clean


Tidiness:#

Define: Merging all dataframes into one using the .merge() method on the tweet_id column.

Code

# left-join the API data and the image predictions onto the archive on tweet_id
weratedogs_archive_clean = pd.merge(weratedogs_archive_clean, tweets_clean, on='tweet_id', how='left')
weratedogs_archive_clean = pd.merge(weratedogs_archive_clean, image_prediction_clean, on='tweet_id', how='left')

Test

#A check to see if the merge was successful
weratedogs_archive_clean.info()


Tidiness:#

Define: doggo, floof, pupper, puppo, blep, and snoot: I will be collapsing these columns into a single stage column using the .melt() method.

Code

# collapse the six stage columns into a single 'stage' column
weratedogs_archive_clean = pd.melt(weratedogs_archive_clean,
                                   id_vars=['tweet_id', 'timestamp', 'source', 'rating_numerator', 'rating_denominator',
                                            'name', 'retweets', 'number_of_likes', 'text',
                                            'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog',
                                            'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
                                   var_name='header', value_name='stage')

# the 'header' column only holds the former column names, so it can be dropped
weratedogs_archive_clean = weratedogs_archive_clean.drop('header', axis=1)

Test

#checking if code implementation was a success
weratedogs_archive_clean


Quality:#

Define: I will be tackling the duplicated rows created by the .melt() call in the previous cell using the .drop_duplicates() and .drop() methods.

Code

weratedogs_archive_clean.drop_duplicates(inplace=True)
# among the remaining duplicated tweet_ids, drop the rows with a null stage and keep the valid entries
duplicated_ids = weratedogs_archive_clean[weratedogs_archive_clean.tweet_id.duplicated(keep=False)]
rows_to_drop = list(duplicated_ids[duplicated_ids.stage.isna()].index)
weratedogs_archive_clean = weratedogs_archive_clean.drop(rows_to_drop)

Test

# this test shows that some duplicates still remain in the dataset
weratedogs_archive_clean.info()


Quality:#

Define: I will be tackling the duplicated tweet_id values, which exposed double entries in the stage column, using the .duplicated() and .drop() methods.

Code

# check whether any duplicated tweet_ids remain
weratedogs_archive_clean[weratedogs_archive_clean.tweet_id.duplicated(keep=False)]
# query and delete tweet_ids that have multiple stage entries, as found above
double_stage_entry = list(weratedogs_archive_clean[weratedogs_archive_clean.tweet_id.duplicated(keep=False)].index)
weratedogs_archive_clean = weratedogs_archive_clean.drop(double_stage_entry)

Test

weratedogs_archive_clean[weratedogs_archive_clean.tweet_id.duplicated(keep=False)]


Quality:#

Define: Finding and removing rows with missing image URLs from the weratedogs_archive_clean dataframe.

Code

# collect the indices of rows without an image URL, then drop them
missing_image_urls = list(weratedogs_archive_clean[weratedogs_archive_clean.jpg_url.isna()].index)
weratedogs_archive_clean = weratedogs_archive_clean.drop(missing_image_urls)

Test

weratedogs_archive_clean[weratedogs_archive_clean.jpg_url.isna()]


Quality:#

Define: Filling in the null values in the retweets and number_of_likes columns using the .fillna() method.

Code

# check the rows with null values in the retweets and number_of_likes columns
weratedogs_archive_clean[weratedogs_archive_clean.number_of_likes.isna()]
# fill the missing values with the dataset's mean retweet and like counts
weratedogs_archive_clean['retweets'] = weratedogs_archive_clean['retweets'].fillna(weratedogs_archive_clean['retweets'].mean())
weratedogs_archive_clean['number_of_likes'] = weratedogs_archive_clean['number_of_likes'].fillna(weratedogs_archive_clean['number_of_likes'].mean())

Test

# check whether any null values are left in the 'retweets' and 'number_of_likes' columns
weratedogs_archive_clean[weratedogs_archive_clean.number_of_likes.isna()].size, weratedogs_archive_clean[weratedogs_archive_clean.retweets.isna()].size


Quality:#

Define: Converting columns with inappropriate data types to the preferred data types.

Code

# convert timestamp to datetime, as flagged in the quality assessment above
weratedogs_archive_clean.timestamp = pd.to_datetime(weratedogs_archive_clean.timestamp)
# normalize casing for names and breed predictions
weratedogs_archive_clean.name = weratedogs_archive_clean.name.str.title()
weratedogs_archive_clean.p1 = weratedogs_archive_clean.p1.str.title()
weratedogs_archive_clean.p2 = weratedogs_archive_clean.p2.str.title()
weratedogs_archive_clean.p3 = weratedogs_archive_clean.p3.str.title()
# cast flags to booleans, counts to integers, ids to strings, and ratings to floats
weratedogs_archive_clean.p1_dog = weratedogs_archive_clean.p1_dog.astype(bool)
weratedogs_archive_clean.p2_dog = weratedogs_archive_clean.p2_dog.astype(bool)
weratedogs_archive_clean.p3_dog = weratedogs_archive_clean.p3_dog.astype(bool)
weratedogs_archive_clean.retweets = weratedogs_archive_clean.retweets.astype(int)
weratedogs_archive_clean.number_of_likes = weratedogs_archive_clean.number_of_likes.astype(int)
weratedogs_archive_clean.img_num = weratedogs_archive_clean.img_num.astype(int)
weratedogs_archive_clean.tweet_id = weratedogs_archive_clean.tweet_id.astype(str)
weratedogs_archive_clean.rating_numerator = weratedogs_archive_clean.rating_numerator.astype(float)
weratedogs_archive_clean.rating_denominator = weratedogs_archive_clean.rating_denominator.astype(float)

Test

weratedogs_archive_clean.info()


Quality:#

Define: Filling in the null values present in the rating_numerator and rating_denominator columns using the .fillna() method.

Code

# check the rows with null values in the rating_numerator column
weratedogs_archive_clean[weratedogs_archive_clean.rating_numerator.isna()]
# check the rows with null values in the rating_denominator column
weratedogs_archive_clean[weratedogs_archive_clean.rating_denominator.isna()]
# fill the missing values with the mean rating_numerator and rating_denominator respectively
weratedogs_archive_clean['rating_numerator'] = weratedogs_archive_clean['rating_numerator'].fillna(weratedogs_archive_clean['rating_numerator'].mean())
weratedogs_archive_clean['rating_denominator'] = weratedogs_archive_clean['rating_denominator'].fillna(weratedogs_archive_clean['rating_denominator'].mean())

Test

weratedogs_archive_clean[weratedogs_archive_clean.rating_numerator.isna()].size, weratedogs_archive_clean[weratedogs_archive_clean.rating_denominator.isna()].size


Quality:#

Define: Removing all rows where the neural network's top prediction (p1_dog) is False, i.e. not a dog.

Code

# collect and drop the rows where the top prediction is not a dog
false_prediction_rows = list(weratedogs_archive_clean.query("p1_dog == False").index)
weratedogs_archive_clean = weratedogs_archive_clean.drop(false_prediction_rows, axis=0)

Test

weratedogs_archive_clean.info()

Storing Data#

Save the gathered, assessed, and cleaned master dataset to a CSV file named “twitter_archive_master.csv”.

weratedogs_archive_clean.to_csv('twitter_archive_master.csv', index=False)

Analyzing and Visualizing Data#

In this section, the wrangled data is analyzed and visualized to produce three insights and one visualization.

df = pd.read_csv('twitter_archive_master.csv')
# top 10 tweets with the highest retweets (the most viral tweets)
df.sort_values(by=['retweets'], ascending=False).head(10)
# top 10 tweets with the highest likes
df.sort_values(by=['number_of_likes'], ascending=False).head(10)
# the top five most frequent dog breeds predicted by the neural network
df.p1.value_counts().head()

Insights:#

  1. The top ten dog rating tweets with the highest retweets (coverage) in the dataset. At the top of the list is a dog in its ‘doggo’ stage.

  2. The top ten most liked dog rating tweets. At the top of the list is a dog in its ‘puppo’ stage (see the check after this list).

  3. The top five most popular dog breeds in the neural network's predictions.
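To tie the first two insights back to the data, the stage behind each chart-topping tweet can be read off directly. This is a small sketch; the column selection is illustrative:

# stage of the most-retweeted tweet and of the most-liked tweet
df.sort_values(by=['retweets'], ascending=False).head(1)[['stage', 'retweets']]
df.sort_values(by=['number_of_likes'], ascending=False).head(1)[['stage', 'number_of_likes']]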

Visualization#

# keep only the rows where a dog stage was identified
stages = df[~(df.stage.isna())]

plt.figure(figsize=(7,7))
plt.hist(stages.stage)
plt.title("Most Popular Dog Stage")
plt.ylabel('Frequency')
plt.xlabel('Stages');
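Because the stages are categories rather than numeric bins, an explicitly ordered bar chart is a reasonable alternative to plt.hist. A sketch using pandas' plotting wrapper:

# bar chart of stage counts, ordered from most to least frequent
stage_counts = stages.stage.value_counts()
stage_counts.plot(kind='bar', figsize=(7, 7), title="Most Popular Dog Stage")
plt.ylabel('Frequency')
plt.xlabel('Stages');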