<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>NLP on Laminar Insight</title>
    <link>https://laminarinsight.com/tags/nlp/</link>
    <description>Recent content in NLP on Laminar Insight</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Tue, 30 Mar 2021 00:00:00 +0000</lastBuildDate>
    
        <atom:link href="https://laminarinsight.com/tags/nlp/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Text Analysis on Youtube Videos Posted About ‘Bangladesh’</title>
      <link>https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/</link>
      <pubDate>Tue, 30 Mar 2021 00:00:00 +0000</pubDate>
      
      <guid>https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/</guid>
      <description>
&lt;script src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#part-01-background-about-this-project&#34;&gt;Part 01: Background about this project&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#part-02-data-collection&#34;&gt;Part 02: Data collection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#part-03-data-cleaning&#34;&gt;Part 03: Data cleaning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#part-04-data-analysis&#34;&gt;Part 04: Data analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#part-05-ending&#34;&gt;Part 05: Ending&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div id=&#34;part-01-background-about-this-project&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Part 01: Background about this project&lt;/h2&gt;
&lt;p&gt;This analysis is part of an ongoing exploratory study about content related to &lt;a href=&#34;https://en.wikipedia.org/wiki/Bangladesh&#34;&gt;Bangladesh&lt;/a&gt; on different online social media platforms. In the last &lt;a href=&#34;https://rpubs.com/arafath/twitter_analysis&#34;&gt;article&lt;/a&gt;, tweets containing &lt;em&gt;Bangladesh&lt;/em&gt; were analyzed to understand the most common topics people tweeted about regarding Bangladesh and how public sentiment was reflected. In this article a similar study will be conducted on videos shared on YouTube that have &lt;em&gt;Bangladesh&lt;/em&gt; in their titles. The data will be collected using YouTube’s publicly available API. Along with the video stats, comments posted by viewers will also be fetched and analyzed.&lt;/p&gt;
&lt;p&gt;The analysis is segmented broadly into two phases:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Extracting insights from the overall statistics of the videos and&lt;/li&gt;
&lt;li&gt;Extracting insights from the unstructured data (e.g. comments) by applying different text analytics techniques.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Some of the key goals of this analysis are to see:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What kind of videos are posted?&lt;/li&gt;
&lt;li&gt;Is there any trend in the number of videos posted over time?&lt;/li&gt;
&lt;li&gt;What kinds of videos are most liked and disliked?&lt;/li&gt;
&lt;li&gt;Which videos get the most traction with their viewers?&lt;/li&gt;
&lt;li&gt;Which topics are most videos posted about?&lt;/li&gt;
&lt;li&gt;Does the trend of video posting vary across different types of videos?&lt;/li&gt;
&lt;li&gt;How do people react or express their sentiment in the comment section?&lt;/li&gt;
&lt;li&gt;Does sentiment change over time?&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;part-02-data-collection&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Part 02: Data collection&lt;/h2&gt;
&lt;p&gt;The required data for this analysis has been scraped using the public API provided by YouTube. The time frame considered is 2017-2018. Scraping YouTube produced a pool of 585 unique videos that have &lt;em&gt;Bangladesh&lt;/em&gt; mentioned in their titles, posted from a total of 377 different channels. An easy-to-follow, step-by-step walkthrough of how to connect to YouTube using the &lt;strong&gt;tuber&lt;/strong&gt; package can be found at this link: &lt;a href=&#34;https://www.youtube.com/watch?v=NEh5N3OZCXc&#34; class=&#34;uri&#34;&gt;https://www.youtube.com/watch?v=NEh5N3OZCXc&lt;/a&gt;, and the code used to scrape the data for this project can be seen by un-hiding the following code chunk.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# required packages
library(tuber) # YouTube API wrapper: yt_oauth(), yt_search(), get_stats()
library(plyr)  # ldply()
library(dplyr) # %&amp;gt;% and left_join()

# credentials 
app_id = &amp;#39;Your_app_id&amp;#39;
app_secret = &amp;#39;your_app_secret&amp;#39;
# establishing a connection with YouTube
yt_oauth(app_id = app_id, app_secret = app_secret)

# ---fetching YouTube videos with &amp;#39;bangladesh&amp;#39; in the title---
# searching for videos that have &amp;#39;Bangladesh&amp;#39; in the title
videos_year &amp;lt;- yt_search(&amp;quot;Bangladesh&amp;quot;, published_after = &amp;quot;2017-06-01T00:00:00Z&amp;quot;, published_before = &amp;quot;2018-01-01T00:00:00Z&amp;quot;)
# fetching video statistics for all videos (2017-2018)
videostats = lapply(as.character(videos_year$video_id), function(x){
  get_stats(video_id = x)
})
df = ldply(videostats, data.frame)

# merging videos stats with the main file:  videos_year
colnames(df)[1] = &amp;#39;video_id&amp;#39; # renaming &amp;#39;id&amp;#39; as video_id so that it matches the same column in the main table
videos_year = videos_year %&amp;gt;% left_join(df, by = &amp;#39;video_id&amp;#39;)

# correcting data type
videos_year[,c(&amp;#39;viewCount&amp;#39;, &amp;#39;likeCount&amp;#39;, &amp;#39;dislikeCount&amp;#39;, &amp;#39;favoriteCount&amp;#39;, &amp;#39;commentCount&amp;#39;)]=apply(videos_year[,c(&amp;#39;viewCount&amp;#39;, &amp;#39;likeCount&amp;#39;, &amp;#39;dislikeCount&amp;#39;, &amp;#39;favoriteCount&amp;#39;, &amp;#39;commentCount&amp;#39;)],2,as.numeric)
# write.csv(videos_year,&amp;#39;youtube_video_raw+stat_2017-2018.csv&amp;#39;, row.names = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;part-03-data-cleaning&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Part 03: Data cleaning&lt;/h2&gt;
&lt;p&gt;In the last part we collected video-specific statistics and compiled everything into a single file named &lt;strong&gt;videos_year&lt;/strong&gt;. A look into the file reveals that it contains 589 observations/rows and 21 variables/columns; in other words, we have 21 different attributes collected for each video. To inspect the variables, some basic summary code has been run and the results are shown below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;videos_year = read.csv(&amp;#39;../../../source_files/youtube_video_raw+stat_2017-2018.csv&amp;#39;)
dim(videos_year)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 589  21&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names(videos_year)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;video_id&amp;quot;                  &amp;quot;publishedAt&amp;quot;              
##  [3] &amp;quot;channelId&amp;quot;                 &amp;quot;title&amp;quot;                    
##  [5] &amp;quot;description&amp;quot;               &amp;quot;thumbnails.default.url&amp;quot;   
##  [7] &amp;quot;thumbnails.default.width&amp;quot;  &amp;quot;thumbnails.default.height&amp;quot;
##  [9] &amp;quot;thumbnails.medium.url&amp;quot;     &amp;quot;thumbnails.medium.width&amp;quot;  
## [11] &amp;quot;thumbnails.medium.height&amp;quot;  &amp;quot;thumbnails.high.url&amp;quot;      
## [13] &amp;quot;thumbnails.high.width&amp;quot;     &amp;quot;thumbnails.high.height&amp;quot;   
## [15] &amp;quot;channelTitle&amp;quot;              &amp;quot;liveBroadcastContent&amp;quot;     
## [17] &amp;quot;viewCount&amp;quot;                 &amp;quot;likeCount&amp;quot;                
## [19] &amp;quot;dislikeCount&amp;quot;              &amp;quot;favoriteCount&amp;quot;            
## [21] &amp;quot;commentCount&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From the variable names we can see that there are variables (e.g. the thumbnail-related data) that are outside the scope of this analysis. We can take a look at the summary statistics of the other variables to get an idea of the necessary data cleaning.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(videos_year[c(&amp;#39;publishedAt&amp;#39;,&amp;#39;channelId&amp;#39;,&amp;#39;title&amp;#39;,&amp;#39;description&amp;#39;,&amp;#39;channelTitle&amp;#39;,
                  &amp;#39;liveBroadcastContent&amp;#39;,&amp;#39;viewCount&amp;#39;,&amp;#39;likeCount&amp;#39;,&amp;#39;dislikeCount&amp;#39;, &amp;#39;favoriteCount&amp;#39;,&amp;#39;commentCount&amp;#39;)])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;#39;data.frame&amp;#39;:    589 obs. of  11 variables:
##  $ publishedAt         : chr  &amp;quot;2017-07-27T22:21:56.000Z&amp;quot; &amp;quot;2017-10-27T15:37:33.000Z&amp;quot; &amp;quot;2017-11-22T14:59:50.000Z&amp;quot; &amp;quot;2017-07-04T06:18:32.000Z&amp;quot; ...
##  $ channelId           : chr  &amp;quot;UCNye-wNBqNL5ZzHSJj3l8Bg&amp;quot; &amp;quot;UCXulruMI7BHj3kGyosNa0jA&amp;quot; &amp;quot;UCXulruMI7BHj3kGyosNa0jA&amp;quot; &amp;quot;UCqlc8Q5Rixjp_zTePTI_mRg&amp;quot; ...
##  $ title               : chr  &amp;quot;&amp;lt;f0&amp;gt;&amp;lt;U+009F&amp;gt;&amp;lt;U+0087&amp;gt;&amp;lt;U+00A7&amp;gt;&amp;lt;f0&amp;gt;&amp;lt;U+009F&amp;gt;&amp;lt;U+0087&amp;gt;&amp;lt;U+00A9&amp;gt; Bangladesh&amp;#39;s Biggest Brothel | 101 East | &amp;lt;U+092C&amp;gt;&amp;lt;U+0&amp;quot;| __truncated__ &amp;quot;HELLO BANGLADESH. DHAKA IS CRAZY.&amp;quot; &amp;quot;HOW EXPENSIVE IS BANGLADESH? &amp;lt;f0&amp;gt;&amp;lt;U+009F&amp;gt;&amp;lt;U+0092&amp;gt;&amp;lt;U+00B0&amp;gt;&amp;lt;f0&amp;gt;&amp;lt;U+009F&amp;gt;&amp;lt;U+0087&amp;gt;&amp;lt;U+00A7&amp;gt;&amp;lt;f0&amp;gt;&amp;lt;U+009F&amp;gt;&amp;lt;U+0087&amp;gt;&amp;lt;U+00A9&amp;gt;&amp;quot; &amp;quot;Bangladesh Biman With  CG For Sefty&amp;quot; ...
##  $ description         : chr  &amp;quot;The biggest brothel in Bangladesh - and possibly the world. The town of Daulatdia is home to 1500 prostitutes, &amp;quot;| __truncated__ &amp;quot;VLOG #146. Let me know your thoughts in the comments section PATREON: https://www.patreon.com/indigotraveller o&amp;quot;| __truncated__ &amp;quot;INSTAGRAM - https://www.goo.gl/hvrnHZ FACEBOOK - https://www.goo.gl/98tqkZ VLOG #156. Let me know your thoughts&amp;quot;| __truncated__ &amp;quot;&amp;quot; ...
##  $ channelTitle        : chr  &amp;quot;Al Jazeera English&amp;quot; &amp;quot;Indigo Traveller&amp;quot; &amp;quot;Indigo Traveller&amp;quot; &amp;quot;Bhaishob Media&amp;quot; ...
##  $ liveBroadcastContent: chr  &amp;quot;none&amp;quot; &amp;quot;none&amp;quot; &amp;quot;none&amp;quot; &amp;quot;none&amp;quot; ...
##  $ viewCount           : int  10456088 297082 400570 2867726 1098161 745213 901613 586726 17040399 249231 ...
##  $ likeCount           : int  21569 6862 12307 13686 7964 16885 6824 2551 67524 2696 ...
##  $ dislikeCount        : int  5862 463 361 1553 732 521 747 560 10818 482 ...
##  $ favoriteCount       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ commentCount        : int  5478 3530 3446 432 342 4243 2024 133 2978 392 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at the summary statistics of the other variables, we can identify two issues:
1. The variable &lt;em&gt;publishedAt&lt;/em&gt; contains date data stored as character strings, which need to be converted to the date format for the convenience of future analysis.
2. Variables containing text data (&lt;em&gt;title&lt;/em&gt; and &lt;em&gt;description&lt;/em&gt;) need text cleaning, since they contain HTML codes that add no value in terms of generating meaningful insight.
In the next three code chunks, these two issues will be fixed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# converting into data type &amp;#39;date&amp;#39;
videos_year$publishedAt = as.Date(videos_year$publishedAt)
str(videos_year$publishedAt)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  Date[1:589], format: &amp;quot;2017-07-27&amp;quot; &amp;quot;2017-10-27&amp;quot; &amp;quot;2017-11-22&amp;quot; &amp;quot;2017-07-04&amp;quot; &amp;quot;2017-08-24&amp;quot; ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# creating new variables &amp;#39;year&amp;#39; and &amp;#39;month&amp;#39;
videos_year = videos_year %&amp;gt;% mutate(month = month(publishedAt)) %&amp;gt;% mutate(year = year(publishedAt))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From the above results we can see that the data type of the variable named &lt;em&gt;publishedAt&lt;/em&gt; has been converted to ‘date’. Two new variables named &lt;em&gt;year&lt;/em&gt; and &lt;em&gt;month&lt;/em&gt; have been created from the &lt;em&gt;publishedAt&lt;/em&gt; variable for use in future analysis. Now we will move on to cleaning the text values in the &lt;em&gt;title&lt;/em&gt; variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(videos_year$title,5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;&amp;lt;f0&amp;gt;&amp;lt;U+009F&amp;gt;&amp;lt;U+0087&amp;gt;&amp;lt;U+00A7&amp;gt;&amp;lt;f0&amp;gt;&amp;lt;U+009F&amp;gt;&amp;lt;U+0087&amp;gt;&amp;lt;U+00A9&amp;gt; Bangladesh&amp;#39;s Biggest Brothel | 101 East | &amp;lt;U+092C&amp;gt;&amp;lt;U+093E&amp;gt;&amp;lt;U+0902&amp;gt;&amp;lt;U+0917&amp;gt;&amp;lt;U+094D&amp;gt;&amp;lt;U+0932&amp;gt;&amp;lt;U+093E&amp;gt;&amp;lt;U+0926&amp;gt;&amp;lt;U+0947&amp;gt;&amp;lt;U+0936&amp;gt; &amp;lt;U+0915&amp;gt;&amp;lt;U+0940&amp;gt; &amp;lt;U+0938&amp;gt;&amp;lt;U+092C&amp;gt;&amp;lt;U+0938&amp;gt;&amp;lt;U+0947&amp;gt; &amp;lt;U+092C&amp;gt;&amp;lt;U+0921&amp;gt;&amp;lt;U+093C&amp;gt;&amp;lt;U+0940&amp;gt; &amp;lt;U+0935&amp;gt;&amp;lt;U+0947&amp;gt;&amp;lt;U+0936&amp;gt;&amp;lt;U+094D&amp;gt;&amp;lt;U+092F&amp;gt;&amp;lt;U+093E&amp;gt;&amp;lt;U+0932&amp;gt;&amp;lt;U+092F&amp;gt;&amp;quot;
## [2] &amp;quot;HELLO BANGLADESH. DHAKA IS CRAZY.&amp;quot;                                                                                                                                                                                                                                                                                                      
## [3] &amp;quot;HOW EXPENSIVE IS BANGLADESH? &amp;lt;f0&amp;gt;&amp;lt;U+009F&amp;gt;&amp;lt;U+0092&amp;gt;&amp;lt;U+00B0&amp;gt;&amp;lt;f0&amp;gt;&amp;lt;U+009F&amp;gt;&amp;lt;U+0087&amp;gt;&amp;lt;U+00A7&amp;gt;&amp;lt;f0&amp;gt;&amp;lt;U+009F&amp;gt;&amp;lt;U+0087&amp;gt;&amp;lt;U+00A9&amp;gt;&amp;quot;                                                                                                                                                                                                                      
## [4] &amp;quot;Bangladesh Biman With  CG For Sefty&amp;quot;                                                                                                                                                                                                                                                                                                    
## [5] &amp;quot;Bangladesh Beats West Indies all out for 61 runs (Lowest Score ever)&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first 5 titles shown above are the uncleaned raw video titles as they exist in the dataset, and the 5 titles shown below are the same titles after cleaning with different regular expressions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# cleaning video titles
videos_year$title= gsub(&amp;quot;&amp;lt;.*?&amp;gt;&amp;quot;,&amp;quot;&amp;quot;, videos_year$title) #removing html tags
videos_year$title= gsub(&amp;quot;[[:punct:]]&amp;quot;, &amp;quot; &amp;quot;, videos_year$title) # removing punctuation
videos_year$title = gsub(&amp;quot;[ |\t]{2,}&amp;quot;, &amp;quot; &amp;quot;, videos_year$title)  # collapsing runs of spaces/tabs
videos_year$title = gsub(&amp;quot;^ &amp;quot;, &amp;quot;&amp;quot;, videos_year$title)  # removing leading blanks
videos_year$title = gsub(&amp;quot; $&amp;quot;, &amp;quot;&amp;quot;, videos_year$title)  # removing trailing blanks
videos_year$title = gsub(&amp;quot; +&amp;quot;, &amp;quot; &amp;quot;, videos_year$title) # collapsing repeated spaces 
videos_year$title = tolower(videos_year$title) # lowering all letters
head(videos_year$title,5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;bangladesh s biggest brothel 101 east&amp;quot;                             
## [2] &amp;quot;hello bangladesh dhaka is crazy&amp;quot;                                   
## [3] &amp;quot;how expensive is bangladesh&amp;quot;                                       
## [4] &amp;quot;bangladesh biman with cg for sefty&amp;quot;                                
## [5] &amp;quot;bangladesh beats west indies all out for 61 runs lowest score ever&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this stage, before entering the data analysis, we’ll create four more variables: the ratios of likes, dislikes and comments to each video’s total views, and each video’s share of the overall view count. Having these ratios will help us later to extract insights without bias, by considering ratios rather than absolute numbers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# creating like, dislike, comment ratio
videos_year = videos_year %&amp;gt;%
  mutate(like_ratio = likeCount/viewCount) %&amp;gt;% 
  mutate(dislike_ratio = dislikeCount/viewCount) %&amp;gt;%
  mutate(comment_ratio = commentCount/viewCount) %&amp;gt;% 
  mutate(ratio_to_total = viewCount/sum(viewCount, na.rm = TRUE))

# names(videos_year)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;part-04-data-analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Part 04: Data analysis&lt;/h2&gt;
&lt;p&gt;To start off, we can look at some basic statistics such as monthly video posting frequency, most viewed, liked, disliked and commented videos and so on.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#month wise video posting (data = videos_year)
videos_year %&amp;gt;%
  group_by(factor(month)) %&amp;gt;%
  mutate(total = n()) %&amp;gt;%
  ungroup() %&amp;gt;%
  ggplot(aes(x = month, y = total)) + 
  geom_line(color = &amp;quot;#27408b&amp;quot;) + 
  geom_point(shape=21,fill=&amp;quot;white&amp;quot;,color=&amp;quot;#27408b&amp;quot;,size=3,stroke=1.1) + 
  scale_x_continuous(breaks = seq(1,12,1), labels = c(&amp;quot;Jan&amp;quot;,&amp;quot;Feb&amp;quot;,&amp;quot;Mar&amp;quot;,&amp;quot;Apr&amp;quot;,&amp;quot;May&amp;quot;,
                                                      &amp;quot;Jun&amp;quot;,&amp;quot;Jul&amp;quot;,&amp;quot;Aug&amp;quot;,&amp;quot;Sep&amp;quot;,&amp;quot;Oct&amp;quot;,
                                                      &amp;quot;Nov&amp;quot;,&amp;quot;Dec&amp;quot;)) +
  labs(x=&amp;quot;Month&amp;quot;,y=&amp;quot;Number of videos&amp;quot;,
       title=&amp;quot;Month wise frequency of video posting&amp;quot;,
       subtitle=&amp;quot;Year 2017 to 2018&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the above chart we can see that there was a growing trend starting from August, which reached its peak in October and then gradually fell again. Apart from August, the other two months with unusual spikes are March and June. Now let’s look at the most viewed videos and also see in which months they were posted.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# most viewed videos
videos_year %&amp;gt;% arrange(desc(viewCount)) %&amp;gt;% head(20) %&amp;gt;%
  mutate(title = strtrim(title, 25)) %&amp;gt;%
  mutate(title = reorder(title,viewCount)) %&amp;gt;% top_n(20) %&amp;gt;% 
  ggplot(aes(as.factor(title), (viewCount/1000000), fill = factor(month))) + 
  geom_col()  + 
  scale_x_discrete() +
  coord_flip() +
  ggtitle(label = &amp;#39;Top 20 most viewed videos&amp;#39;) + 
  xlab(label = &amp;#39;Video title&amp;#39;) + 
  ylab(label = &amp;#39;Number of views (in millions)&amp;#39;) +
  labs(fill = &amp;#39;Month&amp;#39;, caption = &amp;#39;* Video titles have been truncated&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the abundance of pastel and green colors in the above plot we can immediately tell that most of the most viewed videos were posted during the months of June and July. In addition, some other observations can be made from this plot:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There is a wide dispersion among the videos in terms of view counts. The most viewed video has quite a large gap from the second most viewed video, and similar gaps are observed between the other videos too. This means the ratio variables created earlier, during the data cleaning and feature engineering stage, will come in handy in further analysis by helping us overcome the imbalance in the absolute numbers.&lt;/li&gt;
&lt;li&gt;Similar to what we saw during the analysis of tweets regarding Bangladesh, there are multiple videos related to cricket in this list of the top 20 most viewed videos.&lt;/li&gt;
&lt;li&gt;But unlike the last analysis, there are multiple videos, including two of the top five, about a brothel in Bangladesh. A manual inspection of the related videos reveals that all three such videos in the top 20 are about a specific brothel situated in a small town called &lt;em&gt;Doulatdia&lt;/em&gt; in the southern part of Bangladesh. But why would that brothel be a center of attraction for YouTubers? The answer can be found by taking a closer look at the top viewed videos. &lt;a href=&#34;https://www.aljazeera.com/&#34;&gt;Aljazeera&lt;/a&gt;, a Middle Eastern news channel, published a documentary on this brothel in June, which is the fifth most viewed video. The remaining two videos were posted from another channel without a credible record like Aljazeera’s.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So what are the other videos about? We already have a basic understanding of what the top viewed videos were about. To gain a deeper understanding of the common discussion areas across all the videos, we will use different keyword extraction techniques.&lt;/p&gt;
&lt;p&gt;As a first step, let’s transform our text data into a matrix, more precisely a ‘Document Term Matrix’, where a column is created for each distinct word in the text corpus. Each title becomes a row, and each cell holds the number of times that word appears in that title (mostly 1, with 0 for absent words). This eventually creates a sparse matrix with lots of zeros. In our case the sparsity is very high (very close to 100%), meaning that there are lots of zeros, or in other words there is wide variety in the titles of the videos.&lt;/p&gt;
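&lt;p&gt;As a toy illustration of this structure, a tiny two-title corpus (hypothetical titles, not from the dataset) can be turned into a document-term matrix the same way:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy example with two hypothetical titles (illustration only)
library(tm)
toy = c(&amp;#39;dhaka traffic dhaka&amp;#39;, &amp;#39;dhaka cricket&amp;#39;)
toy_dtm = DocumentTermMatrix(Corpus(VectorSource(toy)))
inspect(toy_dtm)
# one column per distinct term (cricket, dhaka, traffic);
# row 1 scores dhaka = 2, traffic = 1, cricket = 0&lt;/code&gt;&lt;/pre&gt;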
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;videos_year$title = removeWords(as.character(videos_year$title), stopwords(&amp;#39;english&amp;#39;))
videos_year_title &amp;lt;- enc2utf8(videos_year$title)
corpus = Corpus(VectorSource(videos_year_title))

dtm = DocumentTermMatrix(corpus)
dtm&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;&amp;lt;DocumentTermMatrix (documents: 589, terms: 1552)&amp;gt;&amp;gt;
## Non-/sparse entries: 3715/910413
## Sparsity           : 100%
## Maximal term length: 23
## Weighting          : term frequency (tf)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;doc.length = apply(dtm, 1, sum)
dtm = dtm[doc.length &amp;gt; 0,]
inspect(dtm[10:11,])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;&amp;lt;DocumentTermMatrix (documents: 2, terms: 1552)&amp;gt;&amp;gt;
## Non-/sparse entries: 13/3091
## Sparsity           : 100%
## Maximal term length: 23
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs aprilia bangladesh bikes border dhaka eskaton expensive ktm new road
##   10       1          1     1      0     1       1         1   1   1    1
##   11       0          1     0      1     0       0         0   0   0    0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Above we can see a summary of our document term matrix, which reports a sparsity of 100% (a rounded figure; it is actually very close to, but not exactly, 100%). We can also inspect individual rows of the matrix: the 10th and 11th documents have been fetched as examples. The 10th title contains all the sampled terms except ‘border’, while the 11th contains only the two words ‘bangladesh’ and ‘border’ among the sampled columns.&lt;/p&gt;
&lt;p&gt;Now let’s take a look at the most frequently used words.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;freq = colSums(as.matrix(dtm))
length(freq)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 1552&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ord = order(freq, decreasing = TRUE)
freq[head(ord, n = 20)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  bangladesh        2017       dhaka       india         top     company 
##         614         106          45          44          39          39 
##    skydance       dance    rohingya performance     wedding       match 
##          38          36          34          32          30          29 
##       world         new      bangla        army     myanmar        news 
##          25          23          19          19          18          17 
##       price       video 
##          15          14&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Bangladesh and 2017 are the top two most frequent words for obvious reasons. The third most frequent word is Dhaka, the capital of Bangladesh. Interestingly, we see a trend similar to our last analysis of tweets: there is quite an interest in the neighboring countries. In this case we see videos that mention India and Myanmar, and &lt;em&gt;rohingya&lt;/em&gt; is again one of the most frequent words. A seemingly unusual prevalence of dance-related words can also be observed among the most frequent words. Looking at word correlations may give us a better understanding.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#library &amp;#39;tibble&amp;#39; provides the rowname_to_columns to name row names column
df1=as.data.frame(findAssocs(dtm,&amp;#39;rohingya&amp;#39;,.24)) %&amp;gt;% rownames_to_column(&amp;quot;words&amp;quot;)
df2=as.data.frame(findAssocs(dtm,&amp;#39;dhaka&amp;#39;,.24)) %&amp;gt;% rownames_to_column(&amp;quot;words&amp;quot;)
df3=as.data.frame(findAssocs(dtm,&amp;#39;dance&amp;#39;,.30)) %&amp;gt;% rownames_to_column(&amp;quot;words&amp;quot;)
df4=as.data.frame(findAssocs(dtm,&amp;#39;india&amp;#39;,.28)) %&amp;gt;% rownames_to_column(&amp;quot;words&amp;quot;)
df5=as.data.frame(findAssocs(dtm,&amp;#39;myanmar&amp;#39;,.24)) %&amp;gt;% rownames_to_column(&amp;quot;words&amp;quot;)
df6=as.data.frame(findAssocs(dtm,&amp;#39;match&amp;#39;,.40)) %&amp;gt;% rownames_to_column(&amp;quot;words&amp;quot;)

#correlation values have been varied intentionally to restrict the number of outputs

df = df1 %&amp;gt;% full_join(df2, by = &amp;#39;words&amp;#39;) %&amp;gt;% full_join(df3, by = &amp;#39;words&amp;#39;) %&amp;gt;% full_join(df4, by = &amp;#39;words&amp;#39;) %&amp;gt;% full_join(df5, by = &amp;#39;words&amp;#39;)  %&amp;gt;% full_join(df6, by = &amp;#39;words&amp;#39;)
df = gather(df, &amp;quot;key&amp;quot;,&amp;#39;n&amp;#39;,2:7)
df %&amp;gt;% ggplot(aes(words, n, fill = factor(key))) + geom_col() + coord_flip() + 
  labs(x = &amp;quot;Words&amp;quot;, y = &amp;quot;Counts&amp;quot;, fill = &amp;quot;Topic&amp;quot;, title = &amp;quot;Topic wise most frequent words&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-11-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We have used different correlation thresholds, ranging from 0.24 to 0.40, to restrict the output to fewer than 10 correlated words for each keyword. Now we have a clearer picture of how our most frequent words relate to other words. Myanmar has come into the picture mostly because of the &lt;a href=&#34;https://www.cnn.com/specials/asia/rohingya&#34;&gt;Rohingya crisis&lt;/a&gt;, and discussions related to India were about border issues and cricket. The correlation of Dhaka, the capital of Bangladesh, with words related to areas and parks may mean that those videos are mostly about the recreational areas around Dhaka.&lt;/p&gt;
&lt;p&gt;We can also see that the words &lt;em&gt;skydance&lt;/em&gt;, &lt;em&gt;company&lt;/em&gt;, &lt;em&gt;wedding&lt;/em&gt; and &lt;em&gt;performance&lt;/em&gt; correlate very highly (above 0.90) with &lt;em&gt;dance&lt;/em&gt;, which suggests that the videos are most likely about wedding dance performances by a group called Skydance. Let’s look at the channels that have posted the most videos, which may give us some new insight.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;videos_year %&amp;gt;% count(channelTitle) %&amp;gt;% arrange(desc(n)) %&amp;gt;%
  mutate(channelTitle = reorder(channelTitle, n)) %&amp;gt;% head(10) %&amp;gt;%
  ggplot(aes(channelTitle, n)) + 
  geom_col()  + 
  scale_x_discrete() +
  coord_flip() +
  ggtitle(label = &amp;#39;Top 10 channels with most video uploads&amp;#39;) + 
  xlab(label = &amp;#39;Channel name&amp;#39;) + 
  ylab(label = &amp;#39;Number of videos&amp;#39;) +
  labs(fill = &amp;#39;Ratio to the total video posted&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As we assumed before, we can see that the channel named ‘Skydance company’ posted the highest number of videos (more than 30), and their many videos on wedding programs pushed the words ‘dance’, ‘wedding’ and ‘performance’ into the list of most frequent words.&lt;/p&gt;
&lt;p&gt;Among the other top video posters, the presence of two news channels (Al Jazeera and Daily Star, a local one) shows that Bangladesh gets quite good coverage from news agencies. It would have been nice to see how Bangladesh is represented in these news videos, but in this analysis we won’t focus on news. Maybe this can be a future project to work on!
But before moving further into text mining, we will take a look at the numbers of likes, dislikes and comments to find answers to some of the questions we asked at the beginning of our analysis:
- What are the most liked videos?
- What are the most disliked videos?
- And which videos got most traction with the viewers through comments?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(grid)
library(gridExtra)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: package &amp;#39;gridExtra&amp;#39; was built under R version 4.0.3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p1=videos_year %&amp;gt;% arrange(desc(likeCount)) %&amp;gt;% head(10) %&amp;gt;%
    mutate(title = strtrim(title, 25)) %&amp;gt;%
  mutate(title = reorder(title,likeCount)) %&amp;gt;%
  ggplot(aes(title, likeCount)) + geom_col()+ xlab(label=&amp;quot;&amp;quot;)+ ggtitle(label = &amp;#39;Top 10 most liked videos&amp;#39;) +  theme(axis.text.x=element_text(angle=45,hjust=1))

p2=videos_year %&amp;gt;% arrange(desc(dislikeCount)) %&amp;gt;% head(10) %&amp;gt;% 
    mutate(title = strtrim(title, 25)) %&amp;gt;%
  mutate(title = reorder(title,dislikeCount)) %&amp;gt;%
  ggplot(aes(title, dislikeCount)) + geom_col()  +  xlab(label = &amp;quot;&amp;quot;)+ ggtitle(label = &amp;#39;Top 10 most disliked videos&amp;#39;) + theme(axis.text.x=element_text(angle=45,hjust=1)) + 
  labs(caption = &amp;#39;* Video titles have been truncated&amp;#39;)
  
grid.arrange(p1,p2, ncol = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Interestingly enough, looking at the lists of the 10 most liked and most disliked videos we can see some common names! Four of the ten videos appear in both the most liked and the most disliked lists. Another interesting finding is that the most liked and disliked videos in the lists are made in languages other than &lt;em&gt;Bangla&lt;/em&gt;, the national language of Bangladesh. From that it can reasonably be assumed that these videos were created by non-Bangladeshi people.&lt;/p&gt;
&lt;p&gt;Let’s dig deeper into the relationship between likes and dislikes. We can examine the correlation between likes and dislikes with a scatter plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p1=videos_year %&amp;gt;%
  ggplot(aes(likeCount,dislikeCount)) +
  geom_jitter(alpha = 0.4, shape = 1) + 
  labs(subtitle = &amp;#39;All records&amp;#39;) +  
  xlab(label = &amp;#39;Count of likes&amp;#39;) + ylab(label = &amp;#39;Count of dislikes&amp;#39;)

quantile(videos_year$likeCount, na.rm = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      0%     25%     50%     75%    100% 
##     0.0    43.5   268.0  1241.0 78914.0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;quantile(videos_year$dislikeCount, na.rm = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      0%     25%     50%     75%    100% 
##     0.0     4.0    32.0   136.5 11703.0&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p2=videos_year %&amp;gt;% filter(likeCount &amp;lt;= 1241 &amp;amp; likeCount &amp;gt;= 44 &amp;amp; dislikeCount &amp;lt;= 137  &amp;amp; dislikeCount &amp;gt;= 4) %&amp;gt;%
  ggplot(aes(likeCount,dislikeCount)) +
  geom_jitter(alpha = 0.4, shape = 1) + 
  labs(subtitle = &amp;#39;Both lowest and highest quantile values removed&amp;#39;) +  
  xlab(label = &amp;#39;Count of likes&amp;#39;) + ylab(label = &amp;#39;Count of dislikes&amp;#39;)

grid.arrange(p1,p2, ncol = 2, top = &amp;#39;Likes vs Dislikes&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The left chart above represents all the videos. The high density in the bottom-left corner clearly shows the high degree of skewness in the data, meaning there is a large disparity among the videos in the numbers of likes and dislikes. This is similar to what we observed for the number of views.&lt;/p&gt;
&lt;p&gt;To overcome this clutter and derive more meaningful insight, only the records between the lowest and highest quartiles of both likes and dislikes were kept, removing the extreme values at both ends. Looking at the scatter plot on the right, we can now observe a roughly linear relationship between the numbers of likes and dislikes, which is reflected in their correlation of 0.64. In other words, videos that attract a high number of likes also tend to attract a high number of dislikes, and vice versa.&lt;/p&gt;
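&lt;p&gt;As a side note, a correlation figure like the 0.64 above can be computed with base R’s &lt;em&gt;cor&lt;/em&gt; function. The snippet below is only a sketch with toy vectors (not the actual dataset); the &lt;em&gt;use&lt;/em&gt; argument drops pairs with missing counts, as our like and dislike columns would require.&lt;/p&gt;

```r
# Toy like/dislike counts (illustrative only, not the real data).
# use = "complete.obs" ignores pairs where either value is NA.
likes    = c(10, 40, 90, 160, NA)
dislikes = c( 1,  4,  9,  16, 25)
cor(likes, dislikes, use = "complete.obs")
# [1] 1
```

&lt;p&gt;Here the toy vectors are perfectly linear, so the correlation is exactly 1; the real data only reach 0.64.&lt;/p&gt;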
&lt;p&gt;Now let’s look at the number of comments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;videos_year %&amp;gt;% arrange(desc(commentCount)) %&amp;gt;% head(10) %&amp;gt;%
    mutate(title = strtrim(title, 25)) %&amp;gt;%
  mutate(title = reorder(title, commentCount)) %&amp;gt;%
  ggplot(aes(title, commentCount)) + geom_col()+ ggtitle(label = &amp;#39;Top 10 most commented videos&amp;#39;) +xlab(label=&amp;quot;&amp;quot;)+ coord_flip() + 
   theme(plot.title = element_text(hjust = -0.45, vjust=2.12)) +
  labs(caption = &amp;#39;* Video titles have been truncated&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking at the most commented videos, as expected we can see some common names from the previous charts of most viewed, liked and disliked videos. But what does the relationship between comments and likes or dislikes look like? Do people comment more when they like a video, or the opposite? To get an idea we will consider the ratio of likes to dislikes and plot it on top of the comment chart.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;videos_year = videos_year %&amp;gt;% mutate(like_dislike = round(likeCount/dislikeCount, 2))
videos_year %&amp;gt;% arrange(desc(commentCount)) %&amp;gt;% head(10) %&amp;gt;%
    mutate(title = strtrim(title, 25)) %&amp;gt;%
  mutate(title = reorder(title, commentCount)) %&amp;gt;%
  ggplot(aes(title, commentCount, fill = like_dislike)) + geom_col()+ xlab(label=&amp;quot;&amp;quot;)+ coord_flip() + ggtitle(label = &amp;#39;Top 10 most commented videos with likes/dislike ratio&amp;#39;) + 
  scale_fill_continuous(&amp;quot;Likes to Dislikes&amp;quot;) + 
  labs(caption = &amp;#39;* Video titles have been truncated&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can now immediately see the prevalence of extreme colors at both ends (very dark or very light blue), which means that the majority of the most commented videos are either highly liked or heavily disliked.&lt;/p&gt;
&lt;p&gt;Let’s now check both likes and dislikes and their relationship with the number of comments. To do that we will plot likes and dislikes on a scatter plot and color-code the points by comment-count category.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;videos_year %&amp;gt;% filter(dislikeCount &amp;lt;3000 &amp;amp; likeCount &amp;lt;20000) %&amp;gt;%
  mutate(comment_cat = ifelse(commentCount &amp;lt;= 1000, &amp;#39;&amp;lt; or = 1000&amp;#39;, 
                              ifelse(commentCount &amp;lt;= 2000, &amp;#39;1000 to 2000&amp;#39;,
                                     ifelse(commentCount &amp;lt;= 3000, &amp;#39;2000 to 3000&amp;#39;, &amp;#39;above 3000&amp;#39;)))) %&amp;gt;% drop_na() %&amp;gt;%
  ggplot(aes(likeCount, dislikeCount, color = comment_cat)) + 
  geom_point(size = 3,alpha = 0.4)  + 
  ggtitle(label = &amp;#39;Likes vs Dislikes&amp;#39;, subtitle = &amp;#39;Videos with upper extreme values in likes and dislikes not considered&amp;#39;) + 
  scale_colour_discrete(name = &amp;#39;Comment count category&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the abundance of light red color in the plot above, we can immediately see that there are not many videos that generated more than 1,000 comments. Another takeaway is that videos with far more likes than dislikes generate the highest numbers of comments, above 3,000 (violet color), while videos with more dislikes than likes have no more than 1,000 comments in total. Moreover, videos with both high likes and high dislikes generate a moderately high number of comments, 1,000 to 2,000 (light green). The higher correlation between like count and comment count (0.73), versus the lower correlation between dislike count and comment count (0.58), also supports the observation that videos with more likes tend to have more comments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(videos_year$likeCount, videos_year$commentCount, use = &amp;#39;complete&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.7320702&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cor(videos_year$dislikeCount, videos_year$commentCount, use = &amp;#39;complete&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.5827611&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this stage of our analysis we can move on to analyzing the comments a bit further. But due to the Youtube API’s restriction on the number of records that can be collected, comments will be collected for selected videos only. From our initial study of the video titles we have already seen some topics or areas that were frequently present, such as &lt;em&gt;dhaka&lt;/em&gt;, &lt;em&gt;india&lt;/em&gt;, &lt;em&gt;rohingya&lt;/em&gt;, &lt;em&gt;cricket&lt;/em&gt; and &lt;em&gt;dance&lt;/em&gt;. In the next phase of our analysis we’ll look into these areas, except for the videos related to &lt;em&gt;dance&lt;/em&gt;, since we have already seen that those videos were mostly posted by one channel and didn’t gather many views.&lt;/p&gt;
&lt;p&gt;To fetch comments for these selected topics, first all the videos in our list have been classified under five categories: cricket, dance, dhaka, india and rohingya. To do that, the total dataset has been subsetted using words that have high correlation with the ‘key words’, which are rohingya, india, dhaka, cricket and dance in this case. In the next phase, comments will be collected for the videos under these categories and sentiment analysis will be conducted to draw insight about the general sentiments expressed by audiences around those topics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;videos_year=videos_year %&amp;gt;% 
  mutate(labels = ifelse(grepl(paste(c(&amp;quot;rohingya&amp;quot;,&amp;#39;refugee&amp;#39;,&amp;#39;flee&amp;#39;,&amp;#39;myanmar&amp;#39;), collapse = &amp;#39;|&amp;#39;),title), &amp;#39;rohingya&amp;#39;,
                         (ifelse(grepl(&amp;quot;dhaka&amp;quot;,title), &amp;#39;dhaka&amp;#39;,
                                 (ifelse(grepl(&amp;#39;skydance&amp;#39;,title), &amp;#39;dance&amp;#39;,
                                      (ifelse(grepl(paste(c(&amp;quot;india&amp;quot;,&amp;#39;border&amp;#39;,&amp;#39;kolkata&amp;#39;),collapse = &amp;#39;|&amp;#39;),title), &amp;#39;india&amp;#39;,
                                              ifelse(grepl(paste(c(&amp;#39;cricket&amp;#39;,&amp;#39;press&amp;#39;,&amp;#39;post&amp;#39;,&amp;#39;conference&amp;#39;), collapse = &amp;quot;|&amp;quot;), title),&amp;#39;cricket&amp;#39;,NA )))))))))

videos_year=videos_year %&amp;gt;% 
  mutate(labels = ifelse(grepl(paste(df1$words, collapse = &amp;#39;|&amp;#39;),title), &amp;#39;rohingya&amp;#39;,
                         (ifelse(grepl(paste(df2$words, collapse = &amp;#39;|&amp;#39;),title), &amp;#39;dhaka&amp;#39;,
                                 (ifelse(grepl(&amp;#39;skydance&amp;#39;,title), &amp;#39;dance&amp;#39;,
                                      (ifelse(grepl(paste(df4$words,collapse = &amp;#39;|&amp;#39;),title), &amp;#39;india&amp;#39;,
                                              ifelse(grepl(paste(df6$words, collapse = &amp;quot;|&amp;quot;), title),&amp;#39;cricket&amp;#39;,NA )))))))))

summary(factor(videos_year$labels))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  cricket    dance    dhaka    india rohingya     NA&amp;#39;s 
##       23       38       15       22       36      455&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Out of the total 589 videos in our initial dataset, we could make a rough classification of 134 videos. We will further restrict the number of videos considered for analysis by taking only the 10 most commented videos from each topic, so that we can still stay within the quota of the Youtube API.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# function to fetch comments of specific cateogry of video
fetchComments = function(dataset, keyword){
  df = dataset %&amp;gt;% filter(labels == keyword) %&amp;gt;% 
    arrange(desc(commentCount)) %&amp;gt;% head(10)
  comments = lapply(as.character(df$video_id), function(x){
    get_comment_threads(c(video_id = x), text_format = &amp;#39;plainText&amp;#39;)
  })
  comments = ldply(comments, data.frame) %&amp;gt;% select(videoId, textDisplay, likeCount, publishedAt)
  comments$textDisplay = as.character(comments$textDisplay)
  comments$label = keyword

  return(comments)
  
}


comment_rohingya = fetchComments(videos_year, &amp;#39;rohingya&amp;#39;)
comment_cricket = fetchComments(videos_year, &amp;#39;cricket&amp;#39;)
comment_dhaka = fetchComments(videos_year, &amp;#39;dhaka&amp;#39;)
comment_india = fetchComments(videos_year, &amp;#39;india&amp;#39;)

video_comments = rbind(comment_rohingya,comment_india,comment_dhaka,comment_cricket)
# write.csv(video_comments,&amp;#39;video_comments_top10videos_four_categories.csv&amp;#39;, row.names = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;video_comments=read.csv(&amp;#39;../../../source_files/video_comments_top10videos_four_categories.csv&amp;#39;)
summary(video_comments$label)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    Length     Class      Mode 
##      1990 character character&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see that there are a total of 1,990 comments collected, where the topics generating the most comments are rohingya, dhaka, cricket and india, in descending order. As we did with our previous dataset &lt;strong&gt;videos_year&lt;/strong&gt;, we will follow similar steps to clean the date and text variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;video_comments$publishedAt = as.Date(video_comments$publishedAt) 
video_comments= video_comments %&amp;gt;% mutate(tidy_date = floor_date(publishedAt, unit = &amp;quot;month&amp;quot;))
summary(video_comments$tidy_date)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## &amp;quot;2017-02-01&amp;quot; &amp;quot;2017-09-01&amp;quot; &amp;quot;2017-11-01&amp;quot; &amp;quot;2017-11-25&amp;quot; &amp;quot;2018-03-01&amp;quot; &amp;quot;2018-07-01&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, in addition to converting the date column &lt;em&gt;publishedAt&lt;/em&gt; into a proper date type, a new date column has also been created from its values. The new column &lt;strong&gt;tidy_date&lt;/strong&gt; floors every date to the first day of its month, so all dates are clubbed under their respective month. For example, comments posted on Feb-01 and Feb-19 are both clubbed under Feb-01, and so on. From the summary of this new variable above, we can see that we have records from as early as February 2017 and as late as July 2018.&lt;/p&gt;
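&lt;p&gt;For readers without &lt;em&gt;lubridate&lt;/em&gt;, the same month flooring that &lt;em&gt;floor_date&lt;/em&gt; performs can be sketched in base R; the helper name &lt;em&gt;floor_month&lt;/em&gt; below is made up for illustration.&lt;/p&gt;

```r
# Base-R sketch of flooring dates to the first day of their month,
# mimicking lubridate::floor_date(x, unit = "month")
floor_month = function(x) as.Date(format(x, "%Y-%m-01"))

dates = as.Date(c("2017-02-01", "2017-02-19", "2018-07-30"))
floor_month(dates)
# [1] "2017-02-01" "2017-02-01" "2018-07-01"
```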
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# cleaning comments
head(video_comments$textDisplay,5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;Thank god for Aung San Suu Kyi for doing this.  I live in Yangon and I am so fed up with muslims acting like they own the world.  They are cowards who need to stop their terrorist attacks, or else more of this will happen.&amp;quot;
## [2] &amp;quot;You know west is weak, when even the budhists show more balls dealing with islam.&amp;quot;                                                                                                                                             
## [3] &amp;quot;Look out people. Evil George soros is using canadian government bob ray to make sure the slaughter continues&amp;quot;                                                                                                                  
## [4] &amp;quot;Humanity has no religion! Bhutanese cruel king n Indian Govt. sent more then 6 hundred thousand refugees to nepal in during 90s they r still in nepal . \nWelcome to nepal rohingya brothers n sister!!&amp;quot;                       
## [5] &amp;quot;George Soros is involved somewhere,i can smell his MO.  If you want real news go to UK Column news.(yt)&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;video_comments$textDisplay &amp;lt;- iconv(video_comments$textDisplay, to = &amp;quot;ASCII&amp;quot;, sub = &amp;quot; &amp;quot;) # convert to ASCII characters to remove any text written using anything other than plain english letter e.g. bengali words
video_comments$textDisplay= gsub(&amp;quot;&amp;lt;.*?&amp;gt;&amp;quot;,&amp;quot;&amp;quot;, video_comments$textDisplay) #removing html tags
video_comments$textDisplay= gsub(&amp;quot;[[:punct:]]&amp;quot;, &amp;quot; &amp;quot;, video_comments$textDisplay) #removing punctuation marks
video_comments$textDisplay = gsub(&amp;quot;[ |\t]{2,}&amp;quot;, &amp;quot; &amp;quot;, video_comments$textDisplay)  # Remove tabs
video_comments$textDisplay = gsub(&amp;quot;^ &amp;quot;, &amp;quot;&amp;quot;, video_comments$textDisplay)  # Leading blanks
video_comments$textDisplay = gsub(&amp;quot; $&amp;quot;, &amp;quot;&amp;quot;, video_comments$textDisplay)  # Lagging blanks
video_comments$textDisplay = gsub(&amp;quot; +&amp;quot;, &amp;quot; &amp;quot;, video_comments$textDisplay) # General spaces 
video_comments$textDisplay = tolower(video_comments$textDisplay) # lowering all letters
head(video_comments$textDisplay,5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;thank god for aung san suu kyi for doing this i live in yangon and i am so fed up with muslims acting like they own the world they are cowards who need to stop their terrorist attacks or else more of this will happen&amp;quot;
## [2] &amp;quot;you know west is weak when even the budhists show more balls dealing with islam&amp;quot;                                                                                                                                         
## [3] &amp;quot;look out people evil george soros is using canadian government bob ray to make sure the slaughter continues&amp;quot;                                                                                                             
## [4] &amp;quot;humanity has no religion bhutanese cruel king n indian govt sent more then 6 hundred thousand refugees to nepal in during 90s they r still in nepal \nwelcome to nepal rohingya brothers n sister&amp;quot;                       
## [5] &amp;quot;george soros is involved somewhere i can smell his mo if you want real news go to uk column news yt&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# write.csv(video_comments,&amp;#39;cleaned_video_comments_top10videos_four_categories.csv&amp;#39;, row.names = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the text cleaning step for the &lt;em&gt;textDisplay&lt;/em&gt; column, we removed the html tags and punctuation marks, collapsed extra whitespace and lowercased all letters. Above we can see samples of the text before and after the transformation: the first block of five comments shows the text before cleaning was applied, and the second block shows the same comments after cleaning. Next we’ll calculate a sentiment score for each comment using different lexicon libraries. For this purpose the &lt;em&gt;tidytext&lt;/em&gt; package will be used, which offers the commonly used lexicon libraries (bing, afinn, nrc).&lt;/p&gt;
&lt;p&gt;Before moving into the analysis of sentiment scores, we’ll take a look at the trend of comment posting. Naturally we expect comments on different topics to be posted at different rates over the months.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# calculating label-wise trend of comment posting
video_comments %&amp;gt;%
  group_by(tidy_date) %&amp;gt;%
  mutate(total_videos_month = n()) %&amp;gt;%
  ungroup() %&amp;gt;%
  count(tidy_date, label, total_videos_month) %&amp;gt;% 
  mutate(percent_video_label = n/total_videos_month) %&amp;gt;%
  ggplot(aes(tidy_date, percent_video_label, color = label, group = 1)) +         facet_wrap(~label)+ geom_line(size = 1) +
  scale_x_date(date_labels=&amp;quot;%b %y&amp;quot;,date_breaks  =&amp;quot;2 month&amp;quot;) +    
  theme(axis.text.x=element_text(angle=45,hjust=1))+
  ggtitle(label = &amp;#39;Month wise frequency of comments&amp;#39;) + 
  xlab(label = &amp;#39;Month&amp;#39;) + 
  ylab(label = &amp;#39;Percentage&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-24-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;video_comments %&amp;gt;%
  group_by(tidy_date) %&amp;gt;%
  mutate(total_videos_month = n()) %&amp;gt;% # each row of video represent a single comment
  ungroup() %&amp;gt;%
  count(tidy_date, label, total_videos_month) %&amp;gt;% 
  mutate(percent_video_label = n/total_videos_month) %&amp;gt;%
  ggplot(aes(tidy_date, percent_video_label, color = label)) +         facet_wrap(~label)+ 
  geom_point(shape = 21, fill = &amp;quot;white&amp;quot;, color = &amp;quot;#27408b&amp;quot;, size = 2, stroke = 1.1)+
  geom_line(color=&amp;quot;#27408b&amp;quot;) +
  scale_x_date(date_labels=&amp;quot;%b %y&amp;quot;,date_breaks  =&amp;quot;2 month&amp;quot;) +    
  theme(axis.text.x=element_text(angle=45,hjust=1))+
  ggtitle(label = &amp;#39;Month wise frequency of comments&amp;#39;) + 
  xlab(label = &amp;#39;Month&amp;#39;) + 
  ylab(label = &amp;#39;Percentage&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-24-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# video_comments %&amp;gt;% filter(tidy_date == &amp;#39;2017-06-01&amp;#39; &amp;amp; label == &amp;#39;india&amp;#39;) %&amp;gt;% select(videoId) %&amp;gt;% unique()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One nice thing about the plot above is that it captures how social media activity spikes and then flattens over time. For example, the videos posted about the &lt;em&gt;rohingya&lt;/em&gt; topic had the highest traction in September 2017, around the time the crisis started, but comments have gradually slowed down since November 2017. On the other hand, comments on videos about &lt;em&gt;cricket&lt;/em&gt; and &lt;em&gt;india&lt;/em&gt; show some resemblance: until September 2017 both topics experienced spikes, in other words higher numbers of comments posted, and later on the trend slowed down. We may now look at the trend from the perspective of the age of the videos. How did the sentiment of the audiences of these videos change over the lifetime of the videos? Or does sentiment change as the videos grow old?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# fetching video publishing date from the old dataset
video_comments = video_comments %&amp;gt;% left_join(videos_year[,c(&amp;#39;video_id&amp;#39;,&amp;#39;publishedAt&amp;#39;)], by = c(&amp;#39;videoId&amp;#39; = &amp;#39;video_id&amp;#39;), suffix = c(&amp;#39;_comment&amp;#39;,&amp;#39;_video&amp;#39;)) 
# creating new variable with the difference between video posting date and comment posting date
video_comments = video_comments %&amp;gt;% 
  mutate(post_comm_gap = publishedAt_comment - publishedAt_video)

p1=video_comments %&amp;gt;%
  group_by(post_comm_gap) %&amp;gt;%
  mutate(total_videos = n()) %&amp;gt;%
  ungroup() %&amp;gt;% 
  ggplot(aes(post_comm_gap, total_videos)) +         
  facet_wrap(~label)+ geom_jitter() +
  theme(axis.text.x=element_text(angle=45,hjust=1))+
  labs(subtitle = &amp;#39;All comments&amp;#39;) + 
  xlab(label = &amp;#39;Age of comment&amp;#39;) + 
  ylab(label = &amp;#39;Count&amp;#39;)

p2=video_comments %&amp;gt;%
  group_by(post_comm_gap) %&amp;gt;%
  mutate(total_videos = n()) %&amp;gt;%
  ungroup() %&amp;gt;% 
  filter(total_videos &amp;lt; 20) %&amp;gt;%
  ggplot(aes(post_comm_gap, total_videos)) +         
  facet_wrap(~label)+ geom_jitter() +
  theme(axis.text.x=element_text(angle=45,hjust=1))+
  labs(subtitle = &amp;#39;Less than 20 total comments&amp;#39;) + 
  xlab(label = &amp;#39;Age of comment&amp;#39;) + 
  ylab(label = &amp;#39;Count&amp;#39;)

library(gridExtra)
grid.arrange(p1,p2, ncol = 2, top = &amp;#39;Count of comments vs when they are posted&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-25-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the left plot above, we can see that the highest number of comments are posted right after a video is published (where the age of the comment is ‘zero’, meaning the video and the comment were posted on the same date). But since we can’t really make much sense of the graph because of the extremely skewed data on the y axis, we can consider only the lower values to check whether there is any specific trend. Doing that we end up with the plot on the right, where only y-axis values below 20 were considered.&lt;/p&gt;
&lt;p&gt;From the right plot above, we can see that on average the number of comments tends to slow down 350 days after the video posting date, but comments on videos about &lt;em&gt;India&lt;/em&gt; seemingly keep this traction going further. The videos about the &lt;em&gt;Rohingya&lt;/em&gt; issue, meanwhile, don’t have any comments after the 300th day. This is largely because the oldest video considered here about &lt;em&gt;Rohingya&lt;/em&gt; was posted in August 2017, which barely gives a life span of slightly more than 300 days, while the other topics include videos from the very first months of 2017. But from the trend of comments on the other topics we can assume that the comments under the existing videos about the rohingya crisis may also take a gradual downturn soon!&lt;/p&gt;
&lt;p&gt;Now moving on to the sentiment analysis, we’ll use a lexicon-based approach. To do that we will look at the comments that viewers left below these videos and calculate sentiment scores using the lexicon library &lt;a href=&#34;http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm&#34;&gt;&lt;em&gt;NRC&lt;/em&gt;&lt;/a&gt;. To explain briefly, lexicon libraries are stocks of words that are pre-labeled with the sentiment they carry. For example, &lt;em&gt;happy&lt;/em&gt; would be labeled as carrying positive sentiment while &lt;em&gt;cry&lt;/em&gt; would be labeled as negative. Unlike the bing lexicon, which categorizes words in a binary fashion into positive and negative, NRC also labels words with emotions such as anger, fear, joy and sadness.&lt;/p&gt;
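&lt;p&gt;The idea can be sketched with a tiny hand-made lexicon; the words and scores below are hypothetical and far simpler than the actual NRC entries, but the mechanics of matching comment words against a pre-labeled list are the same.&lt;/p&gt;

```r
# Hypothetical mini lexicon: +1 for positive words, -1 for negative ones
lexicon = c(happy = 1, good = 1, cry = -1, fear = -1)

comment = "i am happy but they cry"
words = strsplit(tolower(comment), " ")[[1]]

# look up every word; words absent from the lexicon give NA
scores = lexicon[words]
sum(scores, na.rm = TRUE)   # net sentiment of the comment
# [1] 0
```

&lt;p&gt;One positive match (happy) and one negative match (cry) cancel out, giving this comment a net score of zero.&lt;/p&gt;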
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# creating a new column with the words
video_comments$textDisplay = as.character(video_comments$textDisplay)
token = video_comments %&amp;gt;% 
  unnest_tokens(word, textDisplay, token = &amp;#39;words&amp;#39;, drop = FALSE) 

token %&amp;gt;% select(label, tidy_date, word) %&amp;gt;%
  inner_join(get_sentiments(&amp;#39;nrc&amp;#39;), by = &amp;#39;word&amp;#39;) %&amp;gt;% 
  group_by(label,tidy_date) %&amp;gt;% 
  mutate(label_month_total = n()) %&amp;gt;%
  ungroup() %&amp;gt;%
  group_by(label,sentiment,tidy_date) %&amp;gt;%
  mutate(label_month_senti_total = n()) %&amp;gt;%
  ungroup() %&amp;gt;%
  mutate(percent_sentiment = label_month_senti_total/label_month_total) %&amp;gt;% select(label, tidy_date, sentiment, percent_sentiment) %&amp;gt;% unique() %&amp;gt;% 
  ggplot(aes(tidy_date, percent_sentiment, color = factor(sentiment))) + geom_line(size = 1) + facet_wrap(~label) + theme(axis.text.x=element_text(angle=45,hjust=1)) + 
  labs(title = &amp;#39;Different sentiment trending over the time&amp;#39;, x = &amp;#39;Date&amp;#39;, y = &amp;#39;Percentage&amp;#39;, colour = &amp;quot;Sentiments&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-26-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# nrc = get_sentiments(&amp;#39;nrc&amp;#39;)
# summary(factor(nrc$sentiment))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The plot above shows all the sentiments available in the &lt;em&gt;NRC lexicon&lt;/em&gt;, which presents an immediate challenge: the lines are too cluttered to interpret. To make it legible we can club the negative and positive sentiments together and plot them separately. But before we do that, if we look back at the plot and try to interpret the lines, we immediately see a prevalence of positive sentiment. This is due to the nature of the lexicon library: the majority of the matched words are classified or labeled as positive, which is reflected in the plot above too.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p1 = token %&amp;gt;% select(label, tidy_date, word) %&amp;gt;%
  inner_join(get_sentiments(&amp;#39;nrc&amp;#39;), by = &amp;#39;word&amp;#39;) %&amp;gt;% 
  filter(sentiment %in% c(&amp;#39;anger&amp;#39;,&amp;#39;disgust&amp;#39;,&amp;#39;fear&amp;#39;,&amp;#39;sadness&amp;#39;,&amp;#39;negative&amp;#39;)) %&amp;gt;%
  group_by(label,tidy_date) %&amp;gt;% 
  mutate(label_month_total = n()) %&amp;gt;%
  ungroup() %&amp;gt;%
  group_by(label,sentiment,tidy_date) %&amp;gt;%
  mutate(label_month_senti_total = n()) %&amp;gt;%
  ungroup() %&amp;gt;%
  mutate(percent_sentiment = label_month_senti_total/label_month_total) %&amp;gt;% select(label, tidy_date, sentiment, percent_sentiment) %&amp;gt;% unique() %&amp;gt;% 
  ggplot(aes(tidy_date, percent_sentiment, color = factor(sentiment))) + geom_line(size = 1) + facet_wrap(~label) + theme(axis.text.x=element_text(angle=45,hjust=1)) + 
  labs(title = &amp;#39;Negative sentiments trending over the time&amp;#39;, x = &amp;#39;Date&amp;#39;, y = &amp;#39;Percentage&amp;#39;, colour = &amp;quot;Sentiments&amp;quot;) +
  scale_x_date(date_labels=&amp;quot;%b %y&amp;quot;,date_breaks  =&amp;quot;2 month&amp;quot;) 

p2 = token %&amp;gt;% select(label, tidy_date, word) %&amp;gt;%
  inner_join(get_sentiments(&amp;#39;nrc&amp;#39;), by = &amp;#39;word&amp;#39;) %&amp;gt;% 
  filter(!sentiment %in% c(&amp;#39;anger&amp;#39;,&amp;#39;disgust&amp;#39;,&amp;#39;fear&amp;#39;,&amp;#39;sadness&amp;#39;,&amp;#39;negative&amp;#39;)) %&amp;gt;%
  group_by(label,tidy_date) %&amp;gt;% 
  mutate(label_month_total = n()) %&amp;gt;%
  ungroup() %&amp;gt;%
  group_by(label,sentiment,tidy_date) %&amp;gt;%
  mutate(label_month_senti_total = n()) %&amp;gt;%
  ungroup() %&amp;gt;%
  mutate(percent_sentiment = label_month_senti_total/label_month_total) %&amp;gt;% select(label, tidy_date, sentiment, percent_sentiment) %&amp;gt;% unique() %&amp;gt;% 
  ggplot(aes(tidy_date, percent_sentiment, color = factor(sentiment))) + geom_line(size = 1) + facet_wrap(~label) + theme(axis.text.x=element_text(angle=45,hjust=1)) + 
  labs(title = &amp;#39;Positive sentiments trending over the time&amp;#39;, x = &amp;#39;Date&amp;#39;, y = &amp;#39;Percentage&amp;#39;, colour = &amp;quot;Sentiments&amp;quot;)+
  scale_x_date(date_labels=&amp;quot;%b %y&amp;quot;,date_breaks  =&amp;quot;2 month&amp;quot;) 

grid.arrange(p1,p2, ncol = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-27-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking at the left plot, two interesting trends can be spotted:
* About India, there has been a sudden growth of &lt;em&gt;anger&lt;/em&gt; during the months of June and July 2018.
* On the other hand, surprisingly, negative sentiment about the Rohingya issue is also growing in the months of June and July 2018.&lt;/p&gt;
&lt;p&gt;The plot on the right broadly mirrors the sentiments on the left plot, with a heightened positivity about Dhaka and Rohingya in April 2018.&lt;/p&gt;
&lt;p&gt;We are at the very end of our analysis. We will wrap it up with network charts built from the most frequent words (nouns and adjectives) in the comments about India and Rohingya in recent times (after June 2018). From these plots we will try to make sense of the areas where the unusual spikes of negative sentiment were expressed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# keyword extraction from the comments
ud_model = udpipe_download_model(language = &amp;#39;english&amp;#39;)
ud_model = udpipe_load_model(ud_model$file_model)

net_plot = function(dataset,label_name,sg) {
  # keep only comments for the given topic label, posted after June 2018
  text = dataset %&amp;gt;% filter(label == label_name)
  text = text %&amp;gt;% filter(publishedAt_comment &amp;gt; &amp;quot;2018-06-01&amp;quot;)
  # part-of-speech tagging and lemmatization with udpipe
  text = udpipe_annotate(ud_model, x = text$textDisplay)
  text = as.data.frame(text)
  text$lemma = removeWords(text$lemma, stopwords(&amp;#39;english&amp;#39;)) 
  text$lemma = removePunctuation(text$lemma) 
  text = text %&amp;gt;% filter(lemma != &amp;quot;&amp;quot;)
  # co-occurrence of nouns and adjectives within a skipgram window
  stat = cooccurrence(x = subset(text, upos %in% c(&amp;#39;NOUN&amp;#39;,&amp;#39;ADJ&amp;#39;)), term = &amp;#39;lemma&amp;#39;,
                      group = c(&amp;quot;doc_id&amp;quot;, &amp;quot;paragraph_id&amp;quot;, &amp;quot;sentence_id&amp;quot;), skipgram = sg)
  # plot a network of the 50 strongest co-occurrences
  wordnetwork &amp;lt;- head(stat,50)
  wordnetwork &amp;lt;- graph_from_data_frame(wordnetwork)
  plot = ggraph(wordnetwork, layout = &amp;quot;fr&amp;quot;) +
    geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = &amp;#39;red&amp;#39;) +
    geom_node_text(aes(label = name), col = &amp;quot;darkgreen&amp;quot;, size = 4) +
    theme_graph(base_family = &amp;quot;Arial Narrow&amp;quot;) +
    theme(legend.position = &amp;quot;none&amp;quot;) +
    labs(title = &amp;#39;Cooccurrent Nouns and Adjectives&amp;#39;, 
         subtitle = label_name)
  return(plot)
}

# net_plot(video_comments, &amp;#39;dhaka&amp;#39;,2)
# net_plot(video_comments, &amp;#39;cricket&amp;#39;,2)
 
p1=net_plot(video_comments, &amp;#39;rohingya&amp;#39;,2)
p2=net_plot(video_comments, &amp;#39;india&amp;#39;,1)

grid.arrange(p1,p2,ncol = 2, top = &amp;#39;Word networks&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2021-03-30-text-analysis-on-youtube-videos-posted-about-bangladesh/index_files/figure-html/unnamed-chunk-28-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the word network on the left we observe that some of the most frequent phrases relate to Muslims, such as muslim criminal, muslim burmese, and so on. From this it can be assumed that the negativity around the Rohingya issue is mostly about their plight, which may have been triggered by their Muslim majority. On the other hand, the most frequent phrases around the topic of India are bangladesh border, bangladesh people, and so on, where apparently no specific indication is present from which we can explain the sudden increase in anger.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;part-05-ending&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Part 05: Ending&lt;/h1&gt;
&lt;p&gt;From our overall analysis, we have seen that videos with a high number of views tend to be about issues that evoke some sort of controversy (e.g. a prank video and videos about a brothel among the top 20 videos). Interestingly, videos that generate a higher number of likes also generate a higher share of dislikes (correlation of 0.6). And when we consider comments as a metric of viewer traction, the most commented videos tend to sit at the extremes of likes or dislikes, though the most liked videos tend to have more comments than the most disliked ones. Further, we observed that the videos posted were mostly about a few specific areas, from which we picked four topics to explore further. The number of comments follows a different trend in each topic area: videos related to &lt;em&gt;India&lt;/em&gt; had the longest traction, i.e. kept generating comments over the longest period, but overall videos seem to reach the end of their viewer traction around 350 to 400 days after they are first posted. In the end we looked at the sentiment expressed in the comments by the viewers. Overall, we saw an uneven distribution of sentiments over time, but with a recent prevalence of &lt;em&gt;anger&lt;/em&gt; about the topic India and &lt;em&gt;negativity&lt;/em&gt; about the topic Rohingya. As a possible source of the recent increase in negative sentiment around the Rohingya topic, we identified the plight of the Rohingya Muslims.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Topic Modeling and Sentiment Analysis on Tweets</title>
      <link>https://laminarinsight.com/post/twitter-topic-modeling-sentiment-analysis/</link>
      <pubDate>Thu, 10 May 2018 00:00:00 +0000</pubDate>
      
      <guid>https://laminarinsight.com/post/twitter-topic-modeling-sentiment-analysis/</guid>
      <description>

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#objective&#34;&gt;Objective&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-collection&#34;&gt;Data collection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#discussion-of-the-methodology&#34;&gt;Discussion of the methodology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#data-processing&#34;&gt;Data processing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#topic-modeling-using-lda&#34;&gt;Topic modeling using LDA&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#additional-analysis-sentiment-analysis-on-rohingya-topic&#34;&gt;Additional analysis: Sentiment analysis on Rohingya topic&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#overall-finding-and-discussion&#34;&gt;Overall finding and discussion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;Twitter is a popular source for mining social media posts. In this article I harvested tweets that mention ‘Bangladesh’, my home country, and ran two specific text analyses: topic modeling and sentiment analysis. The overall goal was to understand which topics related to Bangladesh are popular among Twitter users and to derive some understanding of the sentiments they expressed through their tweets.&lt;/p&gt;
&lt;div id=&#34;objective&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Objective&lt;/h1&gt;
&lt;p&gt;Breaking down the objective for clear analysis gives us three specific goals:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Around which topics do Twitter discussions usually revolve,&lt;/li&gt;
&lt;li&gt;What is the overall sentiment about Bangladesh conveyed by the tweets,&lt;/li&gt;
&lt;li&gt;As an extension of the previous two: a topic-wise sentiment analysis to reveal what kind of sentiment(s) are carried by the commonly discussed topics.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;data-collection&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Data collection&lt;/h1&gt;
&lt;p&gt;For this study I used the public API provided by Twitter. Using ‘Bangladesh’ as the search term, I collected a total of 20,000 random tweets in English through the API from R.&lt;/p&gt;
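&lt;p&gt;For reference, a pull of this kind can be sketched with the ‘rtweet’ package. This is only an illustrative sketch, not the exact call used for the original collection (which is not shown in this article), and it assumes valid Twitter API credentials are already configured:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical sketch: fetch English tweets mentioning &amp;#39;Bangladesh&amp;#39;
library(rtweet)
tweets_raw = search_tweets(&amp;#39;Bangladesh lang:en&amp;#39;, n = 20000)
saveRDS(tweets_raw$text, &amp;#39;bd_tweets_20k.Rds&amp;#39;)&lt;/code&gt;&lt;/pre&gt;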
&lt;/div&gt;
&lt;div id=&#34;discussion-of-the-methodology&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Discussion of the methodology&lt;/h1&gt;
&lt;p&gt;To achieve this objective I applied the Latent Dirichlet Allocation (LDA) model from the topicmodels package in R. LDA is an unsupervised machine learning algorithm, first presented as a topic discovery model by David Blei, Andrew Ng, and Michael I. Jordan in 2003.
LDA considers a document as a mixture of topics, so each word in the document is considered part of one or more topics, and LDA clusters the words under their respective topics. As a statistical model, LDA provides the probability of each word belonging to a topic and, in turn, the probability of each topic belonging to each document.&lt;br /&gt;
To run LDA, we have to pick the number of topics, since LDA breaks down the documents and words based on this number; in this study I will try a few different numbers of topics. Choosing the total number of topics is a balance between granularity and generalization: more topics can provide granularity but may become difficult to separate into clearly segregated topics, while fewer topics can be overly general and may combine different topics into one.
Later, for sentiment analysis, a lexicon-based approach was followed. The lexicon used was the NRC Emotion Lexicon (EmoLex), a crowd-sourced lexicon created by Dr. Saif Mohammad, senior research officer at the National Research Council, Canada. The NRC lexicon assigns words to 8 prototypical emotions, namely trust, surprise, sadness, joy, fear, disgust, anticipation, and anger, and to two sentiments: positive and negative. The NRC lexicon was accessed through the ‘tidytext’ package in R.&lt;/p&gt;
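&lt;p&gt;As a small illustration of the lexicon itself, the NRC word-emotion mappings can be inspected directly through ‘tidytext’ (a sketch; recent versions of tidytext prompt to download the lexicon via the ‘textdata’ package on first use):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidytext)
library(dplyr)
nrc = get_sentiments(&amp;#39;nrc&amp;#39;)  # word-to-emotion/sentiment mapping
head(nrc)                     # one row per (word, sentiment) pair
nrc %&amp;gt;% count(sentiment)    # the 8 emotions plus positive/negative&lt;/code&gt;&lt;/pre&gt;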
&lt;/div&gt;
&lt;div id=&#34;data-processing&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Data processing&lt;/h1&gt;
&lt;p&gt;I have already harvested the tweets and saved their text in the file ‘bd_tweets_20k.Rds’, so I will skip the code that fetches the tweets. Instead I will start by reading the saved file and then show the data cleaning and processing step by step.&lt;/p&gt;
&lt;p&gt;Reading text file of the tweets:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tweets = readRDS(&amp;#39;../../source_files/bd_tweets_20k.Rds&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before going into the data cleaning step, a couple of things should be clarified:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It’s very important to maintain a logical order when executing the cleaning commands; otherwise some information can be lost unintentionally. For example, if we convert all the tweets to lower case and later apply a ‘gsub’ command to remove retweets by the ‘rt’ pattern, we may lose parts of words that contain ‘rt’. Retweets are marked as RT at the beginning of a tweet, but once everything is converted to lower case with the ‘tolower’ function, the program can no longer distinguish the ‘rt’ of a retweet from an ‘rt’ inside a word. Say there’s a word ‘Part’: after the transformation we would only see ‘Pa’, the ‘rt’ having been replaced by a blank.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Throughout the cleaning it’s good practice to randomly check the text file to make sure no unexpected transformation takes place. As a benchmark I will view the 1,500th tweet, picked arbitrarily, before starting the data cleaning process and again at different points during the cleaning steps.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
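&lt;p&gt;The ‘rt’ pitfall described above can be demonstrated in two lines (a toy example, not part of the actual pipeline):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x = tolower(&amp;#39;Part of this RT was retweeted&amp;#39;)
gsub(&amp;#39;rt&amp;#39;, &amp;#39;&amp;#39;, x)  # &amp;quot;pa of this  was retweeted&amp;quot; -- &amp;#39;Part&amp;#39; is damaged too&lt;/code&gt;&lt;/pre&gt;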
&lt;p&gt;Here’s our sample tweet:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;writeLines(as.character(tweets[[1500]]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Half a million Rohingya refugee children at risk in overcrowded camps in Bangladesh with cyclone and… https://t.co/jrp3yEvMJN #Bangladesh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will revisit our sample tweet at different points during the next data cleaning process.&lt;/p&gt;
&lt;p&gt;In the following section I start the data cleaning process by converting the text to ASCII format to get rid of the funny characters often used in Twitter messages. It is worth noting that these characters may carry subtle information about the sentiment of a message, but analyzing them would extend the scope of this report, so it is skipped here. It could definitely be a future research area!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Convert to basic ASCII text to avoid silly characters
tweets &amp;lt;- iconv(tweets, to = &amp;quot;ASCII&amp;quot;, sub = &amp;quot; &amp;quot;)  &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the following code section, I apply a series of commands to remove special characters, hyperlinks, usernames, tabs, punctuation, and unnecessary white space, since none of these have any bearing on the topic modeling. The specific purpose of each command is noted in the comments below.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tweets &amp;lt;- gsub(&amp;quot;(RT|via)((?:\\b\\W*@\\w+)+)&amp;quot;, &amp;quot;&amp;quot;, tweets)  # Remove the &amp;quot;RT&amp;quot; (retweet) and usernames 
tweets = gsub(&amp;quot;http.+ |http.+$&amp;quot;, &amp;quot; &amp;quot;, tweets)  # Remove html links
tweets = gsub(&amp;quot;http[[:alnum:]]*&amp;quot;, &amp;quot;&amp;quot;, tweets)
tweets = gsub(&amp;quot;[[:punct:]]&amp;quot;, &amp;quot; &amp;quot;, tweets)  # Remove punctuation
tweets = gsub(&amp;quot;[ |\t]{2,}&amp;quot;, &amp;quot; &amp;quot;, tweets)  # Remove tabs
tweets = gsub(&amp;quot;^ &amp;quot;, &amp;quot;&amp;quot;, tweets)  # Leading blanks
tweets = gsub(&amp;quot; $&amp;quot;, &amp;quot;&amp;quot;, tweets)  # Lagging blanks
tweets = gsub(&amp;quot; +&amp;quot;, &amp;quot; &amp;quot;, tweets) # General spaces &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The cleaning commands above removed HTML links, usernames, and so on. We saw that our sample tweet contained an HTML link; let’s check whether the transformation worked properly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;writeLines(as.character(tweets[[1500]]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Half a million Rohingya refugee children at risk in overcrowded camps in Bangladesh with cyclone and&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see that the punctuation and the website link have been removed from our sample tweet as intended.&lt;/p&gt;
&lt;p&gt;I will convert all the tweets to lower case, since words in R are case sensitive: for example, ‘Tweets’ and ‘tweets’ are considered two different words. Moreover, I will remove duplicate tweets, since duplicates exist among the tweets downloaded through Twitter’s public API.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tweets = tolower(tweets)
tweets = unique(tweets)
writeLines(as.character(tweets[[1500]]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## best quality underground metal detector in bangladesh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a check I extracted the 1,500th tweet again, but this time got a different one: after removing duplicates, only 5,561 of the 20,000 tweets I started with remained, so the serial numbers of the tweets have changed.&lt;/p&gt;
&lt;p&gt;As the next step of data processing I will convert this tweets file, which is a character vector, into a corpus. In linguistics, a corpus is a structured set of texts that can be used for statistical analysis, hypothesis testing, checking occurrences, and validating linguistic rules. To achieve this I will use the ‘Corpus’ and ‘VectorSource’ functions from the ‘tm’ library in R: ‘VectorSource’ interprets each element of our character vector ‘tweets’ as a document and feeds that input into ‘Corpus’, which converts it into a corpus suitable for statistical analysis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tm)
corpus &amp;lt;- Corpus(VectorSource(tweets))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I will do some more cleaning on the corpus by removing stop words and numbers, because both have very little value, if any, toward our goals of sentiment analysis and topic modeling. For clarity, a bit more on stop words before coding: stop words are extremely common words of a language that may carry very little value for a particular analysis. Here I will use the stop-word list that comes with the ‘tm’ package. Some examples from the list are: a, about, above, and so on. The exhaustive list can be found at this GitHub link: &lt;a href=&#34;https://github.com/arc12/Text-Mining-Weak-Signals/wiki/Standard-set-of-english-stopwords&#34; class=&#34;uri&#34;&gt;https://github.com/arc12/Text-Mining-Weak-Signals/wiki/Standard-set-of-english-stopwords&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;corpus &amp;lt;- tm_map(corpus, removeWords, stopwords(&amp;quot;english&amp;quot;))  
corpus &amp;lt;- tm_map(corpus, removeNumbers)
writeLines(as.character(corpus[[1500]]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## best quality underground metal detector  bangladesh&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see from our sample tweet that the stop word (‘in’) has been removed.&lt;/p&gt;
&lt;p&gt;At this step I will reduce the words in the corpus to their stems. In general terms, word stemming is the process of reducing a word to its base form, which may or may not be the morphological root of the word, and may or may not carry meaning by itself. For example, ‘fishing’ and ‘fisheries’ can be reduced to ‘fish’ by a stemming algorithm, and ‘fish’ bears a meaning; on the other hand, ‘argue’ and ‘argued’ reduce to ‘argu’, a stem with no particular meaning of its own. Stemming a document makes it easier to cluster words for analysis. In addition to the stemming, I will also delete the term ‘bangladesh’ from the tweets, since it is the search term and therefore appears in every tweet, along with ‘amp’, which is a leftover of the HTML entity for the ampersand character once its surrounding punctuation has been removed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;corpus &amp;lt;- tm_map(corpus, stemDocument)
corpus = tm_map(corpus, removeWords, c(&amp;quot;bangladesh&amp;quot;,&amp;quot;amp&amp;quot;, &amp;quot;will&amp;quot;, &amp;#39;get&amp;#39;, &amp;#39;can&amp;#39;))

writeLines(as.character(corpus[[1500]]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## best qualiti underground metal detector&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rechecking our sample tweet, we can see that ‘quality’ has been transformed into its stem form ‘qualiti’ and ‘bangladesh’ has been removed.&lt;/p&gt;
&lt;p&gt;I am finally done with the first step of data cleaning and pre-processing. Next I will start processing the data to create our topic model, but before diving into model creation I decided to create a word cloud to get a feel for the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(wordcloud)
set.seed(1234)
palet  = brewer.pal(8, &amp;#39;Dark2&amp;#39;)
wordcloud(corpus, min.freq = 50, scale = c(4, 0.2) , random.order = TRUE, col = palet)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2020-06-23-twitter-sentiment-analysis-on-bangladesh_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the resulting word cloud we can see that the words are colored differently based on the frequency with which they appear in the tweets. Looking at the two largest font sizes (black and green) we find these words: rohingya, refuge, today, india, cricket, live. Subject knowledge comes in handy for interpreting such a word cluster. Since I am from Bangladesh, I know that the influx of Rohingya refugees from Myanmar is one of the most discussed recent issues, so ‘rohingya’ and ‘refuge’ can intuitively be classified as related to the Rohingya crisis. Cricket, on the other hand, is the most popular game in Bangladesh, as in other South Asian countries, so ‘cricket’ and ‘live’ can be thought of as cricket related. ‘india’ and ‘today’ do not have a strong general association with either of the two primary topics we have sorted out; we will see how they behave in the topic modeling.&lt;/p&gt;
&lt;p&gt;As the next processing step I will convert our corpus into a Document Term Matrix (DTM). A DTM is a matrix with one column per term and one row per document, in our case per tweet; each cell holds the number of times the term appears in that document, with 0 when it does not appear at all. The resulting DTM is therefore sparse: a large matrix consisting mostly of zeros.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dtm = DocumentTermMatrix(corpus)
dtm&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;&amp;lt;DocumentTermMatrix (documents: 5561, terms: 8561)&amp;gt;&amp;gt;
## Non-/sparse entries: 44969/47562752
## Sparsity           : 100%
## Maximal term length: 30
## Weighting          : term frequency (tf)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From the summary of the ‘dtm’ object we can see that it contains 5,561 documents, which is the total number of tweets we have, and 8,561 terms, i.e. 8,561 unique words across the tweets. From the non-/sparse entries ratio and the sparsity percentage we can see that the sparsity, not exactly 100% but very close to it, is very high, meaning most words appear in only a few tweets. Next I will drop the few documents left empty after cleaning, and then inspect ‘dtm’ to get a feel for the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;doc.length = apply(dtm, 1, sum)  # total words per document
dtm = dtm[doc.length &amp;gt; 0,]  # drop documents left empty after cleaning
dtm&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;&amp;lt;DocumentTermMatrix (documents: 5552, terms: 8561)&amp;gt;&amp;gt;
## Non-/sparse entries: 44969/47485703
## Sparsity           : 100%
## Maximal term length: 30
## Weighting          : term frequency (tf)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;inspect(dtm[1:2,10:15])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;lt;&amp;lt;DocumentTermMatrix (documents: 2, terms: 6)&amp;gt;&amp;gt;
## Non-/sparse entries: 6/6
## Sparsity           : 50%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs dinesh india injur jan within year
##    1      0     0     0   0      1    1
##    2      1     1     1   1      0    0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see that out of the six terms shown, the first four are present in doc 2 and the remaining two in doc 1, and the counts of 1 and 0 are distributed across the cells accordingly. Now let’s look at some of the most frequent words in our DTM.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
freq = colSums(as.matrix(dtm))
length(freq)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 8561&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ord = order(freq, decreasing = TRUE)
freq[head(ord, n = 20)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## rohingya     news  zimbabw    india   refuge     live  cricket    today 
##      590      509      465      324      308      299      297      296 
##      tri      new     camp     seri pakistan  myanmar    match   nation 
##      266      248      245      239      238      236      221      203 
##   bangla   wicket      odi    peopl 
##      201      187      182      182&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From the list of the 20 most frequent words we can see that terms related to the Rohingya crisis and to cricket are the most frequently used, which resembles what we saw in the word cloud. We can now look at how different words are associated. Since ‘cricket’ and ‘rohingya’ are two frequent topics, we can check which words associate with them using the ‘findAssocs’ function from the ‘tm’ package. To run it we provide the benchmark term and a minimum correlation value, which can range from 0 to 1.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;findAssocs(dtm, &amp;quot;cricket&amp;quot;,0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $cricket
##     cup zimbabw   score    team 
##    0.23    0.22    0.20    0.20&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;findAssocs(dtm, &amp;#39;rohingya&amp;#39;, 0.2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $rohingya
##   refuge     camp  myanmar  repatri    crisi     hous children 
##     0.51     0.46     0.41     0.33     0.25     0.25     0.24&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;findAssocs(dtm, &amp;#39;news&amp;#39;, 0.2 )&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## $news
##   today  latest  bangla   updat januari     atn  decemb ekattor  jamuna ekushey 
##    0.77    0.77    0.76    0.68    0.56    0.53    0.34    0.33    0.28    0.26 
##    post channel 
##    0.21    0.21&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From the resulting associations we can see that ‘cricket’ is associated with the words cup, zimbabwe, score, and team. This makes sense: every word except zimbabwe is related to the game, and Zimbabwe is one of the teams against whom Bangladesh has played quite a lot of cricket matches (such insights come from subject matter knowledge!). With ‘rohingya’, on the other hand, the associated words camp, refugee, myanmar, repatriation, etc. revolve around the crisis created by the Rohingya refugees coming from Bangladesh’s neighboring country Myanmar.&lt;/p&gt;
&lt;p&gt;I will now plot the words used more than 150 times in a bar plot to see how their frequencies are distributed. Checking the frequent words both as a list and graphically has two benefits: it gives a feel for the data, and it acts as a quality check on the cleaning done in the previous steps. For example, after generating the most frequent words I found that some words such as amp, will, can, and get had not been removed, so I went back and added them to the word-removal step of the data cleaning.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot = data.frame(words = names(freq), count = freq)
library(ggplot2)
plot = subset(plot, plot$count &amp;gt; 150) # keep words with frequency above 150
str(plot)
ggplot(data = plot, aes(words, count)) + geom_bar(stat = &amp;#39;identity&amp;#39;) + ggtitle(&amp;#39;Words used more than 150 times&amp;#39;)+coord_flip()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2020-06-23-twitter-sentiment-analysis-on-bangladesh_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;topic-modeling-using-lda&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Topic modeling using LDA&lt;/h1&gt;
&lt;p&gt;I have used the ‘topicmodels’ package available in R for topic modeling.
As discussed earlier, the number of topics must be selected for an LDA model; based on it, LDA computes the probability of each topic in each document and distributes the words under each topic. Selecting more topics may result in a more granular segregation, but at the same time the differences between topics may get blurred; selecting very few topics can merge distinct topics and lose possible topics altogether. To minimize this risk I tried three different values of K, the number of topics: I used 2, 5, and 10, and created three different LDA models based on these values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(topicmodels)
#LDA model with 5 topics selected
lda_5 = LDA(dtm, k = 5, method = &amp;#39;Gibbs&amp;#39;, 
          control = list(nstart = 5, seed = list(1505,99,36,56,88), best = TRUE, 
                         thin = 500, burnin = 4000, iter = 2000))

#LDA model with 2 topics selected
lda_2 = LDA(dtm, k = 2, method = &amp;#39;Gibbs&amp;#39;, 
          control = list(nstart = 5, seed = list(1505,99,36,56,88), best = TRUE, 
                         thin = 500, burnin = 4000, iter = 2000))

#LDA model with 10 topics selected
lda_10 = LDA(dtm, k = 10, method = &amp;#39;Gibbs&amp;#39;, 
          control = list(nstart = 5, seed = list(1505,99,36,56,88), best = TRUE, 
                         thin = 500, burnin = 4000, iter = 2000))&lt;/code&gt;&lt;/pre&gt;
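&lt;p&gt;Besides the top terms, the document-wise topic probabilities mentioned below can be pulled from a fitted model with the ‘posterior’ function of topicmodels; a quick sketch using the lda_5 model fitted above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Per-document topic distribution from the fitted 5-topic model
topic_probs = posterior(lda_5)$topics  # matrix: documents x 5 topics
head(round(topic_probs, 2))
main_topic = apply(topic_probs, 1, which.max)  # dominant topic per tweet
table(main_topic)  # how many tweets fall under each topic&lt;/code&gt;&lt;/pre&gt;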
&lt;p&gt;An LDA model produces a good deal of information, but the two most useful pieces for this analysis are the most frequent words under each topic and the document-wise probability of each topic. First of all I will fetch the top 10 terms in each topic:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Top 10 terms or words under each topic
top10terms_5 = as.matrix(terms(lda_5,10))
top10terms_2 = as.matrix(terms(lda_2,10))
top10terms_10 = as.matrix(terms(lda_10,10))

top10terms_5&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Topic 1   Topic 2    Topic 3   Topic 4    Topic 5 
##  [1,] &amp;quot;news&amp;quot;    &amp;quot;rohingya&amp;quot; &amp;quot;zimbabw&amp;quot; &amp;quot;india&amp;quot;    &amp;quot;new&amp;quot;   
##  [2,] &amp;quot;live&amp;quot;    &amp;quot;refuge&amp;quot;   &amp;quot;cricket&amp;quot; &amp;quot;pakistan&amp;quot; &amp;quot;one&amp;quot;   
##  [3,] &amp;quot;today&amp;quot;   &amp;quot;camp&amp;quot;     &amp;quot;tri&amp;quot;     &amp;quot;peopl&amp;quot;    &amp;quot;year&amp;quot;  
##  [4,] &amp;quot;bangla&amp;quot;  &amp;quot;myanmar&amp;quot;  &amp;quot;seri&amp;quot;    &amp;quot;countri&amp;quot;  &amp;quot;day&amp;quot;   
##  [5,] &amp;quot;dhaka&amp;quot;   &amp;quot;girl&amp;quot;     &amp;quot;match&amp;quot;   &amp;quot;like&amp;quot;     &amp;quot;see&amp;quot;   
##  [6,] &amp;quot;januari&amp;quot; &amp;quot;children&amp;quot; &amp;quot;nation&amp;quot;  &amp;quot;time&amp;quot;     &amp;quot;just&amp;quot;  
##  [7,] &amp;quot;now&amp;quot;     &amp;quot;muslim&amp;quot;   &amp;quot;wicket&amp;quot;  &amp;quot;indian&amp;quot;   &amp;quot;work&amp;quot;  
##  [8,] &amp;quot;latest&amp;quot;  &amp;quot;say&amp;quot;      &amp;quot;odi&amp;quot;     &amp;quot;take&amp;quot;     &amp;quot;week&amp;quot;  
##  [9,] &amp;quot;updat&amp;quot;   &amp;quot;sex&amp;quot;      &amp;quot;banvzim&amp;quot; &amp;quot;don&amp;quot;      &amp;quot;follow&amp;quot;
## [10,] &amp;quot;love&amp;quot;    &amp;quot;million&amp;quot;  &amp;quot;world&amp;quot;   &amp;quot;nepal&amp;quot;    &amp;quot;last&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top10terms_2&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Topic 1    Topic 2   
##  [1,] &amp;quot;rohingya&amp;quot; &amp;quot;news&amp;quot;    
##  [2,] &amp;quot;refuge&amp;quot;   &amp;quot;zimbabw&amp;quot; 
##  [3,] &amp;quot;camp&amp;quot;     &amp;quot;india&amp;quot;   
##  [4,] &amp;quot;myanmar&amp;quot;  &amp;quot;live&amp;quot;    
##  [5,] &amp;quot;peopl&amp;quot;    &amp;quot;cricket&amp;quot; 
##  [6,] &amp;quot;girl&amp;quot;     &amp;quot;today&amp;quot;   
##  [7,] &amp;quot;countri&amp;quot;  &amp;quot;tri&amp;quot;     
##  [8,] &amp;quot;children&amp;quot; &amp;quot;new&amp;quot;     
##  [9,] &amp;quot;like&amp;quot;     &amp;quot;seri&amp;quot;    
## [10,] &amp;quot;one&amp;quot;      &amp;quot;pakistan&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top10terms_10&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       Topic 1       Topic 2           Topic 3    Topic 4   Topic 5   Topic 6
##  [1,] &amp;quot;time&amp;quot;        &amp;quot;now&amp;quot;             &amp;quot;india&amp;quot;    &amp;quot;cricket&amp;quot; &amp;quot;zimbabw&amp;quot; &amp;quot;girl&amp;quot; 
##  [2,] &amp;quot;dhaka&amp;quot;       &amp;quot;one&amp;quot;             &amp;quot;pakistan&amp;quot; &amp;quot;world&amp;quot;   &amp;quot;tri&amp;quot;     &amp;quot;sex&amp;quot;  
##  [3,] &amp;quot;bangladeshi&amp;quot; &amp;quot;report&amp;quot;          &amp;quot;like&amp;quot;     &amp;quot;team&amp;quot;    &amp;quot;seri&amp;quot;    &amp;quot;love&amp;quot; 
##  [4,] &amp;quot;make&amp;quot;        &amp;quot;work&amp;quot;            &amp;quot;nepal&amp;quot;    &amp;quot;run&amp;quot;     &amp;quot;match&amp;quot;   &amp;quot;women&amp;quot;
##  [5,] &amp;quot;two&amp;quot;         &amp;quot;watch&amp;quot;           &amp;quot;hindus&amp;quot;   &amp;quot;canada&amp;quot;  &amp;quot;nation&amp;quot;  &amp;quot;nude&amp;quot; 
##  [6,] &amp;quot;islam&amp;quot;       &amp;quot;just&amp;quot;            &amp;quot;south&amp;quot;    &amp;quot;start&amp;quot;   &amp;quot;wicket&amp;quot;  &amp;quot;video&amp;quot;
##  [7,] &amp;quot;high&amp;quot;        &amp;quot;right&amp;quot;           &amp;quot;want&amp;quot;     &amp;quot;day&amp;quot;     &amp;quot;odi&amp;quot;     &amp;quot;kill&amp;quot; 
##  [8,] &amp;quot;state&amp;quot;       &amp;quot;indian&amp;quot;          &amp;quot;think&amp;quot;    &amp;quot;cup&amp;quot;     &amp;quot;first&amp;quot;   &amp;quot;fuck&amp;quot; 
##  [9,] &amp;quot;also&amp;quot;        &amp;quot;mishalhusainbbc&amp;quot; &amp;quot;back&amp;quot;     &amp;quot;score&amp;quot;   &amp;quot;banvzim&amp;quot; &amp;quot;porn&amp;quot; 
## [10,] &amp;quot;much&amp;quot;        &amp;quot;visit&amp;quot;           &amp;quot;take&amp;quot;     &amp;quot;play&amp;quot;    &amp;quot;win&amp;quot;     &amp;quot;dhaka&amp;quot;
##       Topic 7   Topic 8    Topic 9 Topic 10   
##  [1,] &amp;quot;news&amp;quot;    &amp;quot;rohingya&amp;quot; &amp;quot;peopl&amp;quot; &amp;quot;new&amp;quot;      
##  [2,] &amp;quot;today&amp;quot;   &amp;quot;refuge&amp;quot;   &amp;quot;year&amp;quot;  &amp;quot;countri&amp;quot;  
##  [3,] &amp;quot;live&amp;quot;    &amp;quot;camp&amp;quot;     &amp;quot;help&amp;quot;  &amp;quot;see&amp;quot;      
##  [4,] &amp;quot;bangla&amp;quot;  &amp;quot;myanmar&amp;quot;  &amp;quot;need&amp;quot;  &amp;quot;week&amp;quot;     
##  [5,] &amp;quot;januari&amp;quot; &amp;quot;children&amp;quot; &amp;quot;home&amp;quot;  &amp;quot;follow&amp;quot;   
##  [6,] &amp;quot;latest&amp;quot;  &amp;quot;muslim&amp;quot;   &amp;quot;give&amp;quot;  &amp;quot;last&amp;quot;     
##  [7,] &amp;quot;updat&amp;quot;   &amp;quot;say&amp;quot;      &amp;quot;look&amp;quot;  &amp;quot;england&amp;quot;  
##  [8,] &amp;quot;sri&amp;quot;     &amp;quot;million&amp;quot;  &amp;quot;babi&amp;quot;  &amp;quot;australia&amp;quot;
##  [9,] &amp;quot;lanka&amp;quot;   &amp;quot;support&amp;quot;  &amp;quot;sinc&amp;quot;  &amp;quot;bts&amp;quot;      
## [10,] &amp;quot;post&amp;quot;    &amp;quot;repatri&amp;quot;  &amp;quot;even&amp;quot;  &amp;quot;set&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All three models picked up the topics of cricket and the Rohingya crisis, but the models with 5 and 10 topics also surfaced additional topics alongside these two. Here we see the earlier discussion about granularity vs. generalization in action: the top words of the 10-topic model show an overall lack of coherence within each topic, and a similar observation can be made for the 5-topic model, while the 2-topic model yields two compact, coherent topics. It is also worth noticing how the 10-topic model picked up topics that were ignored by the 2- and 5-topic models, such as nudity (topic 6)!&lt;/p&gt;
&lt;p&gt;Since we can clearly see that ‘Rohingya crisis’ and ‘cricket’ are the two most common topics, I will carry these forward for further analysis.&lt;/p&gt;
&lt;p&gt;Topic assignments produced by our models:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lda.topics_5 = as.matrix(topics(lda_5))
lda.topics_2 = as.matrix(topics(lda_2))
lda.topics_10 = as.matrix(topics(lda_10))
#write.csv(lda.topics_5,file = paste(&amp;#39;LDAGibbs&amp;#39;,5,&amp;#39;DocsToTopics.csv&amp;#39;))
#write.csv(lda.topics_2,file = paste(&amp;#39;LDAGibbs&amp;#39;,2,&amp;#39;DocsToTopics.csv&amp;#39;))
#write.csv(lda.topics_10,file = paste(&amp;#39;LDAGibbs&amp;#39;,10,&amp;#39;DocsToTopics.csv&amp;#39;))

summary(as.factor(lda.topics_5[,1]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    1    2    3    4    5 
## 1208 1293 1230 1003  818&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(as.factor(lda.topics_2[,1]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    1    2 
## 3151 2401&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(as.factor(lda.topics_10[,1]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   1   2   3   4   5   6   7   8   9  10 
## 755 659 607 577 708 546 385 593 364 358&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also get the document-wise probability of each topic. I have saved one output file for each of the three models for future use. Probability of each topic:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;topicprob_5 = as.matrix(lda_5@gamma)
topicprob_2 = as.matrix(lda_2@gamma)
topicprob_10 = as.matrix(lda_10@gamma)

#write.csv(topicprob_5, file = paste(&amp;#39;LDAGibbs&amp;#39;, 5, &amp;#39;DoctToTopicProb.csv&amp;#39;))
#write.csv(topicprob_2, file = paste(&amp;#39;LDAGibbs&amp;#39;, 2, &amp;#39;DoctToTopicProb.csv&amp;#39;))
#write.csv(topicprob_10, file = paste(&amp;#39;LDAGibbs&amp;#39;, 10, &amp;#39;DoctToTopicProb.csv&amp;#39;))

head(topicprob_2,1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##           [,1]      [,2]
## [1,] 0.5409836 0.4590164&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a sample, we can see how, according to the model with 2 topics, document 1 has a different probability of containing each topic. The highest probability (0.54) comes from topic 1, so the model classifies document 1 as representing topic 1.&lt;/p&gt;
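The classification above can be reproduced by hand from the gamma matrix: the assigned topic is simply the column index of each row's maximum probability. A minimal, self-contained sketch with a toy 3-document, 2-topic matrix (hypothetical values; in the analysis above this would be `as.matrix(lda_2@gamma)`):

```r
# Toy 3-document x 2-topic probability matrix (hypothetical values;
# in the analysis above this would be as.matrix(lda_2@gamma))
gamma = matrix(c(0.54, 0.46,
                 0.31, 0.69,
                 0.80, 0.20), nrow = 3, byrow = TRUE)

# Assigned topic = column index of each row's maximum probability,
# which is what topics() reports for a fitted LDA model
assigned = apply(gamma, 1, which.max)
assigned  # 1 2 1
```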
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidytext)
library(dplyr)
library(ggplot2)

#Tokenizing character vector file &amp;#39;tweets&amp;#39;.
token = data.frame(text=tweets, stringsAsFactors = FALSE) %&amp;gt;% unnest_tokens(word, text)

#Matching sentiment words from the &amp;#39;NRC&amp;#39; sentiment lexicon
senti = inner_join(token, get_sentiments(&amp;quot;nrc&amp;quot;)) %&amp;gt;%
  count(sentiment)
senti$percent = (senti$n/sum(senti$n))*100

#Plotting the sentiment summary 
ggplot(senti, aes(sentiment, percent)) +   
        geom_bar(aes(fill = sentiment), position = &amp;#39;dodge&amp;#39;, stat = &amp;#39;identity&amp;#39;)+ 
        ggtitle(&amp;quot;Sentiment analysis based on lexicon: &amp;#39;NRC&amp;#39;&amp;quot;)+
  coord_flip() +
        theme(legend.position = &amp;#39;none&amp;#39;, plot.title = element_text(size=18, face = &amp;#39;bold&amp;#39;),
              axis.text=element_text(size=16),
              axis.title=element_text(size=14,face=&amp;quot;bold&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2020-06-23-twitter-sentiment-analysis-on-bangladesh_files/figure-html/unnamed-chunk-20-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
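The percent column plotted above is simply each sentiment's share of all matched sentiment words. A self-contained toy sketch of the same computation (hypothetical counts, not the real NRC matches):

```r
# Hypothetical sentiment counts (not the real NRC matches from the data)
senti_toy = data.frame(sentiment = c("positive", "negative", "trust"),
                       n = c(50, 30, 20))

# Each sentiment's share of all matched sentiment words, in percent
senti_toy$percent = (senti_toy$n / sum(senti_toy$n)) * 100
senti_toy$percent  # 50 30 20
```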
&lt;div id=&#34;additional-analysis-sentiment-analysis-on-rohingya-topic&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Additional analysis: Sentiment analysis on Rohingya topic&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tm)
library(quanteda)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Package version: 2.0.1&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Parallel computing: 2 of 4 threads used.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## See https://quanteda.io for tutorials and examples.&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Attaching package: &amp;#39;quanteda&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## The following objects are masked from &amp;#39;package:tm&amp;#39;:
## 
##     as.DocumentTermMatrix, stopwords&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## The following objects are masked from &amp;#39;package:NLP&amp;#39;:
## 
##     meta, meta&amp;lt;-&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## The following object is masked from &amp;#39;package:utils&amp;#39;:
## 
##     View&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;corpus_roh = corpus(tweets)
corpus_roh = subset(corpus_roh, grepl(&amp;#39;rohingya&amp;#39;, texts(corpus_roh)))
writeLines(as.character(corpus_roh[[150]]))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## the suffering in the rohingya refugee camp in bangladesh is indescribable&lt;/code&gt;&lt;/pre&gt;
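The subsetting step above keeps only documents whose text mentions ‘rohingya’. The same `grepl()` filter can be shown in isolation on a plain character vector (hypothetical example tweets, not drawn from the collected data):

```r
# Hypothetical example tweets (not from the collected dataset)
tweets_toy = c("bangladesh win the odi series",
               "the rohingya refugee camp in bangladesh",
               "rohingya repatriation talks resume")

# Keep only the elements that mention 'rohingya'
roh_toy = tweets_toy[grepl("rohingya", tweets_toy)]
length(roh_toy)  # 2
```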
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Tokenizing character vector file &amp;#39;tweets&amp;#39;.
library(tidytext)
token_roh = tibble(text=corpus_roh)  %&amp;gt;% unnest_tokens(word, text, format = &amp;quot;text&amp;quot;)
 
#Matching sentiment words from the &amp;#39;NRC&amp;#39; sentiment lexicon
library(dplyr)
senti_roh = inner_join(token_roh, get_sentiments(&amp;quot;nrc&amp;quot;)) %&amp;gt;%
  count(sentiment)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Joining, by = &amp;quot;word&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;senti_roh$percent = (senti_roh$n/sum(senti_roh$n))*100

#Plotting the sentiment summary 
library(ggplot2)
ggplot(senti_roh, aes(sentiment, percent)) +   
        geom_bar(aes(fill = sentiment), position = &amp;#39;dodge&amp;#39;, stat = &amp;#39;identity&amp;#39;)+ 
        ggtitle(&amp;quot;Sentiment analysis based on lexicon: &amp;#39;NRC&amp;#39;&amp;quot;)+
  coord_flip() +
        theme(legend.position = &amp;#39;none&amp;#39;, plot.title = element_text(size=18, face = &amp;#39;bold&amp;#39;),
              axis.text=element_text(size=16),
              axis.title=element_text(size=14,face=&amp;quot;bold&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://laminarinsight.com/post/2020-06-23-twitter-sentiment-analysis-on-bangladesh_files/figure-html/unnamed-chunk-21-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;overall-finding-and-discussion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Overall finding and discussion&lt;/h1&gt;
&lt;p&gt;From our random walk through the tweets related to Bangladesh, we have seen that ‘cricket’ and ‘Rohingya’ were the two areas people cared about most. Overall, people exuded a positive sentiment along with emotions of trust and anticipation. But in the case of the Rohingya crisis, we saw mixed sentiment: both the positive and negative sentiments were high, and the emotions also appeared mixed. People felt sorry for the Rohingya people, but they also expressed fear. So what could that mean? Were people sympathetic to the plight of the Rohingya while also carrying some share of fear related to the issue? This study doesn’t allow us to draw any such conclusion, but it gives us an idea of what we may achieve by taking such a walk. Maybe we need to take Bangladesh for a walk on online social media sites more often, to get a clearer picture of what people say about Bangladesh and what may need to improve to leave a better digital footprint for the country.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
