
R Project 1: Irish Beer Reviews

I have just finished the Data Science certification program provided by Harvard University and wanted to practice through an individual project. Here is my first project.

There are many breweries around the world, but I chose Ireland simply because I like Guinness and thought it would be fun to discover more about the beers produced there. You can view the R script here.

Data

I used data from Kaggle. The data contains user reviews of beers from all over the world, and I kept only the reviews of Irish beers. The data is slightly outdated, but it is good enough for practice.

What are the most popular Irish beers?

Note that because the reviews come from outside Ireland, the most-reviewed beers are Guinness Special Export Stout and Guinness Foreign Extra Stout rather than Guinness Draught or Guinness Original/Extra Stout.

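# Load the packages used throughout this post
library(tidyverse)
library(tidytext)

# The most-reviewed and highly rated Irish beers
# (mutate, unlike summarise, keeps one row per review)
popular_irish_beers <- beer_reviews %>%
  group_by(style, beer_name) %>%
  mutate(review_num = n(), avg_score = mean(score)) %>%
  filter(review_num >= 300 & avg_score >= 3.5) %>%
  ungroup()

Most popular Irish beers overview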

What are the distinctive keywords for the top 5 Irish beer styles?

I kept only the five most-reviewed Irish beer styles.

# Five most-reviewed beer styles
styles5 <- beer_reviews %>%
  group_by(style) %>%
  summarise(review_num = n()) %>%
  mutate(style = reorder(style, review_num)) %>%
  arrange(desc(review_num)) %>%
  top_n(5, review_num)

# subset data for the five most-reviewed beer styles
beer_reviews_sub <- subset(beer_reviews, style %in% styles5$style)

Next, I tidied the review text, counted the words, and used tf-idf to find the important words in each beer style.

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
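
To make the definition concrete, here is a minimal sketch in base R on made-up word counts. The style names and numbers are invented for illustration and are not taken from the beer data. It assumes the usual definitions: tf is a word's share of all the words in a style, and idf is the natural log of the number of styles divided by the number of styles that contain the word, which is also how bind_tf_idf from tidytext computes them.

```r
# Made-up word counts per style (illustrative only, not the beer data)
toy_counts <- data.frame(
  style = c("stout", "stout", "stout", "lager", "lager"),
  word  = c("coffee", "roasted", "head", "corn", "head"),
  n     = c(6, 3, 1, 4, 6)
)

# tf: share of a style's words taken up by each word
toy_counts$tf <- toy_counts$n / ave(toy_counts$n, toy_counts$style, FUN = sum)

# idf: log(number of styles / number of styles containing the word)
# Each (style, word) pair appears once, so the group size per word
# equals the number of styles that contain it
n_styles <- length(unique(toy_counts$style))
styles_with_word <- ave(toy_counts$n, toy_counts$word, FUN = length)
toy_counts$idf <- log(n_styles / styles_with_word)

# tf-idf: high for words that are frequent in one style and rare elsewhere
toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf
toy_counts[order(-toy_counts$tf_idf), ]
```

In this toy data, "head" appears in both styles, so its idf and tf-idf are zero, while "coffee" is frequent in one style only and ends up at the top.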

# Count words in the five most-reviewed beer styles,
# dropping stop words and tokens that start with a number
beer_words <- beer_reviews_sub %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word & !str_detect(word, "^[+-]?([0-9]*[.])?[0-9]+")) %>%
  count(style, word, sort = TRUE)
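
The regular expression in that filter is easy to misread, so here is a small check on a few made-up tokens. It is only a sketch: it uses grepl, the base R counterpart of str_detect, and the sample tokens are invented. Because the pattern is anchored only at the start, it flags any token that begins with a number, not just tokens that are purely numeric.

```r
# Made-up tokens for illustration
tokens <- c("stout", "12oz", "4.2", ".5", "v2")
number_pattern <- "^[+-]?([0-9]*[.])?[0-9]+"

# TRUE marks the tokens the filter would drop
starts_with_number <- grepl(number_pattern, tokens)
setNames(starts_with_number, tokens)
# stout: FALSE, 12oz: TRUE, 4.2: TRUE, .5: TRUE, v2: FALSE
```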

# Remove words that come from beer names
exceptions <- c("dark", "light", "fancy", "long", "special", "black", "crafty", "plain")
words_in_beernames <- gsub("[[:punct:]]", "", beer_reviews_sub$beer_name) %>%
  tolower() %>%
  strsplit("\\s+") %>%   # punctuation is already removed, so split on whitespace
  unlist() %>%
  unique() %>%
  c("beer", "craft", "ipa's", "smithwick's", "killian's", "killians", "reds", "o’hara", "macardle's") %>%
  subset(!. %in% exceptions) %>%
  sort()

# Find the most distinctive words for each style
beer_words_tfidf <- beer_words %>%
  filter(!word %in% words_in_beernames) %>%
  group_by(word) %>%
  mutate(word_num = sum(n)) %>%   # total count of the word across all styles
  ungroup() %>%
  bind_tf_idf(word, style, n) %>%
  filter(tf_idf > 0) %>%
  arrange(desc(tf_idf))

# Find the top 10 tf-idf words in each style
beer_words_tfidf_10 <- beer_words_tfidf %>%
  filter(word_num >= 10) %>%
  group_by(style) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>%   # at most 10 rows per style, highest first
  mutate(rank = rev(row_number())) %>%               # highest tf-idf gets the largest rank
  ungroup()

Highest tf-idf words in the five most-reviewed Irish beer styles

It is interesting that the order of “chocolate” and “coffee” is reversed between Foreign/Export Stout and Irish Dry Stout, and that English India Pale Ale and European Pale Lager have more negative keywords than the stouts.

Correlation between beer characteristics

Which characteristic affects the overall review score the most? Is it look, smell, or taste? Are any two characteristics closely correlated?

beer_reviews_scores <- beer_reviews %>%
  select(beer_id, style, look, smell, taste, feel, overall) %>%
  filter(style %in% styles5$style)

# Find the Pearson correlation between each pair of characteristics
cor(beer_reviews_scores[, c("look", "smell", "taste", "feel", "overall")], method = "pearson")
##              look     smell     taste      feel   overall
## look    1.0000000 0.4756611 0.4779151 0.5000949 0.5311033
## smell   0.4756611 1.0000000 0.7142559 0.5951106 0.6390328
## taste   0.4779151 0.7142559 1.0000000 0.7387318 0.8210998
## feel    0.5000949 0.5951106 0.7387318 1.0000000 0.7412174
## overall 0.5311033 0.6390328 0.8210998 0.7412174 1.0000000

Taste has the strongest correlation with the overall score (0.82). I plotted taste against the overall score for each beer style to check whether the linear relationship holds. It does not hold for Irish Dry Stout, while the other styles show higher correlations.
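
A per-style comparison like this can also be checked numerically. The following is a minimal sketch in base R that uses invented scores for two hypothetical styles rather than the review data. It only illustrates how the correlation between taste and overall score can differ from one style to another.

```r
# Made-up scores for two hypothetical styles (not the real review data)
toy_scores <- data.frame(
  style   = rep(c("Style A", "Style B"), each = 5),
  taste   = c(1, 2, 3, 4, 5,   1, 2, 3, 4, 5),
  overall = c(1.5, 2.5, 3.5, 4.5, 5.5,   3, 4, 3, 4, 3)
)

# Pearson correlation between taste and overall, computed separately per style
cor_by_style <- sapply(
  split(toy_scores, toy_scores$style),
  function(d) cor(d$taste, d$overall)
)
round(cor_by_style, 2)
```

Here Style A is perfectly linear (correlation 1) and Style B shows no linear relationship (correlation 0), even though the pooled data would give a single value in between.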

Some beer styles show a strong linear relationship between taste and overall score, but not all of them.

References:
Julia Silge and David Robinson, Text Mining with R
https://www.tidytextmining.com/tfidf.html

Kaylin Pavlik, Tidy Text Mining Beer Reviews
https://www.kaylinpavlik.com/tidy-text-beer/

