Text Analysis of LAX’s Yelp Reviews


The purpose of this post is to exercise Python skills in web scraping and text analysis.

1. Getting the data.

Using Python’s BeautifulSoup library, we can parse a page’s HTML layout (labeled as the ‘soup’ below) and extract the data we want.
import bs4 as bs
import urllib.request
url = "https://www.yelp.com/biz/los-angeles-international-airport-lax-los-angeles-2"
source = urllib.request.urlopen(url).read()
# Create the Soup
soup = bs.BeautifulSoup(source,'lxml')

Looking at the soup, we find the review data in the JSON-LD ‘script’ tag, so we extract it and parse the JSON into a dict.

import json
data = json.loads(soup.find('script', type='application/ld+json').text)

We now convert the dict into a pandas DataFrame, taking only the columns of interest, and take a look at the data.

import pandas as pd
yelp_columns = ['reviewRating', 'datePublished', 'description','author']
rating_list = pd.DataFrame(data['review'], columns = yelp_columns)
rating_list.head(5)
reviewRating datePublished description author
0 {'ratingValue': 4} 2017-11-09 Better than average across the USA. Not as goo… Edith H.
1 {'ratingValue': 1} 2017-11-10 Congratulations LAX! You have successfully be… Steve S.
2 {'ratingValue': 3} 2017-11-07 Very BUSY, crowded airport! \n\nGot dropped o… Yolanda R.
3 {'ratingValue': 1} 2017-10-31 I love LA but I hate its Airport. It’s probabl… Cris I.
4 {'ratingValue': 3} 2017-11-12 As anyone that has traveled through LAX knows … Tisch H.

Below is a function to round the review count to the nearest 20, the number of reviews listed on each Yelp page.

def myround(x, base=20):
    # Rounds to the *nearest* multiple of base; note this can round down,
    # dropping a final partial page (math.ceil(x / base) * base would keep it)
    return int(base * round(float(x) / base))

Round the total review count, from which we get how many pages of reviews there will be.

page_count = myround(data['aggregateRating']['reviewCount'], 20)
print(page_count/20)
249.0

Make a list of pages to loop over.

pages = list(range(20, page_count, 20))
#print(pages)

Loop through all pages of reviews, collecting each page’s reviews as we go and concatenating them with the original rating_list to create one data frame of reviews.

frames = [rating_list]
for i in pages :
    url_loop = url + "?start=%d" % (i)
    source_loop = urllib.request.urlopen(url_loop).read()
    soup_loop = bs.BeautifulSoup(source_loop,'lxml')
    data_loop = json.loads(soup_loop.find('script', type='application/ld+json').text)
    #print("On page %s there are %s reviews" % (i, len(data_loop['review'])))
    frames.append(pd.DataFrame(data_loop['review'], columns = yelp_columns))

# DataFrame.append was removed in pandas 2.0; pd.concat is the portable way
rating_list = pd.concat(frames, ignore_index=True)
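Note that this loop fires a couple hundred requests at Yelp in quick succession. A more polite variant (and one less likely to get blocked) adds a pause and a browser-like User-Agent header. Here is a minimal sketch; the header string and one-second delay are illustrative assumptions, not anything Yelp documents:

import time
import urllib.request

def fetch(page_url, delay=1.0):
    # Some sites reject urllib's default User-Agent; mimic a browser (illustrative value)
    req = urllib.request.Request(page_url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib.request.urlopen(req).read()
    time.sleep(delay)  # be polite: pause between successive requests
    return html

Swapping fetch(url_loop) in for the urlopen call leaves the rest of the loop unchanged.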

Get the shape and take a look at the data again.

print(rating_list.shape)
rating_list.head(5)
(4915, 4)
reviewRating datePublished description author
0 {'ratingValue': 4} 2017-11-09 Better than average across the USA. Not as goo… Edith H.
1 {'ratingValue': 1} 2017-11-10 Congratulations LAX! You have successfully be… Steve S.
2 {'ratingValue': 3} 2017-11-07 Very BUSY, crowded airport! \n\nGot dropped o… Yolanda R.
3 {'ratingValue': 1} 2017-10-31 I love LA but I hate its Airport. It’s probabl… Cris I.
4 {'ratingValue': 3} 2017-11-12 As anyone that has traveled through LAX knows … Tisch H.

2. Cleaning and visualizing the data.

There are 4,915 reviews available for analysis. Note that the numerical ratings in ‘reviewRating’ are wrapped in dicts; once stringified, the digit can be pulled out with a regex. Here we clean up and check the data.

import re
rating_list.reviewRating = rating_list.reviewRating.astype(str)
rating_list.reviewRating = rating_list.reviewRating.str.extract(r'(\d+)', expand=False).astype(int)
rating_list.head(5)
reviewRating datePublished description author
0 4 2017-11-09 Better than average across the USA. Not as goo… Edith H.
1 1 2017-11-10 Congratulations LAX! You have successfully be… Steve S.
2 3 2017-11-07 Very BUSY, crowded airport! \n\nGot dropped o… Yolanda R.
3 1 2017-10-31 I love LA but I hate its Airport. It’s probabl… Cris I.
4 3 2017-11-12 As anyone that has traveled through LAX knows … Tisch H.
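As an aside, since ‘reviewRating’ starts life as a dict, the numeric value could also be pulled out directly, skipping the string round-trip entirely. A sketch, assuming the column still holds the raw dicts:

# Works only before the astype(str) conversion above
rating_list['reviewRating'] = rating_list['reviewRating'].apply(lambda d: d['ratingValue'])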

Checking the distribution of ratings, we see LAX is not popular among passengers.

import matplotlib.pyplot as plt
import seaborn as sns

rating_list['reviewRating'].plot.hist()
plt.show()

Here we add a column for the number of characters in each review.

rating_list['text_length'] = rating_list['description'].apply(len)
rating_list.head(5)
reviewRating datePublished description author text_length
0 4 2017-11-09 Better than average across the USA. Not as goo… Edith H. 515
1 1 2017-11-10 Congratulations LAX! You have successfully be… Steve S. 797
2 3 2017-11-07 Very BUSY, crowded airport! \n\nGot dropped o… Yolanda R. 2524
3 1 2017-10-31 I love LA but I hate its Airport. It’s probabl… Cris I. 823
4 3 2017-11-12 As anyone that has traveled through LAX knows … Tisch H. 396

Create histograms of review text length, one per numerical rating.

g = sns.FacetGrid(data=rating_list, col='reviewRating')
g.map(plt.hist, 'text_length')
plt.show()

Now the same data as before, but in a box-plot format.

sns.boxplot(x='reviewRating', y='text_length', data=rating_list)
plt.show()

3. Looking at Subject Frequency in Reviews.

Which are the most talked about airlines?

Now we define a function, word_in_text(), which tells us whether the first argument (a word) occurs within the second argument (a text).

def word_in_text(word, text):
    # Case-insensitive: True if word occurs anywhere in text
    word = word.lower()
    text = text.lower()
    return re.search(word, text) is not None

# Initialize counters for the number of reviews mentioning each airline
delta = american = southwest = united = virgin = alaska = lufthansa = british = 0

# Iterate through the data frame, counting the number of reviews
# in which each airline is mentioned
for index, row in rating_list.iterrows():
    delta += word_in_text('delta', row['description'])
    american += word_in_text('american', row['description'])
    southwest += word_in_text('southwest', row['description'])
    united += word_in_text('united', row['description'])
    virgin += word_in_text('virgin', row['description'])
    alaska += word_in_text('alaska', row['description'])
    lufthansa += word_in_text('lufthansa', row['description'])
    british += word_in_text('british', row['description'])
# Set seaborn style
sns.set(color_codes=True)

# Create a list of labels, cd
cd = ['delta', 'american', 'southwest', 'united', 'virgin', 'alaska', 'lufthansa', 'british']

# Plot bar chart of mention counts
ax = sns.barplot(x=cd, y=[delta, american, southwest, united, virgin, alaska, lufthansa, british])
ax.set(ylabel="count", xlabel="Airline Mentions")
plt.show()

We see that users write more about domestic carriers such as United and Southwest than about international airlines.
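Two caveats about word_in_text(): a bare re.search also matches inside longer words (so ‘united’ would match ‘reunited’), and iterrows() is slow on larger frames. A sketch of a vectorized, whole-word alternative using pandas string methods; the counts may differ slightly from the plot above because of the word boundaries:

# Count reviews mentioning each airline, whole words only, case-insensitive
airline_counts = {
    a: rating_list['description']
         .str.contains(r'\b%s\b' % a, case=False, regex=True)
         .sum()
    for a in cd
}
print(airline_counts)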

Now we look at customer touch points within the airport.

# Initialize counters for the number of reviews mentioning each touch point
food = drinks = parking = restrooms = staff = security = 0

# Iterate through the data frame, counting the number of reviews
# in which each touch point is mentioned
for index, row in rating_list.iterrows():
    food += word_in_text('food', row['description'])
    drinks += word_in_text('drinks', row['description'])
    parking += word_in_text('parking', row['description'])
    restrooms += word_in_text('restrooms', row['description'])
    staff += word_in_text('staff', row['description'])
    security += word_in_text('security', row['description'])
# Set seaborn style
sns.set(color_codes=True)

# Create a list of labels, cd2
cd2 = ['food', 'drinks', 'parking', 'restrooms', 'staff', 'security']

# Plot bar chart of mention counts
ax = sns.barplot(x=cd2, y=[food, drinks, parking, restrooms, staff, security])
ax.set(ylabel="count", xlabel="Customer Touch Points")
plt.show()

Security is a major point in users’ reviews; food comes in second. Looking at the data, there is a divide between users who complain about too few food options and users who find the current options adequate.
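One quick way to eyeball that divide is the rating distribution among reviews that mention food; a minimal sketch:

# Ratings of reviews mentioning food, compared with the overall average
food_mask = rating_list['description'].str.contains('food', case=False)
print(rating_list.loc[food_mask, 'reviewRating'].value_counts().sort_index())
print(rating_list.loc[food_mask, 'reviewRating'].mean(), "vs overall",
      rating_list['reviewRating'].mean())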

4. Word clouds

Let’s take a more visual look at the words users associate with LAX.

from wordcloud import WordCloud, STOPWORDS
# join reviews to a single string
words = ' '.join(rating_list['description'])

wordcloud = WordCloud(
                      stopwords=STOPWORDS,
                      background_color='white',
                      width=2000,
                      height=1500
                     ).generate(words)

plt.imshow(wordcloud)
plt.axis('off')
plt.show()

“Airport” and “LAX” don’t add much. Let’s make a new word cloud with additional stop words.

stopwords = set(STOPWORDS)
stopwords.add("airport")
stopwords.add("lax")
stopwords.add("terminal")

wordcloud2 = WordCloud(
                      stopwords=stopwords,
                      background_color='white',
                      width=2000,
                      height=1500
                     ).generate(words)
plt.imshow(wordcloud2)
plt.axis('off')
plt.show()

Now, we’ll make a word cloud using only one-star reviews.

one_star = rating_list[rating_list['reviewRating'] == 1]
one_star_words = ' '.join(one_star['description'])

one_star_wordcloud = WordCloud(
                      stopwords=stopwords,
                      background_color='black',
                      width=2000,
                      height=1500
                     ).generate(one_star_words)

plt.imshow(one_star_wordcloud)
plt.axis('off')
plt.show()

…and now a word cloud for five-star reviews.

five_star = rating_list[rating_list['reviewRating'] == 5]
five_star_words = ' '.join(five_star['description'])

five_star_wordcloud = WordCloud(
                      stopwords=stopwords,
                      background_color='white',
                      width=2000,
                      height=1500
                     ).generate(five_star_words)

plt.imshow(five_star_wordcloud)
plt.axis('off')
plt.show()

5. Bar charts for word count

Bar charts display more information than word clouds.

# Imports for tokenizing and normalizing the review text
import collections
import nltk
from nltk import word_tokenize
from collections import Counter
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords  # note: this shadows the word-cloud stop-word set above

# One-time NLTK data downloads, if not already installed:
# nltk.download('punkt'); nltk.download('wordnet'); nltk.download('stopwords')

Bar chart for words in all reviews.

tokens = word_tokenize(words)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
english_stops = set(stopwords.words('english'))
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 20 most common tokens
print(bow.most_common(20))

# Turn the tokens into an NLTK Text object (used for concordance queries below)
text = nltk.Text(tokens)

# Bar chart of the 20 most frequent words
plt.barh(range(len(bow.most_common(20))),
         [val[1] for val in bow.most_common(20)], align='center')
plt.yticks(range(len(bow.most_common(20))), [val[0] for val in bow.most_common(20)])
plt.show()
[('airport', 7953), ('lax', 5170), ('terminal', 3990), ('get', 3153), ('time', 2670), ('flight', 2615), ('one', 2274), ('like', 2191), ('people', 1934), ('line', 1906), ('security', 1847), ('place', 1596), ('go', 1569), ('traffic', 1499), ('food', 1282), ('always', 1219), ('long', 1216), ('would', 1215), ('international', 1212), ('really', 1160)]
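Unsurprisingly, ‘airport’, ‘lax’, and ‘terminal’ dominate the counts, which is exactly why they were added as stop words for the word clouds. They can be dropped from the counter too; a sketch reusing the same ad-hoc stop list:

# Drop the domain words that appear in nearly every review
domain_stops = {'airport', 'lax', 'terminal'}
bow_trimmed = Counter({w: c for w, c in bow.items() if w not in domain_stops})
print(bow_trimmed.most_common(10))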

What are Yelp users saying about ‘people’?

text.concordance("people")
Displaying 25 of 1926 matches:
d a little stern or aggressive , as people were in line taking off their shoes
e stressful at times with difficult people . The lady in front of me was quite
not that clean , but again A LOT of people traffic walking in/out of the whole
o makes sense . Worst thing is some people traveling do n't pick up or clean u
ing the process longer to encourage people to shell some dough and sign up for
sed by PDX , and fixed here . These people will work very hard to help you out
ible they are here . If the airline people at all of the airports were as awes
the airports were as awesome as the people here , I guarantee you everyone wou
AX , as for the volume of flights & people who go through here , the airport i
 this airport really moves a lot of people within the airport facility . My Ye
rowds do get bothersome , plenty of people have knowingly banged their luggage
ecause flying is never fun and some people are running late or are so caught u
tive , which would have banned LGBT people from serving in California public s
s fairly quick for the thousands of people that are probably on it . Traffic c
verall , it does its job of getting people from point A to point B , but it do
like back in Houston , not too many people flying given summer is almost over 
n getting so much hate from so many people . I guess there can be more food op
... you remember ... all the cots , people getting IV and last rights ? Yeah .
like well , LA has about 11 million people living in it , so it 's sort of to 
, so it 's sort of to be expected . People in general are always frazzled at a
 basic function well . Move lots of people in and out pretty smoothly . Like t
cked like sardines , off-boarding . People sitting on the floor ? ? ? ! Never 
und and sweep up and CLEAN UP AFTER PEOPLE . People ... messy adult people ...
weep up and CLEAN UP AFTER PEOPLE . People ... messy adult people ... wtf and 
TER PEOPLE . People ... messy adult people ... wtf and smh . The only place I
Which words appear in contexts similar to “tsa” and “security”?

text.similar("tsa")
security lax it airport terminal the there that traffic and this
parking time customs gate here terminals people food you

text.similar("security")
lax it the tsa terminal that traffic there and this you here customs
airport time all but me la i

And the contexts shared by a pair of words:

text.common_contexts(["food", "options"])
food_has airport_and the_here the_in more_to of_to of_and only_is of_i
of_in good_for more_and the_for decent_in the_you few_and fast_i
of_available of_the

text.common_contexts(["security", "bad"])
the_check the_and and_the the_it of_experience from_to are_and bad_is
bad_bad as_the

One-star word count.

tokens_one_star = word_tokenize(one_star_words)

# Convert the tokens into lowercase: lower_tokens
lower_tokens_one = [t.lower() for t in tokens_one_star]

# Retain alphabetic words: alpha_only
alpha_only_one = [t for t in lower_tokens_one if t.isalpha()]

# Remove all stop words: no_stops
english_stops = set(stopwords.words('english'))
no_stops_one = [t for t in alpha_only_one if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized_one = [wordnet_lemmatizer.lemmatize(t) for t in no_stops_one]

# Create the bag-of-words: bow_one
bow_one = Counter(lemmatized_one)

# Print the 20 most common tokens
print(bow_one.most_common(20))

# bar chart, most frequent words
plt.barh(range(len(bow_one.most_common(20))),[val[1] for val in bow_one.most_common(20)], align='center')
plt.yticks(range(len(bow_one.most_common(20))), [val[0] for val in bow_one.most_common(20)])
plt.show()
[('airport', 2115), ('lax', 1342), ('terminal', 1043), ('get', 855), ('flight', 709), ('one', 695), ('line', 589), ('time', 580), ('like', 549), ('people', 523), ('place', 479), ('security', 478), ('go', 477), ('hour', 394), ('worst', 377), ('even', 362), ('gate', 357), ('would', 347), ('international', 323), ('take', 323)]

Five-star word count.

tokens_five_star = word_tokenize(five_star_words)

# Convert the tokens into lowercase: lower_tokens
lower_tokens_five = [t.lower() for t in tokens_five_star]

# Retain alphabetic words: alpha_only
alpha_only_five = [t for t in lower_tokens_five if t.isalpha()]

# Remove all stop words: no_stops
english_stops = set(stopwords.words('english'))
no_stops_five = [t for t in alpha_only_five if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized_five = [wordnet_lemmatizer.lemmatize(t) for t in no_stops_five]

# Create the bag-of-words: bow_five
bow_five = Counter(lemmatized_five)

# Print the 20 most common tokens
print(bow_five.most_common(20))

# bar chart, most frequent words
plt.barh(range(len(bow_five.most_common(20))),[val[1] for val in bow_five.most_common(20)], align='center')
plt.yticks(range(len(bow_five.most_common(20))), [val[0] for val in bow_five.most_common(20)])
plt.show()
[('airport', 249), ('lax', 168), ('time', 102), ('get', 101), ('people', 90), ('terminal', 86), ('love', 84), ('like', 83), ('flight', 74), ('always', 71), ('one', 68), ('great', 60), ('place', 58), ('tsa', 58), ('security', 56), ('line', 52), ('lot', 44), ('know', 44), ('really', 44), ('international', 44)]

6. N-Gram Frequency Distribution

Now we’ll compute frequency distributions for all the bigrams and trigrams in the text.

The plot_freqdist_freq function comes from https://martinapugliese.github.io/plotting-the-actual-frequencies-in-a-FreqDist-in-nltk/

def plot_freqdist_freq(fd,
                       max_num=None,
                       cumulative=False,
                       title='Frequency plot',
                       linewidth=2):
    """
    As of NLTK version 3.2.1, FreqDist.plot() plots the counts 
    and has no kwarg for normalising to frequency. 
    Work this around here.
    
    INPUT:
        - the FreqDist object
        - max_num: if specified, only plot up to this number of items 
          (they are already sorted descending by the FreqDist)
        - cumulative: bool (defaults to False)
        - title: the title to give the plot
        - linewidth: the width of line to use (defaults to 2)
    OUTPUT: plot the freq and return None.
    """

    tmp = fd.copy()
    norm = fd.N()
    for key in tmp.keys():
        tmp[key] = float(fd[key]) / norm

    if max_num:
        tmp.plot(max_num, cumulative=cumulative,
                 title=title, linewidth=linewidth)
    else:
        tmp.plot(cumulative=cumulative, 
                 title=title, 
                 linewidth=linewidth)

    return

The bigram frequency distribution is constructed as follows.

bgs = nltk.bigrams(lemmatized)
fdist = nltk.FreqDist(bgs)
plot_freqdist_freq(fdist, max_num=15)

Trigram:

tgs = nltk.trigrams(lemmatized)
fdist_tri = nltk.FreqDist(tgs)
plot_freqdist_freq(fdist_tri, max_num=15)

Here is the frequency plot of bigrams for five-star reviews. Reviewers who favor the airport find it easy to navigate; many are first-time travelers through LAX.

bgs_five = nltk.bigrams(lemmatized_five)
fdist_five = nltk.FreqDist(bgs_five)
plot_freqdist_freq(fdist_five, max_num=15)
The trigram for five-star reviews:

tgs_five = nltk.trigrams(lemmatized_five)
fdist_five_tri = nltk.FreqDist(tgs_five)
plot_freqdist_freq(fdist_five_tri, max_num=15)
Now for one-star reviews:

bgs_one = nltk.bigrams(lemmatized_one)
fdist_one = nltk.FreqDist(bgs_one)
plot_freqdist_freq(fdist_one, max_num=15)
Trigram for one-star reviews.

tgs_one = nltk.trigrams(lemmatized_one)
fdist_tri_one = nltk.FreqDist(tgs_one)
plot_freqdist_freq(fdist_tri_one, max_num=15)

7. Number of reviews given per month.

This could serve as a rough popularity measure.
We see an explosion in review volume from mid-to-late 2010 onward.

rating_list['datePublished'] = pd.to_datetime(rating_list['datePublished'])
#rating_list.set_index('datePublished').resample('M').count()
rating_list.set_index('datePublished').resample('M').count().plot(kind='line',legend=False)
plt.show()
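A small aside on the resample: .count() counts non-null values per column, which is why legend=False is needed to hide several identical lines. .size() returns a single reviews-per-month series directly:

# One line instead of one per column ('M' = month-end; newer pandas prefers 'ME')
rating_list.set_index('datePublished').resample('M').size().plot(kind='line')
plt.show()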

If you wish to save the reviews to a .txt file for further analysis, for example with Voyant Tools, use the code below.

# Write all review text to disk (UTF-8 handles the non-ASCII characters in reviews)
with open("words.txt", "w", encoding="utf-8") as f:
    f.write(words)

Conclusion

Overall, LAX is not well liked by travelers, and the biggest complaint is long security lines. Airport administrators should focus first on the Tom Bradley International Terminal, optimizing security protocols and offering a more diverse set of dining options, especially during the holiday season.
