Job Description Analysis


Recently, I was contacted for an analyst position with a large healthcare company. To get a feel for the position before the first interview, I started creeping the LinkedIn profiles of individuals who currently hold or previously held this title. After reading a few profiles, I noticed the diversity and seeming randomness of the role. So I embarked on a small text analysis project.

I took the job details of about thirty individuals from LinkedIn, hoping to surface a few buzzwords I could research prior to the interview to achieve a proper level of competence.

Getting the data

I began by copying and pasting job details from LinkedIn into a text document for analysis. Now, I know copy/paste isn't the preferred way a data scientist wannabe should go about collecting data, but thanks to LinkedIn's privacy protections, it's pretty difficult to scrape.

As we’ve done before on this site, we will load the text file as a corpus. (Note we’re working in the same directory the .txt file is found in.)

from nltk.corpus import PlaintextCorpusReader

corpus_root = 'healthcare_job'
job = PlaintextCorpusReader(corpus_root, '.*')

# Import Counter for the bag-of-words
from collections import Counter

# Tokenize the article: tokens
tokens = job.words('Untitled.txt')

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Import the lemmatizer and stop word list
# (import nltk itself too -- we'll need it later for bigrams, trigrams, and Text)
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
english_stops = set(stopwords.words('english'))
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))
[('data', 47), ('analysis', 31), ('reporting', 17), ('healthcare', 17), ('business', 16), ('medical', 16), ('trend', 16), ('develop', 15), ('analytics', 14), ('project', 13)]

Above are the most common words found in these individuals' LinkedIn profiles, along with their frequencies.

Now we’ll plot it (because data viz…).

import matplotlib.pyplot as plt

top20 = bow.most_common(20)
plt.barh(range(len(top20)), [val[1] for val in top20], align='center')
plt.yticks(range(len(top20)), [val[0] for val in top20])
plt.show()

N-gram Frequency

Using the plot_freqdist_freq function defined in a previous post, we'll look at the bigrams.
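If you don't have that helper handy, here's a minimal sketch of what it might look like. The name and signature match the earlier post, but the body is my reconstruction, not the original:

```python
import matplotlib.pyplot as plt

def plot_freqdist_freq(fdist, max_num=None, title=None):
    """Plot the most common items of a FreqDist as a horizontal bar chart.

    Works for word frequencies and for n-gram frequencies, where the
    keys are tuples of words.
    """
    pairs = fdist.most_common(max_num)
    # Join tuple keys (bigrams/trigrams) into readable labels
    labels = [' '.join(k) if isinstance(k, tuple) else k for k, _ in pairs]
    counts = [c for _, c in pairs]
    plt.barh(range(len(pairs)), counts, align='center')
    plt.yticks(range(len(pairs)), labels)
    plt.gca().invert_yaxis()  # most frequent item at the top
    if title:
        plt.title(title)
    plt.show()
```

Since nltk.FreqDist is a subclass of collections.Counter, most_common works the same way for both.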

We see 'star rating' is most popular, followed by 'west region' and 'healthcare economics' (which may or may not be the title of the job we're researching...).

bgs = nltk.bigrams(lemmatized)
fdist = nltk.FreqDist(bgs)
plot_freqdist_freq(fdist, max_num=15)

But what is “star rating”?

If we pass the tokens to the nltk.Text function, the text becomes an NLTK object we can search for keywords. Below we see 'star rating' in the context of the job details. Google eventually helped me decipher this program's purpose.

text = nltk.Text(tokens)   # turn the tokens into an NLTK Text object
text.concordance('star')   # show each match in context (case-insensitive)
Displaying 7 of 7 matches:
m a subject matter expert in the CMS Star Rating system , and have a wealth of
nd have a wealth of knowledge of CMS Star Rating Technical Specifications & Gu
, developed , and maintained various star rating datasets , tools , models , a
hip . • Owned and maintained several star rating datasets and tools , many of 
mance metric dashboards for Medicare Star Ratings for Medicare Advantage and P
and reporting pertaining to Medicare Star Ratings . Emphasis on conducting " l
ected effect on the overall contract star rating and subsequent revenue . Exte

The trigrams don't reveal any additional insight. The project references all come from one individual who has presented her research across the nation. (I should get started on research ideas!!)

tgs = nltk.trigrams(lemmatized)
fdist_tri = nltk.FreqDist(tgs)
plot_freqdist_freq(fdist_tri, max_num=15)

Word Cloud

The word cloud is always a “pretty” visual. (I know this particular one isn’t as pretty as others, but it gets the job done.)

from wordcloud import WordCloud, STOPWORDS

# join the tokens into a single string
words = ' '.join(tokens)

wordcloud = WordCloud(stopwords=STOPWORDS).generate(words)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()


From the word frequency plots and the word cloud, this job definitely relates to data analysis...

Overall, SAS and SQL are important tools for analytics projects and reporting. The subjects of analysis are often trends, claims, clinical data, and Medicare data. Also, it may be a good idea, when discussing this position, to emphasize my 'love' of group projects and teamwork.
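As a quick sanity check on that takeaway, the tool names can be looked up directly in the bag of words. The counts below are made up for illustration — substitute the real bow from above:

```python
from collections import Counter

# Illustrative bag of words -- in the analysis above, use the real `bow`
bow = Counter({'sas': 12, 'sql': 11, 'excel': 9, 'tableau': 4, 'python': 2})

# The tokens were lowercased earlier, so look up lowercase tool names;
# a Counter returns 0 for any key it hasn't seen
tools = ['sas', 'sql', 'excel', 'tableau', 'python', 'r']
tool_counts = {t: bow[t] for t in tools}
print(sorted(tool_counts.items(), key=lambda kv: -kv[1]))
```

A zero count is informative too: a tool that never appears in thirty profiles probably isn't worth cramming before the interview.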


