Recently, I was contacted about an analyst position with a large healthcare company. To get a feel for the position before the first interview, I started creeping the LinkedIn profiles of individuals who currently or previously held this title. After reading over a few profiles, I noticed the diversity and seeming randomness of the position descriptions. So I embarked on a small text analysis project.
I pulled the job details of about thirty individuals from LinkedIn, aiming to identify a few buzzwords I could research prior to the interview to achieve a proper level of competence.
Getting the data
As we’ve done before on this site, we will load the text file as a corpus. (Note we’re working in the directory that contains the .txt file.)
from nltk.corpus import PlaintextCorpusReader

corpus_root = 'healthcare_job'
job = PlaintextCorpusReader(corpus_root, '.*')
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Tokenize the job details
tokens = job.words('Untitled.txt')

# Convert the tokens to lowercase
lower_tokens = [t.lower() for t in tokens]

# Retain alphabetic words only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words
english_stops = set(stopwords.words('english'))
no_stops = [t for t in alpha_only if t not in english_stops]

# Lemmatize all tokens into a new list
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words and print the 10 most common tokens
bow = Counter(lemmatized)
print(bow.most_common(10))
[('data', 47), ('analysis', 31), ('reporting', 17), ('healthcare', 17), ('business', 16), ('medical', 16), ('trend', 16), ('develop', 15), ('analytics', 14), ('project', 13)]
Above are the most common words found across the individuals’ LinkedIn profiles, with their frequencies.
Now we’ll plot it (because data viz…).
import matplotlib.pyplot as plt

# most_common returns (word, count) tuples, so split them
# into labels and counts before plotting
top_words = bow.most_common(20)
labels = [word for word, count in top_words]
counts = [count for word, count in top_words]

plt.barh(range(len(top_words)), counts, align='center')
plt.yticks(range(len(top_words)), labels)
plt.show()
Using the plot_freqdist_freq function defined in a previous post, we will look at the bigrams.
We see “star rating” is most popular, followed by “west region” and “healthcare economics” (which may or may not be the title of the job we’re researching…).
import nltk

bgs = nltk.bigrams(lemmatized)
fdist = nltk.FreqDist(bgs)
plot_freqdist_freq(fdist, max_num=15)
But what is “star rating”?
If we pass the tokens into the nltk.Text function, the text becomes an NLTK object we can search for key words. Below we see “star rating” in the context of the job details. Google eventually helped me decipher this program’s purpose.
text = nltk.Text(tokens)  # turn the tokens into an NLTK Text object
text.concordance("star")
Displaying 7 of 7 matches: m a subject matter expert in the CMS Star Rating system , and have a wealth of nd have a wealth of knowledge of CMS Star Rating Technical Specifications & Gu , developed , and maintained various star rating datasets , tools , models , a hip . • Owned and maintained several star rating datasets and tools , many of mance metric dashboards for Medicare Star Ratings for Medicare Advantage and P and reporting pertaining to Medicare Star Ratings . Emphasis on conducting " l ected effect on the overall contract star rating and subsequent revenue . Exte
The trigrams don’t reveal any additional insight. The project references all come from one individual who has presented her research across the nation. (I should get started on research ideas!)
tgs = nltk.trigrams(lemmatized)
fdist_tri = nltk.FreqDist(tgs)
plot_freqdist_freq(fdist_tri, max_num=15)
The word cloud is always a “pretty” visual. (I know this particular one isn’t as pretty as others, but it gets the job done.)
from wordcloud import WordCloud, STOPWORDS

# Join the tokens into a single string
words = ' '.join(tokens)

wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white',
                      width=2000, height=1500).generate(words)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
From the word frequency plots and the word cloud, this job definitely relates to data analysis…
Overall, SAS and SQL are important tools for analytics projects and reporting. The subjects of analysis are often trends, claims, clinical data, and Medicare data. Also, when discussing this position, it may be a good idea to emphasize my ‘love’ of group projects and teamwork.