On-Demand Tutorial
Basic Text Analysis in R
In this part of the R workshop series, learn how to turn messy, unstructured text into an insightful, high-level overview using basic text analysis skills in R. You’ll walk through a hands-on example from start to finish, learning how to explore and clean text data, how to conduct a few types of text-based analysis, and how to interpret and evaluate the results along the way.
Follow along by downloading the HTML file from our GitHub page.
1. Outline project
As with most data analysis projects, start by defining the scope of the project and the question it will address, then organize and examine the variables you will be working with.
In this workshop, you’ll examine a small subset of review text from McAuley & Leskovec’s (2013) Amazon Fine Food Reviews dataset, available on the Stanford Network Analysis Project website (see SNAP website for details). The example shows how basic text analysis can provide a high-level overview of real-world text data and offer insight into it.
2. Explore data and learn about text cleaning steps
Walk through key exploratory steps, such as computing average review length and average rating, to better understand the text and build necessary context.
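As a rough illustration of these exploratory steps, the sketch below computes review length and rating summaries in base R. The toy data frame and its column names (`score`, `text`) are assumptions made for this example, not the workshop's actual data or code.

```r
# Toy stand-in for the review data; the column names are assumptions.
reviews <- data.frame(
  score = c(5, 1, 4, 2, 5),
  text = c(
    "Great coffee, rich flavor and fast shipping.",
    "Stale chips. Would not buy again.",
    "Good tea but the box arrived dented.",
    "Too sweet for me.",
    "My dog loves these treats!"
  ),
  stringsAsFactors = FALSE
)

# Review length in words: split on runs of white space and count the pieces.
reviews$n_words <- lengths(strsplit(trimws(reviews$text), "\\s+"))

mean(reviews$n_words)   # average review length in words
mean(reviews$score)     # average rating
table(reviews$score)    # number of reviews at each rating

# Average length by rating, to see whether low and high scores differ in length.
aggregate(n_words ~ score, data = reviews, FUN = mean)
```

Checking length by rating early is useful context: if low-scoring reviews are much longer or shorter than high-scoring ones, raw word counts from the two groups are not directly comparable.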
Next, learn the basics of text cleaning: why it can be helpful and when you might consider using particular cleaning steps. Throughout the analysis, you’ll also learn how to use sensitivity analyses to test how each text cleaning step affects results.
In this tutorial, we implement the following text cleaning steps:
- Converting all of our review text to lower case
- Removing extra white space and numbers
- Replacing contractions
- Removing punctuation
- Lemmatizing words down to their base forms
- Tokenizing so that each word is on its own line
- Removing stop words that don’t add semantic value
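The steps above can be sketched in base R as follows. This is a minimal illustration, not the workshop's own code: the contraction, lemma, and stop-word lookups are hypothetical and deliberately tiny, whereas real projects typically rely on fuller lists from dedicated packages. Lemmatization is applied after tokenizing here only because the toy lookup works word by word.

```r
reviews <- c(
  "I don't like these 2 bags of chips!!",
  "The   cookies were amazing, and they're gone already."
)

# Hypothetical lookup tables, kept tiny for illustration only.
contractions <- c("don't" = "do not", "they're" = "they are")
lemmas <- c(bags = "bag", chips = "chip", cookies = "cookie",
            were = "be", are = "be")
stop_words <- c("i", "the", "of", "and", "these", "they", "do", "be")

clean_tokens <- function(x) {
  x <- tolower(x)                               # lower case
  x <- gsub("[0-9]+", " ", x)                   # remove numbers
  x <- gsub("\\s+", " ", trimws(x))             # collapse extra white space
  for (k in names(contractions)) {              # replace contractions
    x <- gsub(k, contractions[[k]], x, fixed = TRUE)
  }
  x <- gsub("[[:punct:]]+", " ", x)             # remove punctuation
  tokens <- strsplit(trimws(x), "\\s+")[[1]]    # tokenize: one word per element
  hit <- tokens %in% names(lemmas)              # lemmatize via the toy lookup
  tokens[hit] <- unname(lemmas[tokens[hit]])
  tokens[!tokens %in% stop_words]               # remove stop words
}

# One character vector of cleaned tokens per review.
lapply(reviews, clean_tokens)
```

Note that the toy stop-word list keeps "not". Whether to drop negations is a judgment call, since removing them can change what a phrase such as "do not like" appears to say.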
3. Conduct basic text analysis
Finally, complete a set of basic text analyses to gain insights. First, get an overview by examining the most frequent terms and bigrams (two-word sequences) in the example review dataset and how these differ between high- and low-scoring reviews. Next, learn how to apply a popular topic modeling technique to get an overview of common topics across the dataset, and use sentiment analysis to examine how positive or negative review language tends to be. Across all analyses, develop a critical eye for choosing the best methods for different use cases: for example, learn how text cleaning can impact results using sensitivity analyses, and consider the tradeoffs between multiple sentiment analysis options.
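To make the frequency and sentiment steps concrete, here is a small base R sketch under stated assumptions: the toy reviews are already cleaned, the column names are invented, the cutoff of 4 for a "high" score is arbitrary, and the four-word sentiment lexicon is hypothetical. Topic modeling is left out because it depends on a dedicated modeling package.

```r
# Toy, already-cleaned reviews; `score` and `text` are assumed column names.
reviews <- data.frame(
  score = c(5, 5, 1, 2),
  text = c("great coffee great price",
           "love this green tea",
           "stale chips bad taste",
           "bad coffee bad price"),
  stringsAsFactors = FALSE
)
# Assumed cutoff: scores of 4 or 5 count as "high".
reviews$group <- ifelse(reviews$score >= 4, "high", "low")

tokens_by_review <- strsplit(reviews$text, "\\s+")

# Adjacent word pairs within a single review.
bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

# Counts sorted from most to least frequent.
top_counts <- function(items) sort(table(items), decreasing = TRUE)

top_counts(unlist(tokens_by_review))                   # frequent terms
top_counts(unlist(lapply(tokens_by_review, bigrams)))  # frequent bigrams

# The same term counts, split into high- and low-scoring reviews.
lapply(split(tokens_by_review, reviews$group),
       function(tk) top_counts(unlist(tk)))

# Lexicon-based sentiment: sum the scores of the words found in the lexicon.
# Hypothetical lexicon; published lexicons are far larger.
lexicon <- c(great = 1, love = 1, bad = -1, stale = -1)
sentiment <- function(tokens) sum(lexicon[tokens], na.rm = TRUE)

reviews$sentiment <- vapply(tokens_by_review, sentiment, numeric(1))
aggregate(sentiment ~ group, data = reviews, FUN = mean)
```

A word-counting approach like this ignores word order and context, which is one reason to compare several sentiment options rather than rely on a single score.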
Pro tips
A few tricks can make your next text analysis even stronger:
- Interpret judiciously: Be clear about what the results of your text analyses can and cannot tell you. On their own, the text analyses discussed here typically provide observational results and focus solely on the language used. For example, more negative linguistic sentiment does not by itself indicate a more negative experience, because essential context, such as sarcasm, is often not captured by text analyses. When interpreting results, also consider how the methods you are using were developed and how they work.
- Select text cleaning and analysis methods carefully: There are many different text cleaning and analysis methods out there, and it is worth considering the pros and cons of each depending on your use case. If you’re not sure, compare multiple options and leverage sensitivity analyses to show how results differ depending on text cleaning and analysis methods. Throughout the process, be transparent about which approaches you’re using and why.
- Expand your toolbox with additional resources: This workshop introduces text analysis basics, but there are many more options worth exploring to find the best fit for your project. Additional resources are included for those interested in diving deeper.
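The sensitivity analyses suggested in the tips above can be sketched as a simple comparison: run the same analysis under several cleaning choices and look at how the results change. The example below is a minimal base R illustration with invented reviews and two hypothetical stop-word lists, one that drops the negation "not" and one that keeps it.

```r
reviews <- c("the coffee is not good", "the tea is good", "not the best coffee")

# Hypothetical stop-word lists that differ only in how they treat "not".
stop_with_not    <- c("the", "is", "not")
stop_without_not <- c("the", "is")

# Most frequent terms after removing a given set of stop words.
top_terms <- function(texts, stop_words, n = 3) {
  tokens <- unlist(strsplit(tolower(texts), "\\s+"))
  tokens <- tokens[!tokens %in% stop_words]
  head(sort(table(tokens), decreasing = TRUE), n)
}

# Each variant is one cleaning choice to compare.
variants <- list(
  no_removal = character(0),
  drop_not   = stop_with_not,
  keep_not   = stop_without_not
)

# Top terms under each variant, side by side.
results <- lapply(variants, function(sw) names(top_terms(reviews, sw)))
results
```

If the top terms or the conclusions you would draw shift noticeably between variants, that is worth reporting alongside the cleaning choice you settle on.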
Conclusion
With the tools discussed in this workshop, you should have the background you need to approach messy text data with confidence and gain a useful overview of your dataset in your next big project.
In other R workshop series videos, learn about 3 things you didn’t know you could do in R, interactive data visualizations in R, and how to create a data dashboard in R using Quarto.
