Capta
- Inclusions: text and images relating to the word “Canaima”, drawn from Instagram over the past five years, that show its use in a day-to-day context. For the most part, “Canaima” refers to Canaima National Park, which has become a hub for picturesque tourism in Venezuela in recent years, as well as to restaurants and food trucks around the globe that have popped up as a result of Venezuelan migration. Given these characteristics, and given how Instagram’s user experience is built around text and images, it makes sense to draw the data from that platform. Since Instagram does not allow scraping (extracting) text captions across arbitrary profiles, but does allow scraping text linked by a hashtag, I limited myself to posts containing the hashtag #canaima.
- Exclusions: uses of “Canaima” outside a day-to-day context:
- Literary histories about Canaima. There is a rich line of literary work about the Canaima area, much of it written before the space was institutionalized as a national park.
- Hollywood representations. Content from movies such as Up and Point Break in which Canaima appears.
- Captaset
Hypothesis / thesis / research question
What can sentence-based topic models help us discover about social media posts about Canaima National Park?
Approach
My approach consists of reframing my work as the construction of a topic model in which topics are built from sentences rather than words. Previous work, such as that of Balikas et al. (2016), has aimed at improving LDA so that it is more sensitive to the context contained in sentences. My work joins this discussion not only by considering sentence-level context, but also by enriching the topics themselves so that they are composed of sentences (the texts of posts).
To do this, my sentence-based topic model assumes the following, in contrast to LDA:
| LDA (Blei et al. 2003) | My approach |
|---|---|
| Each topic is a distribution of words. | Each topic is a distribution of posts. |
| Each document is a mixture of corpus-wide topics. | Each document is a mixture of corpus-wide topics. |
| Each word is drawn from one of these topics. | Each post is drawn from one of these topics. |
Topics are collections of posts/sentences. In particular:
- Each topic groups posts with similar meanings, so a topic can be summarized by prototypical posts that share that meaning.
- Meanings of posts can be calculated with sentence embeddings.
- Groupings of posts can be formed by clustering them in an n-dimensional space.
- Prototypical posts of each cluster are those closest to the center of the cluster.
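The steps above can be sketched as follows. This is a minimal illustration, not the actual pipeline: it assumes post embeddings are already available (in practice they would come from a sentence encoder such as sentence-transformers), so random vectors stand in for them here, and it uses k-means from scikit-learn as one possible clustering choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def sentence_topics(embeddings, posts, n_topics=3, n_prototypes=2):
    """Cluster post embeddings in n-dimensional space; each cluster is a
    'topic', and its prototypes are the posts closest to the cluster center."""
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(embeddings)
    topics = []
    for k in range(n_topics):
        idx = np.where(km.labels_ == k)[0]
        # Distance of each member post to the cluster center.
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[k], axis=1)
        proto = idx[np.argsort(dists)[:n_prototypes]]
        topics.append([posts[i] for i in proto])
    return topics

# Stand-in embeddings: three well-separated blobs of 5 "posts" each.
# In the real pipeline these vectors would be sentence embeddings.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(loc=c, scale=0.1, size=(5, 4)) for c in (0, 1, 2)])
posts = [f"post {i}" for i in range(15)]
topics = sentence_topics(emb, posts, n_topics=3)
```

Each entry of `topics` is then a short list of prototypical posts that stands for one topic.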
Sentence embeddings
Fundamental assumption (The distributional hypothesis): “Words that occur in similar contexts have similar meaning.” (Jurafsky & Martin 2020)
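Under this hypothesis, semantic similarity between two posts can be measured as the cosine similarity of their embedding vectors. A minimal sketch, using toy vectors in place of real sentence embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors: values near 1.0
    indicate similar meaning under the distributional hypothesis; values
    near 0.0 indicate unrelated meanings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for sentence embeddings (illustrative only).
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-3.0, 1.0, 0.0])

sim_ab = cosine_similarity(a, b)  # → 1.0 (identical direction)
sim_ac = cosine_similarity(a, c)  # small: nearly unrelated directions
```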