Beyond n-grams, tf-idf, and word indicators for text: Leveraging the Python API for vector embeddings

William Buchanan

2021

This talk will share strategies that Stata users can use to obtain more informative word, sentence, and document vector embeddings of the text in their data. Indicator and bag-of-words approaches can be useful for some types of text analytics, but they do not capture the semantic relationships between words that give language its meaning and structure. Vector space embeddings attempt to preserve these relationships and, in doing so, can provide more robust numerical representations of text data for subsequent analysis. I will show how to use existing tools from the Python ecosystem with Stata to bring recent advances in NLP into your Stata workflow.
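
The contrast between indicator representations and embeddings can be illustrated with a minimal sketch. This is not the code from the talk: the three-dimensional word vectors below are hypothetical values chosen by hand for illustration, and the helper names (`cosine`, `indicator_vector`, `mean_embedding`) are assumptions. In practice the vectors would come from a pretrained model in the Python ecosystem.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-picked for illustration only.
# Real embeddings come from a trained model and have far more dimensions.
TOY_VECTORS = {
    "doctor": [0.90, 0.10, 0.00],
    "physician": [0.85, 0.15, 0.05],
    "banana": [0.00, 0.20, 0.90],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def indicator_vector(tokens, vocab):
    """Word-indicator representation: 1.0 if the vocabulary word occurs, else 0.0."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocab]


def mean_embedding(tokens, vectors):
    """Document embedding as the average of the known word vectors."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a known vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


vocab = sorted(TOY_VECTORS)
doc_a, doc_b, doc_c = ["doctor"], ["physician"], ["banana"]

# Indicators treat the two synonyms as unrelated: no shared word, similarity 0.
bow_ab = cosine(indicator_vector(doc_a, vocab), indicator_vector(doc_b, vocab))

# Embeddings place the synonyms close together and the unrelated word far away.
emb_ab = cosine(mean_embedding(doc_a, TOY_VECTORS), mean_embedding(doc_b, TOY_VECTORS))
emb_ac = cosine(mean_embedding(doc_a, TOY_VECTORS), mean_embedding(doc_c, TOY_VECTORS))

print(f"indicator similarity doctor/physician: {bow_ab:.3f}")
print(f"embedding similarity doctor/physician: {emb_ab:.3f}")
print(f"embedding similarity doctor/banana:    {emb_ac:.3f}")
```

Averaging word vectors is only one simple way to build a document embedding and is used here as an assumption. Within Stata 16 or later, code of this kind would typically run inside a `python:` block, with the `sfi` module moving data between Stata and Python.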

Keywords:

Leverage (statistics)
Structure (mathematical logic)
Python (programming language)
Natural language processing
Workflow
Sentence
tf–idf
Computer science
Meaning (linguistics)
Word (computer architecture)
Artificial intelligence
