![]() In the tidytext package, we provide functionality to tokenize by commonly used units of text like these and convert to a one-term-per-row format. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. We thus define the tidy text format as being a table with one-token-per-row. Each type of observational unit is a table.As described by Hadley Wickham ( Wickham 2014), tidy data has a specific structure: Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |