click here to download program
TWEET.exe for Semantic Maps (Co-Word Analysis) of Tweets
· Hellsten, Iina, & Leydesdorff, Loet (2019, forthcoming). Automated Analysis of Topic-Actor Networks on Twitter: New approach to the analysis of socio-semantic networks, Journal of the Association for Information Science and Technology. Preprint at https://arxiv.org/abs/1711.08387.
· Hellsten, I., Jacobs, S., & Wonneberger, A. (2019). Active and passive stakeholders in issue arenas: A communication network approach to the bird flu debate on Twitter. Public Relations Review, 45(1), 35-48; doi.org/10.1016/j.pubrev.2018.12.009
TWEET.exe generates a word-document occurrence matrix, a word co-occurrence matrix, and (optionally) a normalized co-occurrence matrix from a set of lines (tweets) and a word list. The output files can be read into standard software (such as SPSS, UCInet, or Pajek) for statistical analysis and visualization.
Note: TWEET.exe is derived and adapted from TI.exe, which can be used to generate semantic co-word maps. A version of this latter program adapted for the Korean character set is available at http://www.leydesdorff.net/krkwic . A version for Chinese is available at http://www.leydesdorff.net/software/Chinese/index.htm. For languages using the Latin alphabet, one is advised to use tweet.exe and not ti.exe, and analogously FrqTwt.exe and not FrqList.exe.
input files
The program needs two inputs: (a) a file “words.txt” that contains the words (as variables) in ASCII format, and (b) a file “text.txt” in which each line provides a textual unit of analysis (e.g., a tweet). The number of lines is unlimited, but each line can contain at most 4000 characters. Each line has to end with a hard carriage return (CR + LF). Save the file as plain text (DOS) with CR/LF in Word or in an ASCII editor such as Notepad.
The number of words (variables) is limited to 1024, but keep in mind that most programs (e.g., Excel) will not allow you to handle more than 256 variables in the follow-up. The words have to be on separate lines, each ended with a hard carriage return and line feed. (Save in Word as plain text (DOS) with CR/LF or use an ASCII editor (Notepad) for saving the file.)
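The input requirements above can be checked before running the program. The following Python sketch is not part of TWEET.exe; the function names are hypothetical, and it assumes that non-ASCII characters may be replaced by "?" when the files are written.

```python
MAX_LINE_CHARS = 4000  # per-line limit stated above
MAX_WORDS = 1024       # word-list limit stated above


def write_crlf(path, lines):
    """Write one item per line, each terminated by CR + LF."""
    # newline="" disables newline translation, so "\r\n" is written as is.
    # Assumption: non-ASCII characters are replaced by "?" (lossy).
    with open(path, "w", encoding="ascii", errors="replace", newline="") as handle:
        for line in lines:
            handle.write(line + "\r\n")


def check_inputs(tweets, words):
    """Return a list of problems found; the list is empty if none are found."""
    problems = []
    if len(words) > MAX_WORDS:
        problems.append(f"{len(words)} words exceeds the limit of {MAX_WORDS}")
    for number, tweet in enumerate(tweets, start=1):
        if len(tweet) > MAX_LINE_CHARS:
            problems.append(f"line {number} has {len(tweet)} characters")
        if "\r" in tweet or "\n" in tweet:
            problems.append(f"line {number} contains an embedded line break")
    return problems
```

If check_inputs returns an empty list, write_crlf("text.txt", tweets) and write_crlf("words.txt", words) produce files in the layout described above.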
• One can build a word frequency list with Frqtwt.Exe. This program reads <text.txt> and allows for the specification of a stopword list in <stopword.txt>. The results are provided as uppercase in the file <wrdfrq.txt>, and in a number of files for separate types of data: hashtag.dbf (words preceded with #), atsign.dbf (@), word.dbf, a_sand.dbf (&). (.dbf-files can be used in Excel; there are also .txt files.) These types are combined in wrdfrq.dbf in several columns. The various files facilitate the construction of a file words.txt needed hereafter for tweet.exe .
• Stopword.txt contains 429 stopwords (available at http://www.lextek.com/manuals/onix/stopwords1.html). Both lists (the list of words and the list of stopwords) have to be available in the same folder as frqtwt.exe. The program checks the words in their current form (that is, without corrections for the plural). If stopword.txt is available, these words will not be included.
· Tweet.exe runs in a DOS-type Command Box under Windows. The program and the input files (text.txt and words.txt) have to be placed in the same folder. The output files are written into this directory as well. Please note that existing files from a previous run are overwritten by the program. Save output elsewhere if you wish to continue with the materials.
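For readers without access to FrqTwt.exe, the frequency-list step described above can be approximated as follows. This is a rough sketch and not the program's actual algorithm: splitting on whitespace and stripping punctuation are assumptions, the function name is hypothetical, and the separate category for "&" is omitted.

```python
from collections import Counter


def word_frequencies(tweets, stopwords=()):
    """Count uppercase tokens per type: hashtags, @-mentions, and plain words."""
    stop = {word.upper() for word in stopwords}
    counts = {"hashtag": Counter(), "atsign": Counter(), "word": Counter()}
    for tweet in tweets:
        for token in tweet.upper().split():
            # Assumption: surrounding punctuation is not part of the word.
            token = token.strip(".,;:!?\"'()")
            if not token or token in stop:
                continue
            if token.startswith("#"):
                kind = "hashtag"
            elif token.startswith("@"):
                kind = "atsign"
            else:
                kind = "word"
            counts[kind][token] += 1
    return counts
```

Calling counts["word"].most_common(50), for example, gives candidate entries for words.txt. Words are compared in their current form, so plurals are counted separately, as in the description above.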
output files
The program produces three output files. Matrix.txt can be read into Excel and/or SPSS for further processing. Two files with the extension “.dat” are in DL-format (ASCII) and can be read into Pajek or UCInet for network analysis and visualization. (Pajek is freely available at http://mrvar.fdv.uni-lj.si/pajek/ ).
a. matrix.txt contains an occurrence matrix of the words in the texts. The words are also the variable names in the SPSS syntax file labels.sps. One can read matrix.txt into SPSS using the text wizard and run labels.sps thereafter.
The matrix is asymmetrical: it contains the words as the variables and the tweets as the cases. In other words, each row represents a tweet in the sequential order of the text numbering, and each column represents a word in the sequential order of the word list. (One may wish to sort the word list alphabetically before the analysis.) The words are counted as frequencies with +1 for each occurrence.
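The layout of this matrix can be illustrated with a short Python sketch. It assumes case-insensitive matching of whole whitespace-separated tokens, which may differ from how TWEET.exe matches words; the function name is hypothetical.

```python
def occurrence_matrix(tweets, words):
    """Rows are tweets (cases), columns are words (variables), cells are counts."""
    index = {word.upper(): column for column, word in enumerate(words)}
    matrix = []
    for tweet in tweets:
        row = [0] * len(words)
        for token in tweet.upper().split():
            column = index.get(token)
            if column is not None:
                row[column] += 1  # +1 for each occurrence
        matrix.append(row)
    return matrix
```

A tweet that contains none of the listed words yields a row of zeros.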
b. coocc.dat contains a co-occurrence matrix of the words from this same data. This matrix is symmetrical and it contains the words both as variables and as row labels. The main diagonal is set to zero. The number of co-occurrences of two words is equal to the product of their occurrences within each text, summed over the texts. (The procedure is similar to the routine “affiliations” in UCInet, but the main diagonal is set to zero here.) The file coocc.dat contains this information in the DL-format that can be read by Pajek or UCInet.
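As an illustration, the co-occurrence values can be derived from an occurrence matrix (rows are tweets, columns are words) as sketched below in Python. The sketch assumes that the products of occurrences are summed over the texts, as in an affiliations-type routine; the function name is hypothetical.

```python
def cooccurrence_matrix(matrix):
    """Symmetrical word-by-word matrix with the main diagonal set to zero."""
    n_words = len(matrix[0]) if matrix else 0
    coocc = [[0] * n_words for _ in range(n_words)]
    for row in matrix:
        for i in range(n_words):
            if row[i] == 0:
                continue
            for j in range(i + 1, n_words):
                # Product of the two words' occurrences in this text.
                product = row[i] * row[j]
                coocc[i][j] += product
                coocc[j][i] += product
    return coocc
```

Because the inner loop starts at i + 1, the diagonal cells are never written and remain zero.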
c. Optionally: cosine.dat contains a cosine-normalized co-occurrence matrix of the words in the same data. Normalization is based on the cosine between the variables conceptualized as vectors (Salton & McGill, 1983). (The procedure is similar to using the file matrix.txt as input to the routine Proximity in SPSS.) The file cosine.dat contains this information in the Pajek-format. The size of the nodes is equal to the logarithm of the occurrences of the respective word; this feature can be turned on in Pajek. Tweet.exe can be stopped after coocc.dbf and coocc.dat have been written if one does not need the cosines.
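The cosine between two words, each treated as a column vector of the occurrence matrix, can be computed as in the following Python sketch. Setting the diagonal to zero mirrors coocc.dat and is an assumption here; the function name is hypothetical.

```python
import math


def cosine_matrix(matrix):
    """Cosine between each pair of word columns of an occurrence matrix."""
    n_words = len(matrix[0]) if matrix else 0
    columns = [[row[j] for row in matrix] for j in range(n_words)]
    norms = [math.sqrt(sum(value * value for value in col)) for col in columns]
    cosines = [[0.0] * n_words for _ in range(n_words)]
    for i in range(n_words):
        for j in range(n_words):
            # Assumption: the diagonal and words that never occur stay at zero.
            if i == j or norms[i] == 0 or norms[j] == 0:
                continue
            dot = sum(a * b for a, b in zip(columns[i], columns[j]))
            cosines[i][j] = dot / (norms[i] * norms[j])
    return cosines
```

Unlike the raw co-occurrence counts, these values are bounded between 0 and 1 for non-negative counts, so frequent and infrequent words become comparable.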
More advanced options
After running the routines, the program asks whether one additionally wishes to run the same routines with observed/expected values. (Note that answering “y” (yes) doubles the processing time of the original routine; therefore, the default is “n”.) For the purpose of these normalizations, the routine always generates the file words.dbf containing the following information:
• A variable named “Residual” which provides the standardized residuals to the chi-square for each of the variables; these are defined for word $i$ as $Z_i = (\mathrm{Observed}_{ij} - \mathrm{Expected}_{ij}) / \sqrt{\mathrm{Expected}_{ij}}$. This value can be used for testing the significance of individual words in the set if the expected value is larger than five;
• A variable named “Obs_Exp” which provides the sum of |Observed – Expected| for the word as a variable summed over the column;
• A variable named “ObsExp” which provides the Obs/Exp ratios for the word as variable summed over the column;
• A variable named “TfIdf” (that is, Term Frequency * Inverse Document Frequency) defined as follows: $\text{Tf-Idf} = \mathrm{FREQ}_{ik} \cdot \log_2 (n / \mathrm{DOCFREQ}_k)$. This function assigns a high degree of importance to terms occurring in only a few documents in the collection (Salton & McGill, 1983, p. 63);
• The word frequency within the set.
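The measures listed above can be illustrated at the cell level with the Python sketch below. Two assumptions are made: expected values follow the usual chi-square convention (row total × column total / grand total), and the sketch does not reproduce how the program aggregates cell values into the per-word variables of words.dbf. The function names are hypothetical.

```python
import math


def expected_values(matrix):
    """Expected cell values: row total * column total / grand total (assumed)."""
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    grand = sum(row_totals)
    if grand == 0:
        return [[0.0 for _ in col_totals] for _ in row_totals]
    return [[r * c / grand for c in col_totals] for r in row_totals]


def standardized_residuals(matrix):
    """(Observed - Expected) / sqrt(Expected) per cell; 0.0 where Expected is 0."""
    expected = expected_values(matrix)
    return [
        [(o - e) / math.sqrt(e) if e > 0 else 0.0 for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(matrix, expected)
    ]


def tfidf(matrix):
    """FREQ_ik * log2(n / DOCFREQ_k), with n the number of texts (rows)."""
    n_docs = len(matrix)
    doc_freq = [sum(1 for value in col if value > 0) for col in zip(*matrix)]
    return [
        [v * math.log2(n_docs / doc_freq[k]) if doc_freq[k] else 0.0
         for k, v in enumerate(row)]
        for row in matrix
    ]
```

A word that occurs in every text receives a Tf-Idf of zero, since log2(n / n) = 0, which reflects the emphasis on terms occurring in only a few documents.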
The additional routine generates the following matrices: obsexp.dbf (analogous to matrix.dbf), obsexp.txt (analogous to matrix.txt), and coocc_oe.dat and cos_oe.dat, analogously to the above input files for Pajek, but now containing or operating on the observed/expected values instead of the observed ones counted. The SPSS syntax file labels.sps is not changed.
Similarly, one can use (with the same variable labels) the file tfidf.dbf which contains the Tf-Idf values. The expected values are stored in expected.dbf. Obs_exp.dbf contains the signed (!) difference between observed and expected values at the cell level. (These are the (non-standardized) residuals of the chi-square.)
The corresponding Pajek files can be generated by replacing the matrix values in cos_oe.dat with, for example, the cosine values of TfIdf.dbf. (Cosine values can be generated in SPSS under Analyze > Correlate > Proximity.) Or one can replace the non-normalized values directly in coocc_oe.dat. Note that the number of cases can be different when using the latter routine (of obs/exp) because rows with no values other than zero are removed in order to prevent divisions by zero in the computation.
I am grateful to Iina Hellsten for the collaboration, the ideas, and inspiration.
References
Salton, G., & McGill, M. J. (1983). Introduction to Modern Information Retrieval. Auckland, etc.: McGraw-Hill.