click here to download program
TI.exe for Co-Word Analysis
TI.exe is freely available for academic use. The program generates a word-occurrence matrix, a word co-occurrence matrix, and a normalized co-occurrence matrix from a set of lines (e.g., titles) and a word list. The output files can be read into standard software (such as SPSS, UCINET, or Pajek) for statistical analysis and visualization. A version adapted for the Korean character set is available at http://www.leydesdorff.net/krkwic . A version for Chinese is available at http://www.leydesdorff.net/software/Chinese/index.htm.
input files
The program needs two pieces of information: (a) the name of the file <words.txt> that contains the words (as variables) to be analyzed, in ASCII format, and (b) a file <text.txt> in which each line provides a textual unit (e.g., a title). The number of lines is unlimited, but each line can contain at most 4000 characters. Each line has to end with a hard carriage return (CR + LF). The number of words is limited to 1024, but keep in mind that most programs (e.g., Excel) will not allow you to handle more than 256 variables in the follow-up. The words have to be on separate lines, each ending with a hard carriage return and line feed. (Save in Word as plain text with CR/LF, or use a DOS utility such as CRLF.EXE, available on the Internet, for saving the file.)
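The input constraints above can be checked before running the program. The following is a minimal sketch in Python, not part of TI.exe; the function name check_inputs and the wording of the messages are hypothetical, and the limits are simply the ones stated in the paragraph above.

```python
MAX_WORDS = 1024        # word limit stated above
MAX_LINE_CHARS = 4000   # per-line limit stated above

def check_inputs(words_raw: bytes, text_raw: bytes):
    """Return a list of problems found in the raw bytes of the two input files."""
    problems = []
    for name, raw in (("words.txt", words_raw), ("text.txt", text_raw)):
        # Any LF left after removing CR+LF pairs is a bare (Unix-style) line end.
        bare_lf = raw.replace(b"\r\n", b"").count(b"\n")
        if bare_lf or (raw and not raw.endswith(b"\r\n")):
            problems.append(f"{name}: every line must end with CR + LF")
    if not words_raw.isascii():
        problems.append("words.txt: contains non-ASCII bytes")
    words = [w for w in words_raw.splitlines() if w.strip()]
    if len(words) > MAX_WORDS:
        problems.append(f"words.txt: {len(words)} words exceed the limit of {MAX_WORDS}")
    for number, line in enumerate(text_raw.splitlines(), start=1):
        if len(line) > MAX_LINE_CHARS:
            problems.append(f"text.txt: line {number} has {len(line)} characters")
    return problems

# Example with in-memory data; real use would read both files in binary mode.
good = check_inputs(b"map\r\nword\r\n", b"Maps of words\r\nPatent maps\r\n")
bad = check_inputs(b"map\nword\n", b"Maps of words\r\n")
```

Reading the files in binary mode (open(..., "rb")) matters here, because text mode may silently translate the line endings that are being checked.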
• If some texts are longer than 999 characters, you can use FullText.exe instead. FullText.exe can handle an unlimited number of text files of up to 64 k each.
• One can build a word frequency list with FrqList.exe. This DOS program reads <text.txt> and allows for the specification of a stopword list in <stopword.txt>. The results are written in uppercase to the file <wrdfrq.txt>.
• One can also build a word frequency list with a concordance program. For example, TextSTAT-2 is freeware and available online. Please remove hyphens and punctuation from the words in words.txt. Additionally, stopword.exe is available for correcting the word list against a given list of stopwords (e.g., stopword.txt; 429 stopwords at http://www.lextek.com/manuals/onix/stopwords1.html). Both lists (the word list and the stopword list) have to be available in the same folder. This program checks the words only in their current form (that is, without corrections for plural or for uppercase/lowercase forms).
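For readers without access to the DOS utilities, the idea of an uppercase word frequency list with stopword removal can be sketched in a few lines of Python. This is an illustration only, not a reimplementation of FrqList.exe: the function names, the treatment of hyphens as separators, and the "WORD count" output layout are all assumptions.

```python
from collections import Counter
import re

def word_frequencies(lines, stopwords=()):
    """Count uppercase word frequencies over the lines, skipping stopwords."""
    stop = {w.upper() for w in stopwords}
    counts = Counter()
    for line in lines:
        # Assumption: hyphens and punctuation act as word separators.
        for token in re.findall(r"[A-Za-z0-9]+", line.upper()):
            if token not in stop:
                counts[token] += 1
    return counts

def to_crlf_text(counts):
    """Format one 'WORD count' entry per line, each ended with CR + LF."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "".join(f"{word} {n}\r\n" for word, n in ordered)

titles = [
    "The semantic mapping of words and co-words in contexts",
    "Mapping the case of stem-cell research",
]
freq = word_frequencies(titles, stopwords=["the", "of", "and", "in"])
listing = to_crlf_text(freq)
```

The most frequent words from such a listing can then be copied, one per line, into words.txt.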
program file
The program is based on DOS-legacy software from the 1980s (Leydesdorff, 1995). It runs in an MS-DOS command box under Windows. The program and the input files have to be in the same folder. The output files are written into this folder as well. Please note that existing files from a previous run are overwritten by the program. Save the output elsewhere if you wish to continue working with it.
output files
The program produces three output files in dBase IV format. These files can be read into Excel and/or SPSS for further processing. Two files with the extension “.dat” are in DL-format (ASCII) and can be read into Pajek for visualization (Pajek is freely available at http://vlado.fmf.uni-lj.si/pub/networks/pajek/ ).
a. matrix.dbf contains an occurrence matrix of the words in the texts. This matrix is asymmetrical: it contains the words as the variables and the texts as the cases. In other words, each row represents a text in the sequential order of the text numbering, and each column represents a word in the sequential order of the word list. (It is advisable to sort the word list alphabetically before the analysis.) The words are also the variable names although truncated to ten positions. The words are counted as frequencies. (The plural “s” is removed before processing.)
The file matrix.dbf can also be imported into SPSS for further analysis. An additional file labels.sps can be read into SPSS as a script; when run, it provides the variables with the full words as variable labels. Note that SPSS reads only 255 variables correctly from a .dbf file. If one wishes to use more words, one can read matrix.txt into SPSS using the text wizard and run labels.sps thereafter.
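The structure of the occurrence matrix can be illustrated with a small Python sketch. It is not the program's own code: the function names are hypothetical, tokenization by whitespace is an assumption, and the plural rule shown (dropping a single trailing "s") is only a guess at what "the plural s is removed" means in practice.

```python
def strip_plural(token):
    """Crudely drop one trailing 's' (an assumption about the program's rule)."""
    return token[:-1] if len(token) > 1 and token.endswith("s") else token

def occurrence_matrix(texts, words):
    """Return rows = texts, columns = words, cells = occurrence counts."""
    columns = [strip_plural(w.lower()) for w in words]
    matrix = []
    for text in texts:
        tokens = [strip_plural(t) for t in text.lower().split()]
        matrix.append([tokens.count(c) for c in columns])
    return matrix

words = sorted(["map", "word", "patent"])  # word list sorted alphabetically
texts = ["Words and maps of words", "Patent maps", "Science base"]
matrix = occurrence_matrix(texts, words)
# Each row is one text; the columns follow the order map, patent, word.
```

In this toy example the third text contains none of the listed words, so its row consists of zeros only.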
b. coocc.dbf contains a co-occurrence matrix of the words from this same data. This matrix is symmetrical and it contains the words both as variables and as labels in the first field. The main diagonal is set to zero. The co-occurrence value of two words is the product of their occurrences in each text, summed over the texts. (The procedure is similar to using the file matrix.dbf as input to the routine “affiliations” in UCINET, except that the main diagonal is set to zero here.) The file coocc.dat contains this information in the DL-format.
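The co-occurrence computation can be sketched as follows, assuming that the products of occurrences are summed over the texts, as in an affiliations (one-mode) projection. The function name and the toy matrix are hypothetical; the sketch does not write the dBase or DL files.

```python
def cooccurrence_matrix(matrix):
    """Word-by-word co-occurrences from a text-by-word occurrence matrix."""
    n_words = len(matrix[0]) if matrix else 0
    cooc = [[0] * n_words for _ in range(n_words)]
    for row in matrix:                      # one row per text
        for i in range(n_words):
            for j in range(n_words):
                if i != j:                  # main diagonal stays zero
                    cooc[i][j] += row[i] * row[j]
    return cooc

# Toy occurrence matrix: three texts (rows) by three words (columns).
occurrences = [
    [1, 0, 2],
    [1, 1, 0],
    [0, 0, 0],
]
cooc = cooccurrence_matrix(occurrences)
```

For the toy data, the first and third words co-occur with value 2 because the third word occurs twice in the first text.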
c. cosine.dbf contains a normalized co-occurrence matrix of the words from the same data. Normalization is based on the cosine between the variables conceptualized as vectors (Salton & McGill, 1983). (The procedure is similar to using the file matrix.dbf as input to the corresponding routine in SPSS.) The file cosine.dat contains this information in the Pajek-format. The size of the nodes is equal to the logarithm of the occurrences of the respective word; this feature can be turned on in Pajek.
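The cosine normalization treats each word (column of the occurrence matrix) as a vector over the texts. A minimal Python sketch is given below; the function name is hypothetical, and two choices are assumptions rather than documented behavior of the program: the diagonal is left at the self-similarity value, and words that never occur receive a cosine of zero.

```python
import math

def cosine_matrix(matrix):
    """Cosine between the word columns of a text-by-word occurrence matrix."""
    columns = list(zip(*matrix))
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    n = len(columns)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] and norms[j]:       # skip words that never occur
                dot = sum(a * b for a, b in zip(columns[i], columns[j]))
                result[i][j] = dot / (norms[i] * norms[j])
    return result

# Toy occurrence matrix: three texts (rows) by three words (columns).
occurrences = [
    [1, 0, 2],
    [1, 1, 0],
    [0, 0, 0],
]
cosines = cosine_matrix(occurrences)
```

Unlike the raw co-occurrence counts, the cosine is bounded between 0 and 1 for non-negative frequencies, which makes values comparable across frequent and infrequent words.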
d. words.dbf contains for all words the following summations:
Corresponding files such as obs_exp.dbf, expected.dbf, and TfIdf.dbf are also generated with information at the cell level.
More advanced options
After running the routines, the program prompts with the question of whether one wishes additionally to run the same routines with observed/expected values. This generates obsexp.dbf (analogous to matrix.dbf), obsexp.txt (analogous to matrix.txt), and coocc_oe.dat and cos_oe.dat, analogous to the above input files for Pajek, but now containing or operating on the observed/expected values instead of the observed ones. Note that answering “y” (yes) doubles the processing time of the original routine; therefore, the default is “n”. The SPSS syntax file labels.sps is not changed.
Similarly, one can use (with the same variable labels) the file TfIdf.dbf, which contains the Tf-Idf values. The expected values are stored in expected.dbf. Obs_exp.dbf contains the signed (!) difference between observed and expected values at the cell level. (These are the non-standardized residuals of the chi-square.) The corresponding Pajek files can be generated by replacing the matrix values in cos_oe.dat with, for example, the cosine values of TfIdf.dbf. (Cosine values can be generated in SPSS under Analyze > Correlate > Proximity.) Alternatively, one can replace the non-normalized values directly in coocc_oe.dat. Note that the number of cases can be different when using the latter routine (of obs/exp) because rows with no values other than zero are removed in order to prevent divisions by zero in the computation.
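The relation between observed values, expected values, and residuals can be sketched in Python. The sketch assumes the usual chi-square expectation (row total times column total, divided by the grand total); the function name and the handling of empty columns are hypothetical, and the program's exact computation may differ.

```python
def observed_expected(matrix):
    """Return (rows kept, expected values, signed residuals, obs/exp ratios)."""
    # Rows with only zeros are dropped, as described in the text above.
    rows = [r for r in matrix if any(r)]
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    total = sum(row_totals)
    expected = [[rt * ct / total for ct in col_totals] for rt in row_totals]
    residuals = [[o - e for o, e in zip(obs, exp)]
                 for obs, exp in zip(rows, expected)]
    # Assumption: a word that never occurs (expected = 0) gets a ratio of 0.
    ratios = [[o / e if e else 0.0 for o, e in zip(obs, exp)]
              for obs, exp in zip(rows, expected)]
    return rows, expected, residuals, ratios

# Toy occurrence matrix: three texts (rows) by three words (columns).
occurrences = [
    [1, 0, 2],
    [1, 1, 0],
    [0, 0, 0],
]
kept, expected, residuals, ratios = observed_expected(occurrences)
```

In the toy data the third text is removed, so the resulting matrices have two rows instead of three; this is the change in the number of cases mentioned above.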
Examples of using these programs can be found in:
- Loet Leydesdorff, The University-Industry Knowledge Relationship: Analyzing Patents and the Science Base of Technologies, Journal of the American Society for Information Science and Technology (JASIST) 55(11) (2004), 991-1001; <pdf-version>
- Loet Leydesdorff & Iina Hellsten, Metaphors and Diaphors in Science Communication: Mapping the Case of ‘Stem-Cell Research’, Science Communication 27(1) (2005), 64-99. <pdf-version>
- Loet Leydesdorff & Kasper Welbers, The semantic mapping of words and co-words in contexts, Journal of Informetrics (2011; in press); preprint version available at http://arxiv.org/abs/1011.5209.
References
Leydesdorff, L. (1995). The Challenge of Scientometrics: The development, measurement, and self-organization of scientific communications. Leiden: DSWO Press, Leiden University; at http://www.upublish.com/books/leydesdorff-sci.htm .
Bornmann, L., & Leydesdorff, L. (2011). Which cities produce excellent papers worldwide more than can be expected? A new mapping approach—using Google Maps—based on statistical significance testing. Preprint available at http://arxiv.org/abs/1103.3216.
Mogoutov, A., Cambrosio, A., Keating, P., & Mustar, P. (2008). Biomedical innovation at the laboratory, clinical and commercial interface: A new method for mapping research projects, publications and patents in the field of microarrays. Journal of Informetrics, 2(4), 341-353.
Salton, G., & McGill, M. J. (1983). Introduction to Modern Information Retrieval. Auckland, etc.: McGraw-Hill.
Links to programs for (Porter’s) stemming:
http://maya.cs.depaul.edu/~classes/ds575/porter.html
http://snowball.tartarus.org/demo.php
Links to programs for parsing:
http://l2r.cs.uiuc.edu/~cogcomp/eoh/posdemo.html
http://l2r.cs.uiuc.edu/~cogcomp/shallow_parse_demo.php
http://nlp.stanford.edu:8080/parser/
http://alias-i.com/lingpipe/web/demos.html
php-versions of Porter’s stemmer:
http://www.chuggnutt.com/stemmer-source.php
http://www.phpguru.org/downloads/PorterStemmer/PorterStemmer.phps
http://webscripts.softpedia.com/scriptDownload/Porter-Stemming-Algorithm-Download-46193.html