TinyTAWC

TinyTAWC (TAWC stands for Text Analysis and Word Count; it is based on Kramer et al.'s original TAWC program) is a modern, lightweight implementation of a LIWC-compatible linguistic text analysis tool.

Just looking for the download link? Find TinyTAWC on GitHub.

Capabilities

TinyTAWC can parse standard LIWC-format dictionaries or advanced dictionaries containing full regular expressions. It will then count how many words in the input text, which can be any plaintext file, match a category given in the dictionary.
The results can be displayed in human-readable form, JSON format, or TinyTAWC machine-readable format. Match counts can be reported either as absolute numbers or as a percentage of the total word count.
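
To make the counting step concrete, here is a minimal Ruby sketch of LIWC-style category counting. It is not TinyTAWC's actual code: the tiny dictionary, the function names, the tokenization into lowercase a-z runs, and the handling of a trailing * as a prefix wildcard are all assumptions made for illustration.

```ruby
# Sketch only; not taken from TinyTAWC. Assumed dictionary layout:
# category definitions between two "%" lines, then one entry per line
# (a word, optionally ending in "*", followed by its category IDs).
# Real LIWC files separate fields with tabs; any whitespace works here.
SAMPLE_DICTIONARY = <<~DIC
  %
  01 HarmVirtue
  02 HarmVice
  %
  safe* 01
  protect* 01
  harm* 02
  war 02
DIC

# Returns [category names by ID, list of [pattern, category IDs]].
def parse_dictionary(text)
  names = {}
  entries = []
  in_header = false
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line == "%"
      in_header = !in_header
    elsif in_header
      id, name = line.split(/\s+/, 2)
      names[id] = name
    else
      word, *ids = line.split(/\s+/)
      stem = Regexp.escape(word.chomp("*"))
      # "harm*" matches any word starting with "harm"; "war" must match exactly.
      pattern = word.end_with?("*") ? /\A#{stem}/ : /\A#{stem}\z/
      entries << [pattern, ids]
    end
  end
  [names, entries]
end

# Counts, per category, how many words of the text match at least one entry.
def count_categories(text, entries)
  words = text.downcase.scan(/[a-z]+/) # assumed tokenization
  counts = Hash.new(0)
  words.each do |word|
    matched = entries.select { |pattern, _| pattern.match?(word) }
    matched.flat_map(&:last).uniq.each { |id| counts[id] += 1 }
  end
  counts["total"] = words.size
  counts
end

names, entries = parse_dictionary(SAMPLE_DICTIONARY)
counts = count_categories("War is harmful; protect the safe.", entries)
counts.each { |id, n| puts "#{n} #{id} #{names[id]}" }
```

Each word is counted at most once per category, even if several dictionary entries for that category match it.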

With the additional tools script that comes with TinyTAWC, you can easily combine multiple input files into a single file that can be processed in one run. Multi-text datasets can also be compared to each other after analysis with the tools script, showing you how much a category differs between two or more source texts (as a percent difference or a percentage-point difference). TinyTAWC will automatically conduct a statistical chi² test on those differences, giving you an indication of whether the observed differences are statistically significant.

For most operations, stdin and stdout can be used as data sources and output destinations, making it easy to integrate TinyTAWC into your workflow.

License & Installation

TinyTAWC is open source software licensed under the MIT License. Everyone is free to use, modify, and redistribute it. To use TinyTAWC, please refer to the installation instructions on GitHub. TinyTAWC runs natively on most macOS and Linux systems. For use on Windows, you'll have to install Ruby first.

Should you have any issues using TinyTAWC, please open an issue on GitHub. Thanks!

Examples & Data

As the official LIWC dictionaries can be hard to find, here are some common usage examples for TinyTAWC complete with example data and dictionaries. The example dictionary used here is the Moral Foundations Dictionary compiled by Jesse Graham and Jonathan Haidt.

Basic Usage

For the most basic word count scenario, we run ruby ttawc.rb --human --sort moralfoundations.dic aliceinwonderland.txt to get a sorted list of the categories that match words in Alice in Wonderland. This should output something like this:

count | category | category name (if present)
27328 total
69 11 (MoralityGeneral)
26 07 (AuthorityVirtue)
17 02 (HarmVice)
16 05 (IngroupVirtue)
8 01 (HarmVirtue)
8 03 (FairnessVirtue)
3 10 (PurityVice)
2 09 (PurityVirtue)
1 04 (FairnessVice)
1 08 (AuthorityVice)

If we would prefer word counts as a percentage of total words, rounded to three decimal places, we add the options --percent --round=3, which gives us this:

count | category | category name (if present)
27328 total
0.252% 11 (MoralityGeneral)
0.095% 07 (AuthorityVirtue)
0.062% 02 (HarmVice)
0.059% 05 (IngroupVirtue)
0.029% 01 (HarmVirtue)
0.029% 03 (FairnessVirtue)
0.011% 10 (PurityVice)
0.007% 09 (PurityVirtue)
0.004% 04 (FairnessVice)
0.004% 08 (AuthorityVice)
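
The percentages above follow directly from the absolute counts in the first listing. A minimal Ruby sketch of that arithmetic, where percent_of_total is a hypothetical helper name and not part of TinyTAWC:

```ruby
# Hypothetical helper: share of matching words as a percentage of all
# words, rounded to the given number of decimal places.
def percent_of_total(count, total, digits)
  (count.to_f / total * 100).round(digits)
end

total = 27_328 # total words in Alice in Wonderland, from the output above
{ "11" => 69, "07" => 26, "04" => 1 }.each do |category, count|
  puts "#{percent_of_total(count, total, 3)}% #{category}"
end
```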

Comparing datasets

Now we want to compare Alice in Wonderland to Kafka's The Metamorphosis. To do this, we create a single file containing both datasets with ruby tools.rb combine aliceinwonderland.txt metamorphosis.txt > comparison.txt. Then we use TinyTAWC's multi-line mode to analyze the file and save the results to a new file: ruby ttawc.rb --linebased moralfoundations.dic comparison.txt > results.txt. For comparisons, the results need to be absolute values, not percentages, so we leave out --percent here. At this point, results.txt should contain the word counts for both files in default TinyTAWC format.

%aliceinwonderland.txt 01:8 02:17 03:8 04:1 05:16 07:26 08:1 09:2 10:3 11:69 total:27328
%metamorphosis.txt 01:20 02:22 03:1 04:1 05:43 06:7 07:222 08:2 09:28 10:7 11:42 total:22375
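
If you want to post-process these result lines in your own scripts, they are easy to parse. The following Ruby sketch assumes the layout is exactly what the two lines above show (a dataset ID prefixed with %, followed by whitespace-separated category:count pairs); parse_result_line is a hypothetical name, not a TinyTAWC function.

```ruby
# Parses one result line into [dataset ID, { category => count }].
# The format is inferred from the example output, not from a specification.
def parse_result_line(line)
  id, *pairs = line.strip.split(/\s+/)
  counts = pairs.to_h do |pair|
    category, value = pair.split(":", 2)
    [category, Integer(value, 10)]
  end
  [id.delete_prefix("%"), counts]
end

line = "%aliceinwonderland.txt 01:8 02:17 03:8 04:1 05:16 07:26 08:1 09:2 10:3 11:69 total:27328"
name, counts = parse_result_line(line)
puts "#{name}: #{counts['11']} of #{counts['total']} words in category 11"
```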

Finally, we use the tools script to compare the two results and output the differences in percent, again rounded to three decimal places: ruby tools.rb --percent --round=3 compare results.txt

Comparing 1 datasets to master "aliceinwonderland.txt"
---- Master ----
"01": 0.029%
"02": 0.062%
"03": 0.029%
"04": 0.004%
"05": 0.059%
"07": 0.095%
"08": 0.004%
"09": 0.007%
"10": 0.011%
"11": 0.252%
"total": 27328.0
"06": 0.0%
---- metamorphosis.txt ----
"01": +205.341% (chi²=7.9)
"02": +58.059% (chi²=2.05)
"03": -84.733% (chi²=4.18)
"04": +22.136% (chi²=0.02)
"05": +228.241% (chi²=18.53)
"06": +Infinity% (chi²=8.55)
"07": +942.856% (chi²=199.39)
"08": +144.273% (chi²=0.57)
"09": +1609.908% (chi²=28.31)
"10": +184.985% (chi²=2.52)
"11": -25.656% (chi²=2.32)
"total": -18.124% (chi²=0.0)
[Warning] Statistics are tricky. Always sanity-check what you're doing

This result shows us, among other things, that The Metamorphosis has about 18% fewer words than Alice in Wonderland and that words relating to categories 05 (Ingroup Virtue), 07 (Authority Virtue), and 09 (Purity Virtue) are used significantly more often than in Alice in Wonderland. Also note the +Infinity% in category 06, which appears because Alice in Wonderland contains no words from that category. For more information on the chi² values, read up on chi² tests. Generally speaking, for a test with one degree of freedom such as this two-text comparison, differences with chi² > 3.84 are considered statistically significant at the p < 0.05 level.
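
The figures in the comparison can be checked by hand. The Ruby sketch below computes the relative difference of the per-word rates and a standard Pearson chi² statistic on a 2×2 table (matching vs. non-matching words in each text, without continuity correction). That this is how tools.rb calculates its values is an assumption, but it reproduces the numbers shown above for categories 01, 06 and 07. Both function names are hypothetical.

```ruby
# Relative difference between the two per-word rates, in percent.
# Yields Infinity when the master text has no matches at all.
def percent_difference(master_count, master_total, other_count, other_total)
  master_rate = master_count.to_f / master_total
  other_rate = other_count.to_f / other_total
  (other_rate / master_rate - 1) * 100
end

# Pearson chi² for the 2x2 table
#   [matches in master, non-matches in master]
#   [matches in other,  non-matches in other ]
def chi_squared(master_count, master_total, other_count, other_total)
  a = master_count
  b = master_total - master_count
  c = other_count
  d = other_total - other_count
  n = a + b + c + d
  denominator = (a + b) * (c + d) * (a + c) * (b + d)
  return Float::NAN if denominator.zero?
  n * (a * d - b * c)**2 / denominator.to_f
end

# Category 01: 8 of 27328 words in Alice, 20 of 22375 in The Metamorphosis.
puts percent_difference(8, 27_328, 20, 22_375).round(3)
puts chi_squared(8, 27_328, 20, 22_375).round(2)
```

Keep in mind that the 3.84 threshold applies to a single comparison; when many categories are tested at once, some will cross it by chance.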

As a side note, the same result could have been achieved by piping the output of each command directly into the next: ruby tools.rb combine alice.txt meta.txt | ruby ttawc.rb --linebased moral.dic | ruby tools.rb --percent --round=3 compare

Conversational Data

You can also use the multi-line mode from the previous section to analyze 'conversational' input data such as chat logs or conversation transcripts. To do this, you will have to get your input data into a format where the first word of each line is the name of the person speaking in that line. In this example we'll use some mock data from Ex Machina that is formatted to come quite close to WhatsApp chat logs:


[12:34 8/4/2018] Nathan: Caleb. 
[12:34 8/4/2018] Nathan: Caleb Smith. 
[12:34 8/4/2018] Caleb: ... Hi. 
[12:34 8/4/2018] Nathan: Dude. I've been so looking forward to this. Come in, come in. 
[12:34 8/4/2018] Nathan: You want something to eat or drink after your journey? 
[12:34 8/4/2018] Caleb: No. Thank you. I'm fine.
(...)

As the time information at the beginning of each line does not contain any characters that TinyTAWC considers words, we can easily get rid of it using the tools script to clean the file while preserving linebreaks: ruby tools.rb --keeplines clean conversation.txt > conv_clean.txt


Nathan Caleb 
Nathan Caleb Smith 
Caleb Hi 
Nathan Dude I ve been so looking forward to this Come in come in 
Nathan You want something to eat or drink after your journey 
Caleb No Thank you I m fine 
(...)
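
The effect of the cleaning step can be approximated in a few lines of Ruby. This sketch assumes that only runs of ASCII letters count as words and that everything else becomes a single space; the actual rules in tools.rb may differ (the output above keeps a trailing space, for example, which this version strips). clean_line is a hypothetical name.

```ruby
# Replaces every run of non-letter characters with one space and trims
# the ends, so timestamps and punctuation disappear while words remain.
def clean_line(line)
  line.gsub(/[^A-Za-z]+/, " ").strip
end

chat = <<~CHAT
  [12:34 8/4/2018] Nathan: Caleb Smith.
  [12:34 8/4/2018] Caleb: No. Thank you. I'm fine.
CHAT

# Clean line by line so that line breaks are preserved.
chat.each_line { |line| puts clean_line(line) }
```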

If the names you are using are made up of multiple words, you might need to do a search & replace in your text editor of choice now to create single-word identifiers for each speaker. Once the input data is prepared, we can analyze it just as we would analyze a single input file: ruby ttawc.rb --linebased --human --sort moralfoundations.dic conv_clean.txt. The --linebased argument tells TinyTAWC to interpret the first word of every line as the ID of the dataset that line belongs to. Because our data has a unique name as the first word of each line, TinyTAWC creates a separate dataset for every name and outputs stats for every member of the conversation.


count | category | category name (if present)
---- Nathan ----
564 total
4 11 (MoralityGeneral)
2 05 (IngroupVirtue)
1 03 (FairnessVirtue)
1 07 (AuthorityVirtue)
---- Caleb ----
183 total
3 11 (MoralityGeneral)
1 09 (PurityVirtue)

At this point, you can of course use this data (in non-human format) to compare the results to each other just as we did in the previous section.
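
The grouping that --linebased performs can be pictured with the following Ruby sketch. It only illustrates the idea described above: the function name is hypothetical, and whether TinyTAWC counts the ID word itself towards a dataset's total is not stated here, so this version leaves it out.

```ruby
# Groups the words of each line under the line's first word (the ID).
# Assumption: the ID itself is not part of the dataset's words.
def group_by_first_word(text)
  datasets = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |line|
    id, *words = line.split
    next if id.nil? # skip blank lines
    datasets[id].concat(words)
  end
  datasets
end

conversation = <<~TEXT
  Nathan Caleb
  Nathan Caleb Smith
  Caleb Hi
  Caleb No Thank you I m fine
TEXT

group_by_first_word(conversation).each do |id, words|
  puts "---- #{id} ----"
  puts "#{words.size} total"
end
```

Each resulting word list could then be run through the category counting on its own, which is what produces one block of statistics per speaker.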

Advanced options

When you're working with very large dictionaries, you might find it useful to analyze only a subset of the categories in a dictionary. Say we'd like to restrict our analysis of The Metamorphosis to the categories 07 (Authority Virtue) and 08 (Authority Vice). Calling TinyTAWC with the --include="07,08" option gives us the following output:


count | category | category name (if present)
22375 total
222 07 (AuthorityVirtue)
2 08 (AuthorityVice)

To dig a bit deeper and find out why so many words fall into the Authority Virtue category, we take a look at the words matching category 07: ruby ttawc.rb --include="07" --show-matching moralfoundations.dic metamorphosis.txt


position ~= 07
position ~= 07
mother ~= 07
mother ~= 07
mother ~= 07
father ~= 07
father ~= 07
control ~= 07
order ~= 07
father ~= 07
(...)
07:222 total:22375

From looking at this output, it becomes quite obvious that the spike in Authority Virtue is mostly due to the role of family in The Metamorphosis.

If you only want to exclude certain categories from your analysis, the --exclude option can be used in just the same way.

Download all files used in these examples here: TinyTAWC_examples.zip (102kB)