Text Analysis Assignment

From Knowledge Kitchen


You work evenings as an unpaid intern in a historical semantics research institute. Your boss, an absent-minded computer-illiterate professor of linguistics (a vestige from a bygone era), would like you to build a program to analyze the frequency of verbal tics in historical recorded speech.

The institute has hired a firm based in Hyderabad, India, to convert the recorded speech into text. Your job is to take these text transcripts and analyze them for verbal tics.

The program

The program must be able to open any text file specified by the user and analyze the frequency of verbal tics in the text. Since there are many different kinds of verbal tics (such as "like", "uh", "um", "you know", etc.), the program must ask the user which tics to look for. A user can enter multiple tics, separated by commas; any spaces entered before or after each tic must be ignored.
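As a rough illustration of the input handling described above, the sketch below splits the user's response on commas and trims the spaces around each entry. The class and method names (TicInput, parseTics) are hypothetical, and lowercasing each tic here is an assumption made to simplify a later case-insensitive search.

```java
class TicInput {
    // Splits a comma-separated response into a plain array of tics.
    // Spaces before and after each tic are dropped; spaces inside a
    // multi-word tic such as "you know" are kept.
    static String[] parseTics(String line) {
        String[] tics = line.split(",");
        for (int i = 0; i < tics.length; i++) {
            // Assumption: lowercase now so later comparisons can ignore case.
            tics[i] = tics[i].trim().toLowerCase();
        }
        return tics;
    }
}
```

For instance, calling parseTics on the response " uh, like ,you know" would give an array holding "uh", "like" and "you know".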

The program should output:

  • the total number of tics found in the text
  • the density of tics (proportion of all words in the text that are tics)
  • the frequency of each of the verbal tics
  • the percentage of the total number of tics that each tic represents
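The arithmetic behind these four outputs can be sketched as follows. This is only one possible reading: the names (TicStats and its methods) are hypothetical, and it assumes that each occurrence of a tic counts once toward the total, even when the tic is a multi-word phrase.

```java
class TicStats {
    // Sums the per-tic counts to get the total number of tics found.
    static int totalTics(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Proportion of all words that are tics, rounded to two decimal places.
    // Assumption: a multi-word tic counts as one tic per occurrence.
    static double density(int totalTics, int totalWords) {
        if (totalWords == 0) {
            return 0.0; // avoid dividing by zero on an empty file
        }
        double raw = (double) totalTics / totalWords;
        return Math.round(raw * 100) / 100.0;
    }

    // Share of all tics taken by one tic, as a whole-number percentage.
    static long percentOfTics(int count, int totalTics) {
        if (totalTics == 0) {
            return 0;
        }
        return Math.round(100.0 * count / totalTics);
    }
}
```

With made-up counts of 19, 17 and 22 in a 290-word text, the total is 58 and the density is 58 / 290 = 0.2.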

Example input & output

This example shows suggested input and output for such a program. User responses follow each prompt. The data analysis numbers are placeholders only and are not meant to be 'real'.

What file would you like to open? Resemble_Jammed_Inauguration_Speech.txt
What words would you like to search for? uh, like, um, you know, so

...............................Analyzing text.................................

Total number of tics: 66
Density of tics: 0.2

...............................Tic breakdown..................................

uh        /  19 occurrences  /  21% of all tics
like      /  17 occurrences  /   6% of all tics
um        /  22 occurrences  /  21% of all tics
you know  /  63 occurrences  /  32% of all tics
so        /  18 occurrences  /  20% of all tics
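One way to get rows that line up like the breakdown above is String.format with a left-justified, fixed-width first column. The sketch below is a suggestion only: the names (TicReport, longest, formatRow) are hypothetical, and padding the first column to the length of the longest tic is an assumption about how the example layout was produced.

```java
class TicReport {
    // Length of the longest tic, used as the width of the first column.
    static int longest(String[] tics) {
        int width = 1; // a format width of 0 is not allowed
        for (String tic : tics) {
            width = Math.max(width, tic.length());
        }
        return width;
    }

    // Builds one breakdown row: tic left-justified in its column,
    // count and percentage right-justified in two characters each.
    static String formatRow(String tic, int count, long percent, int width) {
        String pattern = "%-" + width + "s  /  %2d occurrences  /  %2d%% of all tics";
        return String.format(pattern, tic, count, percent);
    }
}
```

Each row can then be printed with System.out.println(TicReport.formatRow(tic, count, percent, width)), where width comes from a single call to longest on the array of tics.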

Program Requirements

  1. The program must be able to analyze any text file, but an example file must be included in the submission
  2. The user must be able to enter as many tics as they would like, separated by commas, with or without spaces
  3. Those tics can be either single words, such as "uh" and "like", or multi-word phrases, such as "uh huh" and "you know"; the code must accommodate both.
  4. The list of tics entered by the user must be stored in an array, not an ArrayList or any other array-like data structure
  5. You must use separate methods for each component of the analysis. At the least, this includes: 1) opening the file and importing its contents, 2) soliciting tic words or phrases from the user and separating them, 3) counting the occurrences of each tic, 4) calculating the percent of all tics that each tic consumes, and 5) calculating tic density.
  6. The output must be formatted so that all output lines up nicely as in the example
  7. The search for tics must be case insensitive
  8. Round all occurrence counts and percentages to the nearest integer
  9. Round the density to two decimal places
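The counting step (requirements 3 and 7) could be sketched with a regular expression, as below. This is not the only acceptable approach, and the names (TicCounter, countOccurrences, countAll) are hypothetical. It assumes that a tic should match only as a whole word, so "so" is not counted inside "also", and that the words of a multi-word tic may be separated by any whitespace, including a line break.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TicCounter {
    // Counts whole-word, case-insensitive occurrences of one tic in the text.
    static int countOccurrences(String text, String tic) {
        String cleaned = tic.trim();
        if (cleaned.isEmpty()) {
            return 0; // an empty entry (e.g. from "uh,,um") matches nothing
        }
        // Build \bword1\s+word2\b so multi-word tics tolerate line breaks.
        String[] words = cleaned.split("\\s+");
        StringBuilder regex = new StringBuilder("\\b");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("\\s+");
            }
            regex.append(Pattern.quote(words[i]));
        }
        regex.append("\\b");
        Matcher m = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Counts every tic in the array; the result is parallel to the input.
    static int[] countAll(String text, String[] tics) {
        int[] counts = new int[tics.length];
        for (int i = 0; i < tics.length; i++) {
            counts[i] = countOccurrences(text, tics[i]);
        }
        return counts;
    }
}
```

Note that each tic is counted independently in this sketch, so an "uh huh" in the text would also add one to the count for "uh" if both were entered. Whether that is the desired behavior is left open by the requirements.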


You may find it useful to split strings by more than one separator, for example by several punctuation marks at once, using the split() function.
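As a minimal sketch of that idea, split() accepts a regular expression, so a character class can list several separators at once. The names (WordSplitter, splitWords) and the particular set of separators are assumptions chosen for illustration; a real transcript may need other characters added.

```java
class WordSplitter {
    // Whitespace plus a handful of common punctuation marks.
    // Apostrophes are left out so that words like "don't" stay whole.
    private static final String SEPARATORS = "[\\s.,;:!?\"()]+";

    // Splits text into words on any run of separator characters.
    static String[] splitWords(String text) {
        // Remove separators at the start first; otherwise split() would
        // return an empty string as the first element.
        String trimmed = text.replaceAll("^" + SEPARATORS, "");
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        // Trailing empty strings are dropped by split() on its own.
        return trimmed.split(SEPARATORS);
    }
}
```

The length of the returned array gives a word count for the text, which is the denominator needed for the density calculation.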
