The text I analyzed is a digitized, translated version of the original Chinese Confucian Analects. Confucius was a famous Chinese philosopher whose teachings deeply influenced East Asian culture. The text is available in HTML and plain-text versions. I focused mainly on the text itself.
About the Project
Sources
For my project, I aimed to extract a few abstract themes about ways of thinking in ancient China without reading the full text, which has about 3,500 lines and 30,000 words. To visualize themes, I could have used Voyant Tools and gotten the final output without much cleaning. Instead, I wrote a data cleanup program in Java to parse the text. My CS class had a homework project on creating a word counter and printing a word cloud, so I had a general idea of how to clean up the data. I explained the steps to clean the file in this GitHub repository. Here is a snippet of the overview:
The DataCleanup program takes two text files: text-to-analyze and stopwords. It processes the text-to-analyze file by removing non-alphabetical characters, converting all words to lowercase, and adding them to a list. It then processes the stopwords, commonly used words in English, adds them to a set, and removes that set of words from the list of words. Finally, it writes the cleaned data to a new text file containing only words and no stopwords.
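The cleanup steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual DataCleanup source: the class name `CleanupSketch` and the method `clean` are hypothetical, the text is passed in as a string rather than read from a file, and the sketch assumes that "removing non-alphabetical characters" means stripping everything except the letters a to z from each whitespace-separated token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the cleanup steps; not the author's DataCleanup class.
class CleanupSketch {

    // Lowercases the text, strips non-letter characters from each token,
    // and keeps only the non-empty tokens that are not stopwords.
    static List<String> clean(String text, Set<String> stopwords) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            String word = token.replaceAll("[^a-z]", "");
            if (!word.isEmpty() && !stopwords.contains(word)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Assumed sample input; the real program reads both inputs from files.
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "to", "chap"));
        System.out.println(clean("CHAP. I. The Master said, 'To learn...'", stopwords));
    }
}
```

Keeping the stopwords in a `Set` makes each membership check cheap, which matches the overview's choice of a set for stopwords and a list for the words that remain.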
I created another Java program to analyze the cleaned-up data. I focused on word frequency because I wanted to see what the text was mostly about. At this stage I made many changes to the stopwords file to build a more specific list, adding words such as “chap,” “gutenberg,” and “project.” Here is another overview of the program:
The DataAnalyze program takes the cleaned text produced by the DataCleanup program. It calculates word frequency by storing each word in a map with its count and ordering the entries by count. It then writes the 10 most frequent and 10 least frequent words, along with their counts, to a new text file.
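The frequency step can be illustrated with a short sketch. The names `FrequencySketch`, `countWords`, `sortByCount`, `top`, and `bottom` are hypothetical and not taken from the DataAnalyze source. The sketch also assumes ties are broken alphabetically, and it prints to the console instead of writing a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the word-frequency step; not the author's DataAnalyze class.
class FrequencySketch {

    // Counts how many times each word occurs.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Returns entries sorted by descending count; ties are broken alphabetically (an assumption).
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    // The n most frequent entries of an already sorted list.
    static List<Map.Entry<String, Integer>> top(List<Map.Entry<String, Integer>> sorted, int n) {
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    // The n least frequent entries of an already sorted list.
    static List<Map.Entry<String, Integer>> bottom(List<Map.Entry<String, Integer>> sorted, int n) {
        return sorted.subList(Math.max(0, sorted.size() - n), sorted.size());
    }

    public static void main(String[] args) {
        // Assumed sample input standing in for the cleaned text file.
        List<String> words = Arrays.asList("master", "said", "man", "master", "people", "man", "master");
        List<Map.Entry<String, Integer>> sorted = sortByCount(countWords(words));
        System.out.println("Most frequent: " + top(sorted, 2));
        System.out.println("Least frequent: " + bottom(sorted, 2));
    }
}
```

Sorting the entries once and slicing both ends of the list keeps the top and bottom lists consistent with each other. In a real text many words occur exactly once, so which ones land in the "least frequent" list depends entirely on the tie-breaking rule.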
The final output of the code is here. It is not as visually appealing as the output of Voyant Tools, so I also attached a word cloud of the text. The results shared by the word cloud and my analysis are “master,” “man,” and “people.”
Significance of the Project
Meaning of the Most Frequent Words
The word “master” is the most frequent, which implies that the Analects impart wisdom from one person to another person or group of people. A master is what people call a teacher or skilled person. Together with the frequency of “man” and “people,” this suggests that Confucius was concerned with the virtue and propriety of people. The visualization connects to Digital Humanities by revealing the prominent points of the text rather than presenting a plain list of words and numbers.