Book analysis with AI techniques
About this lesson
This learning sequence explores text analysis through Natural Language Processing, a significant application of Artificial Intelligence. Teachers and students are led through a series of video tutorials to develop a Python program that can break down and analyse the content of a complete text, and use sentiment analysis to attempt to identify the villain(s) and hero(es).
Year band: 7-8, 9-10
Curriculum Links
Links with Digital Technologies Curriculum Area
Strand: Processes and Production Skills

Years 7-8:
- Analyse and visualise data using a range of software, including spreadsheets and databases, to draw conclusions and make predictions by identifying trends (AC9TDI8P02)
- Define and decompose real-world problems with design criteria and by creating user stories (AC9TDI8P04)
- Design the user experience of a digital system (AC9TDI8P07)
- Generate, modify, communicate and evaluate alternative designs (AC9TDI8P08)
- Design algorithms involving nested control structures and represent them using flowcharts and pseudocode (AC9TDI8P05)
- Trace algorithms to predict output for a given input and to identify errors (AC9TDI8P06)

Years 9-10:
- Analyse and visualise data interactively using a range of software, including spreadsheets and databases, to draw conclusions and make predictions by identifying trends and outliers (AC9TDI10P02)
- Define and decompose real-world problems with design criteria and by interviewing stakeholders to create user stories (AC9TDI10P04)
- Design and prototype the user experience of a digital system (AC9TDI10P07)
- Evaluate existing and student solutions against the design criteria, user stories, possible future impact and opportunities for enterprise (AC9TDI10P10)
- Design algorithms involving logical operators and represent them as flowcharts and pseudocode (AC9TDI10P05)
- Implement, modify and debug modular programs, applying selected algorithms and data structures, including in an object-oriented programming language (AC9TDI10P09)
Assessment
Each part of this learning sequence builds on the previous part. Students can be encouraged to type up their own code as demonstrated in the videos. (Links to milestone completed programs are provided after each part, in case of student absences or confusion.)
A number of project ideas are suggested at the end of the sequence, ranging from simple to highly ambitious. Students may collect appropriate text, prepare program designs and/or implement coded programs in response to these prompts.
In assessing code in languages like Python, consider a rubric that covers skills important for general purpose programming.
Learning sequence
- Overview
- Part 1: Setup
- Part 2: Removing punctuation
- Part 3: Tokenisation
- Part 4: Modular programming
- Lecture: Sentiment analysis
- Part 5: Testing sentiment analysis
- Part 6: How many times does each word appear?
- Part 7: Ranking words by frequency
- Part 8: Removing stop words
- Part 9: Heroes and villains
- Projects / Assessment
- Resources
Overview
View the Overview video for more information on this learning sequence, including a short lecture on Natural Language Processing. (Check the Resources section for links to more information on advanced concepts such as Part Of Speech Tagging and Lexicon Normalisation.)
Part 1: Setup
View the video Setup. This will introduce the repl.it coding environment, and the process for obtaining the text for Alice in Wonderland from the online repository Project Gutenberg.
Questions for discussion
The presenter uses the site Project Gutenberg to obtain the text for Alice in Wonderland. Why is this site used?
The presenter refers to repl.it as a Python IDE. What is an IDE, and why do we use one?
A: IDE stands for Integrated Development Environment. Unless you are coding in Windows Notepad, you are probably using an IDE. IDEs gather the tools a programmer needs to type out code, test the program with the push of a button, and more easily identify errors. Code text is usually colour-coded to help with readability, documentation is readily available and syntax errors are underlined automatically, similarly to how spelling or grammar errors are identified in a modern word processor. Advanced IDEs also provide drag-and-drop tools to create graphical user interfaces.
When obtaining Alice in Wonderland, the presenter selects the Plain Text format. Why might other formats like HTML, EPUB or Kindle be unsuitable for text analysis?
Part 2: Removing punctuation
View the video Removing punctuation. In this part, we write and test a function to remove all punctuation from the text, so that our text analysis can focus on words only.
(Completed code up to this point.)
Questions for discussion
The goal in this part is to remove all punctuation so that only words remain. Could this hinder our efforts to analyse the text for sentiment?
A: Punctuation can certainly change the meaning of a string of words (eg. "Give to charity. Please no presents." vs. "Give to charity? Please, no. Presents!"). The sentiment analysis in this learning sequence is less sophisticated than some other approaches: it relies on the volume and strength of certain words in association with other words.
The presenter demarcates the string of punctuation characters using triple apostrophes at the start and end, eg. '''#%^&*'".,_-=()''' Why was this necessary?
The code for removing punctuation is placed inside a function. What are some advantages of this approach?
A: Functions allow a programmer to organise things by separating some code from the main program, so that section of code can be called (run) whenever required. Parameters then allow the main program to supply different values to the function. In the video, a parameter st is added so that any book text can be supplied to the remove_punctuation(st) function.
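As a sketch of what such a function might look like (the exact punctuation string and code in the video may differ):

```python
def remove_punctuation(st):
    """Return a copy of st with all punctuation characters removed."""
    punctuation = '''!()-[]{};:'",<>./?@#$%^&*_~'''
    result = ''
    for ch in st:
        if ch not in punctuation:
            result += ch
    return result

print(remove_punctuation("Hello, world!"))  # Hello world
```

Because the function takes a parameter, the same code can later be reused on any book text, not just Alice in Wonderland.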
The presenter names the function remove_punctuation(…). Are there rules about naming functions and variables in Python?
A: The naming of functions and variables is very important for code readability, so that other programmers can understand your code. One useful approach is to name functions with verbs, since they are performing an action. (Note: Using underscores in names is a Python convention rather than a rule. In other languages, like JavaScript, the convention is to use 'camel case', eg. removePunctuation(…))
Skill review
Functions will be used throughout this learning sequence, including functions with parameters and return variables.
View the video Intro to Functions in Python for a brief introduction to writing and using functions.
Part 3: Tokenisation
View the video Tokenisation 1, noting the minor changes made manually to the text of Alice in Wonderland before the coding begins:
- The word CHAPTER has been added before each chapter heading, eg. “CHAPTER I--DOWN THE RABBIT-HOLE”.
- The Gutenberg license text has been removed from the end of the file.
In this part, we write new functions to break up the book text into a list of words, or into a list of sentences.
(Completed code up to this point.)
Questions for discussion
When writing the create_word_list(…) function, most of the work is done by a built-in string function called split(). What exactly does split() do? (Hint: you can always look up the official Python documentation, or a Python cheat sheet.)
A: Python's built-in split(…) function breaks up a string, resulting in a list that contains each separate part. If no argument is given, as in the case of creating a list of words, the string is split wherever a space occurs. A different character can be provided as an argument, such as a full stop ('.') to break up the string by sentences.
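The two behaviours can be seen side by side (the sample strings here are illustrative only):

```python
text = "so she was considering in her own mind"
words = text.split()      # no argument: split wherever whitespace occurs
print(words)              # ['so', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind']

passage = "Down the rabbit-hole. A pool of tears."
sentences = passage.split('.')   # split on full stops instead
print(sentences)          # ['Down the rabbit-hole', ' A pool of tears', '']
```

Note the leading spaces and the empty string at the end of the sentence list: splitting on full stops leaves these behind, which is one reason tokenising code often needs a little extra tidying.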
During the video, the presenter decides to improve the remove_punctuation(…) function. What change is made?
A: An additional parameter called exception is added. This allows one punctuation character to be designated to not be removed along with all the others. Inside the function, the exception character is immediately removed from the punctuation string, so that it does not get removed from the text itself.
In the improved remove_punctuation(…) function, what does it mean that the new parameter is written as exception=''?
A: When parameters are written with an = sign, this means there is a default value. It allows someone to call the function without having to supply an argument for that parameter. In this case, exception has a default value of an empty string ''. With this value for the parameter, the punctuation string will remain intact and the removal of punctuation will work as normal.
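A sketch of the improved function, assuming the same overall shape as the video (the punctuation string is illustrative):

```python
def remove_punctuation(st, exception=''):
    """Remove punctuation from st, except for one optional character."""
    punctuation = '''!()-[]{};:'",<>./?@#$%^&*_~'''
    if exception:
        # drop the exception character from the punctuation string,
        # so it survives in the text
        punctuation = punctuation.replace(exception, '')
    return ''.join(ch for ch in st if ch not in punctuation)

# full stops survive, so the text can still be split into sentences later:
print(remove_punctuation("Down, down. Would the fall never end?", exception='.'))
```

Calling the function without a second argument, eg. remove_punctuation(book_text), still removes every punctuation character, because the default empty string leaves the punctuation string intact.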
Now view the video Tokenisation 2. In this video, we write two more functions to break up the book text into a list of paragraphs, or into a list of chapters. This completes our library of functions for tokenising the book's text.
(Completed code up to this point.)
Skill building
Working with large bodies of text usually requires a bit of manual editing.
View the video Text File Preparation for more details on obtaining and editing the most suitable book text of Alice in Wonderland.
Part 4: Modular programming
View the video Modular programming. In this part, the functions we've created are separated into a different file.
(Completed code up to this point.)
Questions for discussion
Besides cleaner main programs, what other advantages might come from this modular approach of placing groups of functions into separate files?
A: A modular approach to coding makes it easier for different programmers to work on a project. Programmer A can use a function from a file written by Programmer B without needing to see the code inside it, as long as there is documentation explaining how to use the function. Programmer B can make changes to the internal code in a function without necessarily disturbing Programmer A.
In the video, the presenter uses the import statement to connect the functions from the new tokenization.py file into the main program. Have you used import to access other modules in the past?
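For illustration only, the snippet below writes a tiny tokenization.py module to disk and then imports it, mimicking the modular structure used in the video. (In the real project, tokenization.py already exists alongside the main program, and its create_word_list function is the full version written in Part 3.)

```python
# Create a minimal tokenization.py module for demonstration purposes.
with open('tokenization.py', 'w') as f:
    f.write("def create_word_list(text):\n    return text.split()\n")

import tokenization  # Python looks for tokenization.py in the current folder

print(tokenization.create_word_list("down the rabbit hole"))
# ['down', 'the', 'rabbit', 'hole']
```

Importing your own file works just like importing one of Python's built-in modules; you can also pull in specific names with, eg., from tokenization import create_word_list.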
Lecture: Sentiment analysis
View the video Lecture – Sentiment Analysis to:
- discover interesting connections between linguistics and digital technologies,
- import a Python module that makes it easy to incorporate sentiment analysis into your own programs,
- explore two numbers for measuring sentiment: polarity and subjectivity.
Lectures are primarily intended for a teacher audience, but you may choose to view the video again with your students.
Part 5: Testing sentiment analysis
View the video Testing Sentiment Analysis. In this part, the TextBlob module is used to attempt to rate the polarity and subjectivity of sentences, paragraphs and chapters. (For an explanation of these two concepts, be sure to view the lecture in the previous section.)
(Completed code up to this point.)
Questions for discussion
What does polarity measure?
What is the numerical range of the polarity value?
What does subjectivity measure?
A: Subjectivity can be thought of as a measure of how emotive the language in a block of text seems to be. A high value means that the text contains a high number of adjectives and nouns that are associated with strong feelings from the writer. A low value means that the text can be thought of as more “clinical” or objective, containing fewer emotive words.
What is the numerical range of the subjectivity value?
Do you think this way of analysing text sentiment is foolproof?
A: Despite its basis in linguistics research, this is still a crude means of determining sentiment when compared to a mature human’s analysis. This demonstrates the complexity of human communication. (Try inputting a sarcastic sentence, for example.) Developments in Machine Learning may result in more reliable sentiment analysis results.
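The TextBlob calls themselves are shown in the video. To make the two numbers concrete, here is a toy scorer built on an invented mini-lexicon (TextBlob's real lexicon and formulae are far richer). Like TextBlob, it returns a polarity between -1.0 and 1.0 and a subjectivity between 0.0 and 1.0:

```python
# Invented mini-lexicon for illustration: word -> (polarity, subjectivity).
LEXICON = {
    'wonderful': (1.0, 1.0),
    'curious':   (0.4, 0.9),
    'terrible': (-1.0, 1.0),
    'large':     (0.0, 0.3),
}

def toy_sentiment(sentence):
    """Average the (polarity, subjectivity) pairs of any known words."""
    pairs = [LEXICON[w] for w in sentence.lower().split() if w in LEXICON]
    if not pairs:
        return (0.0, 0.0)   # no known words: neutral and objective
    polarity = sum(p for p, s in pairs) / len(pairs)
    subjectivity = sum(s for p, s in pairs) / len(pairs)
    return (polarity, subjectivity)

print(toy_sentiment("What a wonderful curious dream"))
print(toy_sentiment("A large terrible queen"))
```

Trying a sarcastic sentence on a word-by-word scorer like this quickly shows why the approach is not foolproof: the individual words may all score as positive even when the overall meaning is not.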
Part 6: How many times does each word appear?
View the video Frequency of Words 1. In this part, we write and test a function to store each unique word in the book alongside the number of times that word appears in the book. In Part 7, we’ll rank these to find out the most frequent words in the book.
(Completed code up to this point.)
Skill review
A dictionary data structure is used to hold the data for the most frequent words. This data structure is sometimes called an associative array or a map in other programming languages.
View the video Intro to Dictionaries for a brief introduction to the dictionary data structure.
These short exercises will help you practise using Python Dictionaries:
- Exercise 1 – accessing elements and printing (solution)
- Exercise 2 – creating dictionaries and adding elements (solution)
- Exercise 3 – accessing elements with a loop (solution)
Questions for discussion
Why is the dictionary data structure suitable for storing each unique word in the book with its frequency?
Couldn’t two lists be used instead of a dictionary?
A: It would be feasible to use two parallel lists. One list would contain all the unique words and the other list would contain all the frequencies. This may not be considered an ideal solution because it relies on the lists always being kept in sync. For example, removing an element from one list means the same element must be removed from the other list. This leaves room for a programmer to forget, and then you have a bug.
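To make the sync problem concrete, here is a hypothetical parallel-lists version (the words and counts are invented). Removing a word from one list without updating the other silently corrupts the data:

```python
words = ['alice', 'rabbit', 'queen']
counts = [386, 51, 68]

# a word's frequency is found by matching positions across the two lists:
print(counts[words.index('rabbit')])  # 51

# danger: removing from one list but forgetting the other
words.remove('rabbit')
# now counts[words.index('queen')] returns 51, which is rabbit's old count,
# not queen's actual count of 68
```

A dictionary keeps each word and its count in a single entry, so this class of bug cannot happen.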
The function create_frequency_dictionary(words) has a single parameter words. What is expected to be given here?
The function create_frequency_dictionary(words) has two sections of code inside, each with a loop. The first section builds a list of unique words from the complete list of the book’s words. What does the second section do?
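A sketch of the two-section structure described above (the video's exact code may differ):

```python
def create_frequency_dictionary(words):
    """Map each unique word in words to the number of times it appears."""
    # first section: build a list of the unique words
    unique_words = []
    for word in words:
        if word not in unique_words:
            unique_words.append(word)
    # second section: count how often each unique word appears
    frequency = {}
    for word in unique_words:
        frequency[word] = words.count(word)
    return frequency

print(create_frequency_dictionary(['down', 'the', 'rabbit', 'hole', 'the']))
# {'down': 1, 'the': 2, 'rabbit': 1, 'hole': 1}
```

The parameter words is expected to be the full list of the book's words, as produced by the create_word_list(…) function from Part 3.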
Now view the video Frequency of Words 2. In this video, we make some improvements to our function. Now, the dictionary will be constructed ignoring the case of the words from the book. So, “Rabbit” and “rabbit” will now be considered one unique word, and the frequency value will reflect all instances of both words.
But, if the function’s new second argument cap is set to True, something quite different will happen. The dictionary will be constructed with only the Title Case words from the book, such as proper nouns. So, “rabbit” will not be included at all, but words like “Rabbit” and “Alice” will be included.
During this video, the presenter makes use of a Python shortcut called a List Comprehension, which allows a list to be quickly made from another list without many lines of code. This is not an essential skill and is merely done for convenience. See this external tutorial for more information on List Comprehensions.
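As a small illustration of the shortcut, filtering a word list down to its Title Case words takes one line with a list comprehension, versus several with an ordinary loop:

```python
words = ['Alice', 'rabbit', 'Queen', 'hatter']

# list comprehension: one line
capitalised = [w for w in words if w.istitle()]
print(capitalised)  # ['Alice', 'Queen']

# the equivalent ordinary loop
capitalised2 = []
for w in words:
    if w.istitle():
        capitalised2.append(w)
```

Both versions produce the same list; the comprehension is simply more compact.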
Part 7: Ranking words by frequency
View the video Ranking Words by Frequency. In this short part, we make use of a dictionary created with the function we wrote in part 6. The dictionary contains unique words from the book alongside how often they appear in the book. Now we will sort those entries. The result is a simple list of the words ordered by frequency.
(Completed code up to this point.)
Questions for discussion
What is the purpose of the sorted(…) function used in this video?
A: The sorted(…) function is a powerful function built into Python (see this article from the Python documentation for more). Here we use it to sort the unique words from the frequency dictionary, from most to least common according to their frequencies. The result is not another dictionary, but a simple list called ranked_list.
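A sketch of the idea with an invented mini-dictionary (the video's code may differ in detail):

```python
frequency = {'alice': 386, 'rabbit': 51, 'queen': 68}   # invented counts

# sorted() sorts the dictionary's keys; key=frequency.get ranks each word
# by its frequency value, and reverse=True puts the most common word first
ranked_list = sorted(frequency, key=frequency.get, reverse=True)
print(ranked_list)  # ['alice', 'queen', 'rabbit']
```

Without the key argument, sorted would order the words alphabetically instead.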
Part 8: Removing stop words
View the video Removing Stop Words. In this part, we write a function to filter out very common English words, and another function to filter out stop words like “the”, “I”, “and”. By first removing all these from the list of words in the book, our ranked list of frequent words will be more useful.
(Completed code up to this point.)
- Click here to access the webpage with 1000 common words.
- Click here to access the webpage with stop words.
Questions for discussion
What is the difference between the two new functions remove_common(words) and remove_stop_words(words) we created in this video?
A: The remove_common(words) function looks for the occurrences of 1000 well-known words within the full list of words in the book, removing any that it sees. The remove_stop_words(words) function does the same thing, but it only filters out a smaller selection of words like “if”, “and”, “the”.
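A sketch of the filtering idea, using a tiny invented stop list in place of the full list from the linked webpage:

```python
STOP_WORDS = ['if', 'and', 'the', 'i', 'a', 'of', 'to']   # tiny sample only

def remove_stop_words(words):
    """Return a new list with all stop words filtered out."""
    return [w for w in words if w.lower() not in STOP_WORDS]

print(remove_stop_words(['down', 'the', 'rabbit', 'hole']))
# ['down', 'rabbit', 'hole']
```

A remove_common(words) function would follow the same shape, checking each word against the much larger list of 1000 common words instead.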
Part 9: Heroes and villains
View the video Heroes and Villains 1. By now, we have identified the likely main characters in the book. In this part, we use sentiment analysis to guess whether each character is a hero or villain.
(Completed code up to this point.)
Questions for discussion
Given a character’s name, exactly how does our new function try to guess whether that character is a hero or a villain?
Do you think this is an effective way to determine whether a character is a hero or villain?
Now view the final video Heroes and Villains 2. In this video, we try a couple of other books, and we tweak the behaviour of the function for judging the main characters.
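One plausible shape for such a function (a sketch only, not the video's exact code) is to average the polarity of every sentence that mentions the character. Here a toy polarity scorer with an invented mini-lexicon stands in for TextBlob so the example runs on its own:

```python
def character_sentiment(name, sentences, polarity_of):
    """Average the polarity of every sentence that mentions the character."""
    scores = [polarity_of(s) for s in sentences if name.lower() in s.lower()]
    if not scores:
        return 0.0   # character never mentioned: treat as neutral
    return sum(scores) / len(scores)

# toy stand-in for TextBlob's sentence polarity, for illustration only
def toy_polarity(sentence):
    lexicon = {'curious': 0.5, 'lovely': 0.8, 'furious': -0.8, 'cruel': -0.9}
    words = sentence.lower().split()
    return sum(lexicon.get(w, 0.0) for w in words) / len(words) if words else 0.0

sentences = ["Alice grew curious", "The Queen was furious and cruel"]
print(character_sentiment("Alice", sentences, toy_polarity))   # positive
print(character_sentiment("Queen", sentences, toy_polarity))   # negative
```

A character whose mentions average above some threshold might be judged a hero, and below it a villain; choosing and tweaking that threshold is exactly what the second video explores.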
Projects / Assessment
Initially, students may wish to try reusing the program already written in this learning sequence:
- choose a different book that can be obtained in Plain Text format,
- try to identify the main characters,
- try to determine which of the main characters are heroes or villains,
- by judging the polarity of each chapter, try to determine if the story has a happy or sad ending.
Sometimes it can be hard to tell if articles from newspapers and other online sources are reporting pieces or opinion pieces.
Write a fresh program that uses sentiment analysis to determine if an article is a reporting piece or an opinion piece, based on the subjectivity of its language. We might expect opinion pieces to have higher subjectivity.
You will need to:
- find at least 3 reporting pieces and 3 opinion pieces, ideally from the same news source,
- obtain or convert each article into Plain Text format (eg. using copy-paste),
- use the TextBlob library to analyse the subjectivity of the article,
- make a decision whether the article is reporting or opinion.
Note, you may wish to reuse modules or functions written in this learning sequence, but your main program must be freshly written, with appropriate comments.
Design and implement a research project to test a limit of the sentiment analysis approach used in this learning sequence. eg. How well does it respond to different conversational styles?
Write a new tool for analysing dialogue in a movie or play script.
The tool will be able to separate each paragraph, ignore stage/screen directions and notes, and connect each bit of dialogue with a character from a limited cast.
From there, characters can be compared based on volume of dialogue and sentiment analysis of dialogue.
Resources
- Coding
- Python cheat sheet (from Grok Learning)
- Another Python cheat sheet that focuses on string functions (ways to manipulate text)
- Visual to text coding series of lessons with videos and exercises to help you and your class transition from visual coding (eg Scratch) to general purpose programming (eg Python and JavaScript)
- Natural Language Processing theory
- Natural language processing on Wikipedia
- An article on part-of-speech tagging.