In preparation for my webinar about advanced vocabulary words in the movie Glass Onion: A Knives Out Mystery, I seized the opportunity to practice writing a little Python script.

The idea was to collect high-frequency words from major exams, including TOEFL, SAT, GRE, and GMAT. I gathered various lists from test prep sites and forums, cleaned the data, and compiled the words into one CSV file containing 800 words. I also obtained the script of the movie from the web.
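The cleaning step itself is not shown in this post, so the following is only a minimal sketch of how several lists could be merged into one file. The helper names (merge_word_lists, write_word_csv) and the sample words are hypothetical. It uses the standard csv module rather than pandas to stay dependency-free, and writes a "Word" header because that is the column the script further down reads.

```python
import csv
import io


def merge_word_lists(*word_lists):
    """Normalize, de-duplicate, and sort words from several lists."""
    merged = set()
    for words in word_lists:
        for word in words:
            cleaned = word.strip().lower()  # trim whitespace, unify case
            if cleaned:  # skip blank entries
                merged.add(cleaned)
    return sorted(merged)


def write_word_csv(words, file_obj):
    """Write one word per row under a 'Word' header."""
    writer = csv.writer(file_obj)
    writer.writerow(["Word"])
    for word in words:
        writer.writerow([word])


# Hypothetical sample lists standing in for the real exam lists
toefl_words = ["Abate", "candid ", ""]
gre_words = ["candid", "dogmatic"]

combined = merge_word_lists(toefl_words, gre_words)

# An in-memory buffer stands in for a real file such as wordlist.csv
buffer = io.StringIO()
write_word_csv(combined, buffer)
print(buffer.getvalue())
```

To write a real file instead of the in-memory buffer, the same function can be given a file opened with open("wordlist.csv", "w", newline="").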

I wrote a function, extract_context(script_path, freq_words_path), that scans the script for words that appear on the high-frequency list.

Initially, the results contained very simple words because the program was also accepting partial matches. For instance, “dog” was pulled from the script several times because it partially matched “dogmatic.”

To resolve this problem, I wrapped each word in regex word boundaries (\b) so that only exact matches count, and the results were much better.
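The earlier partial-matching code is not shown here, so the snippet below is only an illustration of the general difference between a substring test and a whole-word test. The function names and sample sentences are made up for the example.

```python
import re


def partial_match(word, sentence):
    # Naive substring test: true whenever the letters of `word`
    # appear anywhere in the sentence, even inside a longer word.
    return word.lower() in sentence.lower()


def exact_match(word, sentence):
    # \b marks a word boundary, so `word` must stand alone as a whole word.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


# Invented sentences, not lines from the movie
print(partial_match("art", "The artist painted a portrait."))  # True ("artist")
print(exact_match("art", "The artist painted a portrait."))    # False
print(exact_match("art", "Art is subjective."))                # True
```

re.escape is there so that a list entry containing punctuation is treated as literal text rather than as regex syntax.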

In the end, I still went through the whole script again to manually pick out other useful words. It would be useful to compile several other collections of words that I can use to scan movie scripts in addition to the current list of 800 words.

import pandas as pd
import spacy  # splits the script into sentences
import re  # enforces exact (whole-word) matches

# Load spaCy's small English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")

def extract_context(script_path, freq_words_path):
    # Read movie script
    with open(script_path, 'r', encoding='utf-8') as f:
        script = f.read()

    # Read high frequency words
    freq_words_df = pd.read_csv(freq_words_path)
    freq_words = freq_words_df['Word'].tolist()

    # Tokenize script into sentences
    doc = nlp(script)
    sentences = [sent.text for sent in doc.sents]

    # Prepare results list
    results = []

    # Scan sentences for each word
    for word in freq_words:
        # Compile the pattern once per word. \b enforces whole-word matches,
        # and IGNORECASE also catches capitalized forms (e.g. at the start of a sentence).
        pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
        for sentence in sentences:
            if pattern.search(sentence):
                results.append((word, sentence))
                print(f"Matched: {word} in Sentence: {sentence}")  # Print the matched word and sentence


    # Convert results to dataframe
    df = pd.DataFrame(results, columns=['Word', 'Contextual Sentence'])

    return df

df = extract_context("Glass Onion Transcript.txt", "wordlist.csv")
df.to_csv('filename.csv', index=False)
