
Date: Fall 2021
View the code associated with this project here
This code groups PDF page abstractions (images of each PDF's first page, with black boxes marking where text appears) using agglomerative hierarchical clustering. The work is part of the Tools for Tracking Police Misconduct Data project: grouping PDFs with similar formats means a single information-extraction method can be written for each type of PDF. This was my first project to use artificial intelligence techniques, and it required me to approach text analysis from a different perspective.
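To give a feel for the approach, here is a minimal sketch of agglomerative clustering on toy page abstractions. It is not the project's actual code. It assumes each abstraction has been reduced to a small binary grid (1 where a text box is, 0 elsewhere), and it uses a Hamming-style distance, average linkage, and a distance threshold as the stopping rule. All of those choices, along with the names `hamming_distance`, `agglomerative_clusters`, and the toy grids, are illustrative assumptions.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Fraction of cells that differ between two equal-sized binary grids."""
    flat_a = [cell for row in a for cell in row]
    flat_b = [cell for row in b for cell in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("grids must have the same shape")
    return sum(x != y for x, y in zip(flat_a, flat_b)) / len(flat_a)


def agglomerative_clusters(grids, threshold):
    """Average-linkage agglomerative clustering (illustrative sketch).

    Every grid starts in its own cluster. The two closest clusters are
    merged repeatedly until the closest pair is farther apart than
    `threshold`. Returns a sorted list of clusters of grid indices.
    """
    clusters = [[i] for i in range(len(grids))]
    # Precompute pairwise distances between individual grids.
    dist = {
        (i, j): hamming_distance(grids[i], grids[j])
        for i, j in combinations(range(len(grids)), 2)
    }

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs.
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), best = min(
            (
                ((a, b), linkage(clusters[a], clusters[b]))
                for a, b in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best > threshold:
            break  # remaining clusters are too dissimilar to merge
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Toy 4x4 abstractions: two "full-width rows" layouts, two "two-column" layouts.
form_a1 = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
form_a2 = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 0]]
form_b1 = [[1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 0, 1]]
form_b2 = [[1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 0, 1], [0, 0, 0, 1]]

pages = [form_a1, form_b1, form_a2, form_b2]
print(agglomerative_clusters(pages, threshold=0.25))  # [[0, 2], [1, 3]]
```

With the threshold at 0.25, the two layouts that differ by a single cell end up together, and the two layout families stay separate. In practice the threshold (or a target number of clusters) would need tuning against real page images.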