Categories
Other

Using Agglomerative Clustering to Group PDFs by Format

Date: Fall 2021

View the code associated with this project here

This code groups PDF page abstractions (images of each PDF's first page, with black boxes drawn where the text appears) using agglomerative hierarchical clustering. The work ties into the Tools for Tracking Police Misconduct Data project: once PDFs with similar formats are grouped together, a single information extraction method can be written for each type of PDF. This is the first project I have undertaken that uses artificial intelligence techniques, and it required me to approach text analysis from a different perspective.
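To make the idea concrete, here is a minimal sketch of agglomerative clustering over page abstractions. It is not the project's actual code. It assumes each abstraction has been reduced to a small binary grid (1 where a text box covers the cell, 0 elsewhere), measures distance as the fraction of cells that differ, and uses average linkage with a distance threshold. The function names, grid size, and threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Fraction of cells that differ between two equal-sized binary grids."""
    flat_a = [cell for row in a for cell in row]
    flat_b = [cell for row in b for cell in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("grids must have the same shape")
    return sum(x != y for x, y in zip(flat_a, flat_b)) / len(flat_a)


def agglomerative_clusters(grids, threshold):
    """Average-linkage agglomerative clustering (illustrative sketch).

    Starts with one cluster per grid and repeatedly merges the two closest
    clusters until the closest pair is farther apart than `threshold`.
    Returns a sorted list of clusters, each a sorted list of indices.
    """
    # Precompute pairwise distances between individual pages.
    dist = {
        (i, j): hamming_distance(grids[i], grids[j])
        for i, j in combinations(range(len(grids)), 2)
    }

    def linkage(c1, c2):
        # Average distance over all cross-cluster page pairs.
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(grids))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[a], clusters[b]) > threshold:
            break  # remaining clusters are too dissimilar to merge
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical 4x4 abstractions: two "letter"-style and two "form"-style pages.
letter_a = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
letter_b = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 0]]
form_a = [[1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0]]
form_b = [[1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0]]

pages = [letter_a, form_a, letter_b, form_b]
print(agglomerative_clusters(pages, threshold=0.25))  # [[0, 2], [1, 3]]
```

In this toy example the two letter-style pages (indices 0 and 2) end up in one cluster and the two form-style pages (indices 1 and 3) in another, so each cluster could be handed to its own extractor. A real pipeline would more likely rely on a library implementation and a distance measure tuned to the page images.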

Categories
Other

Tools for Tracking Police Misconduct Data

Date: Summer 2021 – Spring 2022

Partner(s): Pragya Kallanagoudar

My work on this project is part of a much larger effort to help the National Association of Criminal Defense Lawyers and several journalistic organizations create a database that tracks police misconduct. I worked on identifying police misconduct cases by querying a larger database of legal cases, extracting relevant information from PDF versions of case files, and creating human-in-the-loop tools for verifying my program's output. This is the second large-scale academic research project I have been a part of, and my first experience building specialized tools for an audience that gives me regular feedback.
