Before any data extraction can take place, authentication must happen. As it turns out, the easiest way to authenticate is to copy all of the cookies the browser is already sending and attach them to each request to the server.
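Here's a minimal sketch of that cookie-copying step, assuming the requests library; the cookie names and values below are placeholders, not the real ones (those come straight out of the browser's dev tools for applyweb.com).

```python
# Sketch: reuse the browser's session cookies in a requests.Session.
# Cookie names/values are placeholders; copy the real set from the browser.
import requests

session = requests.Session()

browser_cookies = {
    "JSESSIONID": "<value copied from browser>",       # placeholder
    "SomeOtherCookie": "<value copied from browser>",   # placeholder
}
for name, value in browser_cookies.items():
    session.cookies.set(name, value, domain="www.applyweb.com")

# With the cookies in place, report URLs (format shown below) return 200
# instead of redirecting to the login page.
```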
First, all Trace evaluation pages are collected by reverse engineering Trace's API. These are stored locally in a CSV file to be used later. Scraped URLs that contain tables of student-submitted comments take the form https://www.applyweb.com/eval/new/showreport?c={COURSE_ID}&i={INSTRUCTOR_ID}&t={TERM_ID}&r=9&d=true. I'm unsure what the "r" and "d" parameters mean, but I've found that messing around with them throws a 500 Internal Server Error.
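Roughly, the CSV step looks like the sketch below. The field names on the reverse-engineered API response (courseId, instructorId, termId) are assumptions for illustration.

```python
# Sketch: build report URLs from the reverse-engineered evaluation list
# and write them to a CSV for the scraper to consume later.
import csv

REPORT_URL = (
    "https://www.applyweb.com/eval/new/showreport"
    "?c={course_id}&i={instructor_id}&t={term_id}&r=9&d=true"
)

def write_report_urls(evaluations, path="report_urls.csv"):
    """evaluations: iterable of dicts with courseId / instructorId / termId keys (assumed)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["course_id", "instructor_id", "term_id", "url"])
        for ev in evaluations:
            url = REPORT_URL.format(
                course_id=ev["courseId"],
                instructor_id=ev["instructorId"],
                term_id=ev["termId"],
            )
            writer.writerow([ev["courseId"], ev["instructorId"], ev["termId"], url])
```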
Next, each URL is scraped, and the comments are extracted into a MongoDB collection. Since there isn't an API to get comments, I had to use Beautiful Soup to extract the comments from an HTML table. Note that due to the number of courses with ratings, I decided it would be less tedious to stick to the Spring 2024 semester (id: 181).
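The per-page scrape looks roughly like this; the table selector and the MongoDB database/collection names are assumptions, since the real markup and schema may differ.

```python
# Sketch: pull student comments out of the report's HTML table and store
# them in MongoDB, one document per (course, instructor) report.
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
comments_coll = client["trace"]["comments"]  # assumed names

def scrape_report(session, url, course_id, instructor_id):
    html = session.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    # Assume each student comment sits in its own table row.
    rows = soup.select("table tr")
    comments = [row.get_text(strip=True) for row in rows if row.get_text(strip=True)]
    comments_coll.insert_one({
        "course_id": course_id,
        "instructor_id": instructor_id,
        "term_id": 181,  # Spring 2024
        "comments": comments,
    })
```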
Now all comments are stored in MongoDB, organized by class and professor. Next, I aggregate them into a single collection, where the comments are grouped by professor.
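An aggregation like the one below would do this grouping; the field and collection names match the sketch above and are assumptions rather than the exact production pipeline.

```python
# Sketch: flatten every report's comment array and regroup by professor,
# writing the result to a new collection.
pipeline = [
    {"$unwind": "$comments"},
    {"$group": {
        "_id": "$instructor_id",
        "comments": {"$push": "$comments"},
    }},
    {"$out": "comments_by_professor"},
]
client["trace"]["comments"].aggregate(pipeline)
```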
Finally, I build a Flask API to serve data from the MongoDB database. Alongside the search and retrieval routes, there is a route that condenses the arrays of comments into a five-word summary, powered by Google Gemini's AI. All API routes are cached behind Varnish Cache to prevent server overload and to avoid too-many-requests errors from the Gemini API.
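A stripped-down version of the summary route might look like this. The route path, collection names, and Gemini model name are assumptions, not the exact production code; the Cache-Control header is what lets Varnish absorb repeat requests before they ever reach Mongo or Gemini.

```python
# Sketch: Flask route that summarizes a professor's comments via Gemini.
from flask import Flask, jsonify
from pymongo import MongoClient
import google.generativeai as genai

app = Flask(__name__)
coll = MongoClient("mongodb://localhost:27017")["trace"]["comments_by_professor"]
genai.configure(api_key="<GEMINI_API_KEY>")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

@app.route("/professor/<prof_id>/summary")
def summarize(prof_id):
    doc = coll.find_one({"_id": prof_id})
    if doc is None:
        return jsonify({"error": "professor not found"}), 404
    prompt = (
        "Summarize these student comments in exactly five words:\n"
        + "\n".join(doc["comments"])
    )
    summary = model.generate_content(prompt).text.strip()
    # Varnish sits in front of this route; the Cache-Control header lets
    # repeated requests skip both MongoDB and the Gemini API.
    return jsonify({"summary": summary}), 200, {"Cache-Control": "public, max-age=86400"}

if __name__ == "__main__":
    app.run()
```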