Before any data extraction can take place, authentication must happen. As it turns out, the easiest way to authenticate is to copy all of the cookies the browser is already sending and attach them to each request to the server.
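Here's a minimal sketch of that cookie-copying step, assuming the requests library; the cookie names and values below are placeholders, not the real ones (those come straight out of the browser's dev tools for applyweb.com).

```python
# Sketch: reuse the browser's session cookies in a requests.Session.
# Cookie names/values are placeholders; copy the real set from the browser.
import requests

session = requests.Session()

browser_cookies = {
    "JSESSIONID": "<value copied from browser>",       # placeholder
    "SomeOtherCookie": "<value copied from browser>",   # placeholder
}
for name, value in browser_cookies.items():
    session.cookies.set(name, value, domain="www.applyweb.com")

# With the cookies in place, report URLs (format shown below) return 200
# instead of redirecting to the login page.
```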
First, all Trace evaluation pages are collected by reverse engineering Trace's API. These are stored locally in a CSV file to be used later. Scraped URLs that contain tables of student-submitted comments take the form https://www.applyweb.com/eval/new/showreport?c={COURSE_ID}&i={INSTRUCTOR_ID}&t={TERM_ID}&r=9&d=true. I'm unsure what the "r" and "d" parameters mean, but I've found that messing around with them throws a 500 Internal Server Error.
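Roughly, the CSV step looks like the sketch below. The field names on the reverse-engineered API response (courseId, instructorId, termId) are assumptions for illustration.

```python
# Sketch: build report URLs from the reverse-engineered evaluation list
# and write them to a CSV for the scraper to consume later.
import csv

REPORT_URL = (
    "https://www.applyweb.com/eval/new/showreport"
    "?c={course_id}&i={instructor_id}&t={term_id}&r=9&d=true"
)

def write_report_urls(evaluations, path="report_urls.csv"):
    """evaluations: iterable of dicts with courseId / instructorId / termId keys (assumed)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["course_id", "instructor_id", "term_id", "url"])
        for ev in evaluations:
            url = REPORT_URL.format(
                course_id=ev["courseId"],
                instructor_id=ev["instructorId"],
                term_id=ev["termId"],
            )
            writer.writerow([ev["courseId"], ev["instructorId"], ev["termId"], url])
```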
Next, each URL is scraped, and the comments are extracted into a MongoDB collection. Since there isn't an API to get comments, I had to use Beautiful Soup to extract the comments from an HTML table. Note that due to the number of courses with ratings, I decided it would be less tedious to stick to the Spring 2024 semester (id: 181).
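The per-page scrape looks roughly like this; the table selector and the MongoDB database/collection names are assumptions, since the real markup and schema may differ.

```python
# Sketch: pull student comments out of the report's HTML table and store
# them in MongoDB, one document per (course, instructor) report.
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
comments_coll = client["trace"]["comments"]  # assumed names

def scrape_report(session, url, course_id, instructor_id):
    html = session.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    # Assume each student comment sits in its own table row.
    rows = soup.select("table tr")
    comments = [row.get_text(strip=True) for row in rows if row.get_text(strip=True)]
    comments_coll.insert_one({
        "course_id": course_id,
        "instructor_id": instructor_id,
        "term_id": 181,  # Spring 2024
        "comments": comments,
    })
```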
Now all comments are stored in MongoDB, organized by class and professor. Next, I aggregate them into a single collection, where the comments are grouped by professor.
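An aggregation like the one below would do this grouping; the field and collection names match the sketch above and are assumptions rather than the exact production pipeline.

```python
# Sketch: flatten every report's comment array and regroup by professor,
# writing the result to a new collection.
pipeline = [
    {"$unwind": "$comments"},
    {"$group": {
        "_id": "$instructor_id",
        "comments": {"$push": "$comments"},
    }},
    {"$out": "comments_by_professor"},
]
client["trace"]["comments"].aggregate(pipeline)
```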
Finally, I build a Flask API to serve data from the MongoDB database. Alongside the search and retrieval routes, there is a route that condenses the arrays of comments into a five-word summary, powered by Google Gemini's AI. All API routes are cached behind Varnish Cache to prevent server overload and to avoid too-many-requests errors from the Gemini API.
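A stripped-down version of the summary route might look like this. The route path, collection names, and Gemini model name are assumptions, not the exact production code; the Cache-Control header is what lets Varnish absorb repeat requests before they ever reach Mongo or Gemini.

```python
# Sketch: Flask route that summarizes a professor's comments via Gemini.
from flask import Flask, jsonify
from pymongo import MongoClient
import google.generativeai as genai

app = Flask(__name__)
coll = MongoClient("mongodb://localhost:27017")["trace"]["comments_by_professor"]
genai.configure(api_key="<GEMINI_API_KEY>")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

@app.route("/professor/<prof_id>/summary")
def summarize(prof_id):
    doc = coll.find_one({"_id": prof_id})
    if doc is None:
        return jsonify({"error": "professor not found"}), 404
    prompt = (
        "Summarize these student comments in exactly five words:\n"
        + "\n".join(doc["comments"])
    )
    summary = model.generate_content(prompt).text.strip()
    # Varnish sits in front of this route; the Cache-Control header lets
    # repeated requests skip both MongoDB and the Gemini API.
    return jsonify({"summary": summary}), 200, {"Cache-Control": "public, max-age=86400"}

if __name__ == "__main__":
    app.run()
```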