Creating A Job Posting Search Engine Using OpenAI Embeddings

I recently worked on a job posting search engine and wanted to share how I approached it and some findings.

Motivation

I had a data set of job postings and wanted to provide a way to find jobs using natural language queries. So a user could say something like "job posting for remote Ruby on Rails engineer at a startup that values diversity" and the search engine would return relevant job postings.

This would enable users to search for jobs without having to know which filters exist. For example, to find remote jobs you would typically have to check a "remote" box, but just saying "remote" in your query is much easier. You could also query for more abstract attributes like "has good work/life balance", the kind of thing a site like Key Values catalogs.

Approach

We could potentially use something like Elasticsearch or build our own rules-based job search engine, but I wanted to see how well embeddings would work. Embedding models are typically trained on internet-scale data, so they might capture nuances of job postings that would be difficult for us to model by hand.

When you embed a string of text, you get a vector that represents the meaning of the text. You can then compare the embeddings of two strings to see how similar they are. So my approach was to first get embeddings for a set of job postings. This could be done once per posting. Then, when a user enters a query, I would embed the user's query and find the job posting vectors that were closest using cosine similarity.
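
The full post has the details, but here is a minimal sketch of that flow. It assumes the ruby-openai gem and the text-embedding-ada-002 model; the postings collection (items responding to description) and the query variable are stand-ins for illustration:

```ruby
require "openai" # ruby-openai gem

client = OpenAI::Client.new(access_token: ENV["OPENAI_API_KEY"])

# Embed a string and return its vector (an array of floats).
def embed(client, text)
  response = client.embeddings(
    parameters: { model: "text-embedding-ada-002", input: text }
  )
  response.dig("data", 0, "embedding")
end

# Cosine similarity: dot product divided by the product of magnitudes.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# `postings` (job records) and `query` (user input) are assumed to
# exist in scope. One-time step: embed each posting and store it.
posting_vectors = postings.map { |p| [p, embed(client, p.description)] }

# Query time: embed the query and take the closest postings.
query_vector = embed(client, query)
top_matches = posting_vectors
  .map { |p, vec| [p, cosine_similarity(query_vector, vec)] }
  .max_by(10) { |_, score| score } # top 10, highest similarity first
```

With the posting vectors precomputed, each search costs one embedding call plus an in-memory scan, which is fine for a modest corpus.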

Read on →

Using a Redlock Mutex to Avoid Duplicate Requests

I somewhat recently ran into an issue where our system was incorrectly creating duplicate records. Here's a writeup of how we found and fixed it.

Creating duplicate records

After reading through the request logs, I saw that we were receiving intermittent duplicate requests from a third-party vendor (an applicant tracking system) for certain webhook events. We already checked whether a record existed in the database before creating it, but that check didn't prevent the problem. Looking closer, I saw that the duplicate requests arrived in very short succession (less than 100 milliseconds apart) and could be handled by different processes, so both processes could pass the existence check before either had written the record.
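
The title gives away the fix: take a distributed lock (Redlock, backed by Redis) around webhook handling so only one process can work on a given event at a time. A minimal sketch using the Ruby redlock gem, where the lock key, the 2-second TTL, and the event_id and Record names are assumptions for illustration:

```ruby
require "redlock"

lock_manager = Redlock::Client.new([ENV.fetch("REDIS_URL", "redis://localhost:6379")])

# Key the lock on whatever uniquely identifies the webhook event;
# event_id is an assumed field from the vendor's payload.
lock_manager.lock("webhook-event:#{event_id}", 2_000) do |locked|
  if locked
    # Only one process gets past this point at a time, so the
    # existence check and the create can no longer race.
    Record.find_or_create_by!(external_id: event_id)
  else
    # A sibling process holds the lock: treat this request as a duplicate.
  end
end
```

The TTL should comfortably exceed the handler's worst-case runtime so the lock doesn't expire while the first request is still being processed.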

Read on →

Using iTerm Automatic Profile Switching to Make Fewer Mistakes In Production

Today I will tell you some stories of how I made mistakes in our production environment, and how I am trying to prevent similar mistakes using iTerm.

Mistakes were made

At work we are mid-journey toward more automation around our deployments, provisioning, backups, monitoring, and so forth. But at the moment, some things are still done manually. Within recent memory, I was SSHed into our QA (staging) box and for some reason wanted to rename the database. A few minutes later, someone came down and said "production's down!" (Production is the end-user-visible environment, the one thing that we don't want to be down.) I was thinking, "hmm, we haven't changed anything recently… wait, was I actually on the QA box?" Sure enough, what I had renamed was the production database in the production environment! A minute later service was restored, but this was our longest daytime outage of the quarter (a handful of minutes).

As part of our postmortem on this issue, we decided it would be useful for me to switch terminal profiles whenever I expected to be in a production-like environment. For example, before SSHing into a QA box, I could switch to a profile with a different background color. This would help disambiguate the two environments.

The other day after hours, I was switching back and forth between the QA and production SSH environments to debug a problem on the QA side. I again thought I had SSHed into the QA environment, but I hadn't read my SSH command carefully enough while cycling between those environments (Ctrl+R in the terminal recalls previous commands). I turned off the production load balancer. Fortunately it was after hours, so I could easily revert it, but I needed a better solution.

Enough is enough

There are two problems with the profile-switching approach: I need to remember to switch profiles when I SSH, and the environment I SSH into needs to actually match the profile. Both steps are error-prone enough that I don't think manual profile switching is workable long-term. Again, in a perfect world we would have everything automated, with all changes made through well-tested or peer-reviewed means. But there has to be a stopgap solution.
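
The rest of the post covers iTerm's automatic profile switching, which takes the human out of the loop. As an illustration of the stopgap idea, though: iTerm2 has a documented SetProfile escape sequence, so even a small wrapper around ssh could pick the profile from the destination rather than relying on memory. A hypothetical sketch (the profile names and the "prod" hostname convention are assumptions, and this is not the approach from the post):

```ruby
#!/usr/bin/env ruby
# Hypothetical ssh wrapper: pick an iTerm2 profile based on the
# destination host, using iTerm2's documented SetProfile escape code.
host = ARGV.first.to_s
profile = host.include?("prod") ? "Production" : "Default"

print "\e]1337;SetProfile=#{profile}\a" # switch profiles before connecting
system("ssh", *ARGV)
print "\e]1337;SetProfile=Default\a"    # switch back when the session ends
```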

Read on →

Squashing Intermittent Tests With ntimes

Today I want to share a tool that I have found indispensable for finding and fixing intermittent tests in test suites. It's a little script I wrote, called ntimes.

Based on the commit logs in my dotfiles repository, until about 2014 my way of running the same command many times was to press up in my terminal and hit enter. While effective, this approach requires me to be present at the machine doing manual work. I thought: there must be a better way.

So, probably by cribbing from somewhere and adding my own extensions, I made a script that can run an arbitrary command multiple times and report a summary at the end. To use it, I would run something like the following.
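
The exact invocation is in the full post, but the shape of the tool is easy to sketch. Here is a minimal Ruby version of the idea, with a hypothetical RSpec file standing in for the flaky test:

```ruby
#!/usr/bin/env ruby
# Minimal sketch of an ntimes-style runner (not the original script).
# Usage (hypothetical): ntimes 100 rspec spec/models/user_spec.rb
n = Integer(ARGV.shift)
command = ARGV

failures = 0
n.times do
  ok = system(*command) # run the command, streaming its output
  failures += 1 unless ok
end

puts "#{n - failures} succeeded, #{failures} failed out of #{n} runs"
exit(failures.zero? ? 0 : 1)
```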

Read on →