Setting the foundations for a job title classification system.
Truework has data on millions of positions held by employees in companies ranging from small startups to large corporations. With data on companies in sectors such as HR & Staffing, Manufacturing, and Tech, we are in a unique position with real ground-truth data from a variety of sources. Arguably the most important piece of information about a position is its job title, as it is often shared with verifiers while processing a loan or mortgage application.
One of the principal problems we face in this space is dirty data. The vast majority of our data is human-entered text, which carries a variety of inherent issues: typos and misspellings, obscure abbreviations, and sometimes outright gibberish. Examples can be found in the sample of raw titles shown below. We wanted to use a combination of NLP techniques and data standardization to improve the quality of these job titles without destroying the input. This lets us build internal tools to improve quality assurance, speed up integrations, and better understand our customers.
To begin to tackle a problem such as Job Title Classification, we first looked to see how others attempted to solve similar problems. Online recruitment portals such as LinkedIn Jobs and Careerbuilder have established their own classification systems and have published excellent literature on the topic. The US government maintains the publicly available Standard Occupation Classification (SOC) System, which we found to be inadequate for our needs. So we chose to implement our own.
Any classification system requires four broad stages of development, each of which can be broken into many smaller steps depending on the exact domain your classification system resides within.
This process is highly cyclical. Each stage must be iterated on to achieve satisfactory results, and learnings in each stage can require revisiting prior stages. Below, we will examine how we approached each of these here at Truework.
Here are a few real job titles found in our data and how we eventually mapped them. They highlight some of the interesting features we see in raw job titles.
| Raw Title | Standardized Title(s) |
| --- | --- |
| Sr Sales Representative - SE Chester | Senior Sales Representative |
| Product Cont Strategist | Product Content Strategist |
| CX Analyst II | Customer Experience Analyst II |
| EVP & Chief Products Officer | Executive Vice President, Chief Product Officer |
| Inspector, Trust & Safety | Trust and Safety Inspector |
A job title can be represented as an n-gram of words. Each word in the n-gram contributes to one or more of the following facets:
The example above demonstrates how these facets provide structure to a title.
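To make the facet idea concrete, here is a minimal sketch of assigning each word of a title to a facet. The seniority and function lexicons below are tiny illustrative stand-ins, not our actual controlled vocabulary:

```python
# Placeholder lexicons for two of the facets; anything unmatched is
# treated as part of the responsibility.
SENIORITY = {"junior", "senior", "lead", "principal", "ii", "iii"}
FUNCTION = {"sales", "engineering", "marketing", "finance"}

def facet_split(title: str) -> dict:
    """Assign each word of a title n-gram to a facet."""
    facets = {"seniority": [], "function": [], "responsibility": []}
    for word in title.lower().split():
        if word in SENIORITY:
            facets["seniority"].append(word)
        elif word in FUNCTION:
            facets["function"].append(word)
        else:
            facets["responsibility"].append(word)
    return facets

print(facet_split("Senior Sales Representative"))
# {'seniority': ['senior'], 'function': ['sales'], 'responsibility': ['representative']}
```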
It is also important to consider what a job title is not. We believe that a job title must have a clear responsibility. The simplest job titles, such as "Manager" or "Engineer", consist of only a responsibility term. In the table above, title 6 is one we have decided is a "non-title": while it includes a function (Banking or Financial Service), it does not contain a responsibility. We believe that a high-quality job title classification system must reject n-grams that have been deemed non-titles.
These n-grams also vary in length; the longest in our dataset had 14 words. This has strong implications for the feasibility of mapping unique n-grams to their respective classes: with a finite vocabulary and bounded length, the number of unique representations remains tractable.
Of course, in n-grams this short, individual words carry a great deal of signal. However, title 1, for example, showcases words that are irrelevant to the task of classifying job titles. Some common yet unimportant terms were geographical, company-specific, or simply too vague. As such, we needed to develop a controlled vocabulary to ensure our classifiers are given relevant information.
Users expect to give our system an input job title and receive a standardized job title with an assortment of metadata. We are using this system internally for purposes of standardization, which opens the doors to many possibilities in quality control, integrations, and further research. Outlined below are the goals we set for ourselves at the start of the project.
1. Our system must preserve both the semantics and syntax of the raw input.
Many existing classification systems prioritize the semantics of the job title. For example, the SOC system would classify an "Account Executive" into one of the several categories representing Technical or Non-Technical Sales Representatives and Agents. We wanted a system where "Account Executive" maps to a new class because of the title's syntactic differences, but where this new class remains close to the class representing "Sales Representative," since the two are semantically similar.
2. Our system must be a comprehensive and accurate classifier.
This goal is common to other job title classification systems. We have seen tens of thousands of unique job titles, and a classification system that cannot classify a significant share of them cannot be successful. Further, it matters how accurate these classifications are, so we needed to build a dataset of labeled job titles to measure this.
3. Our system must be robust.
Since we are dealing with human-entered text, we need to be robust to misspellings of accepted terms. We need to be able to coerce, with confidence, unclean raw input into clean standardizations. At the same time, we must not be too aggressive in this coercion, and we must still reject titles we deem to be non-titles.
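As one illustration of this kind of coercion, the standard library's fuzzy matching can pull a misspelled term toward an accepted vocabulary. The vocabulary and cutoff below are placeholders, not our production values:

```python
import difflib
from typing import Optional

# Tiny stand-in for an accepted-term vocabulary.
VOCAB = ["senior", "sales", "representative", "engineer", "manager", "analyst"]

def correct_term(word: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest accepted term, or None if nothing is close enough."""
    matches = difflib.get_close_matches(word.lower(), VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(correct_term("reprsentative"))  # 'representative'
print(correct_term("xyzzy"))         # None
```

Tuning the cutoff controls the robustness/aggressiveness trade-off described above: too low and gibberish gets coerced into real terms, too high and genuine typos are rejected.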
4. Our system must be easily maintained and developer-friendly.
We could not afford to build a classification system that will need to be rebuilt in a year's time. Pipelines were needed to continually update the underlying classes with human oversight and allow data science to experiment with new methods of classification.
Goal 2 begins to hit on the metrics we chose to measure the success of this classification system. Our first priority is coverage: the percentage of unique raw titles that receive a successful standardization. The frequency of titles in our dataset is heavily skewed; to provide a successful classification for 99% of all job title occurrences, we only needed classifications for the top 7,000 unique titles. However, we felt this was much too limited in scope and would not generalize well to populations we had not yet seen, so our coverage goal was 80% of all unique titles with a frequency of at least three. This narrowed the number of titles to classify from millions to tens of thousands.
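The coverage metric itself is straightforward to compute. The sketch below assumes a hypothetical `classify` callable that returns a standardized title or `None`:

```python
from collections import Counter

def coverage(raw_titles, classify, min_freq=3):
    """Share of unique titles (frequency >= min_freq) that classify successfully."""
    counts = Counter(raw_titles)
    eligible = [title for title, count in counts.items() if count >= min_freq]
    if not eligible:
        return 0.0
    covered = sum(1 for title in eligible if classify(title) is not None)
    return covered / len(eligible)

# Toy check: one title appears three times and classifies; one rare title is ignored.
titles = ["engineer"] * 3 + ["gibberish"]
print(coverage(titles, lambda t: t if t == "engineer" else None))  # 1.0
```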
After achieving coverage, the system must also be accurate. To measure this, we needed to generate a dataset of n-grams mapped to standardized titles. Labeling this dataset can be performed either by a small internal team or through one of the several crowdsourcing solutions that exist.
Controlled Vocabularies are commonly used in the information sciences as well as computational linguistics. A controlled vocabulary is simply a set of permitted words, used in this setting to filter non-essential terminology out of the n-gram. To build ours, we took our dataset of unique n-grams, counted word frequencies in descending order, removed words appearing fewer than three times, and then determined whether each remaining word was relevant to the classification system. After this procedure, we ended up with ~2000 terms in our controlled vocabulary. Included below are word clouds showing the most common terms in our controlled vocabulary grouped by facet.
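The frequency-based construction can be sketched as follows. The toy titles are invented for the example, and the manual relevance review that follows the frequency cut is not shown:

```python
from collections import Counter

def build_vocabulary(titles, min_freq=3):
    """Keep words that appear at least min_freq times across all raw titles."""
    counts = Counter(word for title in titles for word in title.lower().split())
    return {word for word, count in counts.items() if count >= min_freq}

def filter_title(title, vocab):
    """Drop words outside the controlled vocabulary."""
    return " ".join(word for word in title.lower().split() if word in vocab)

titles = ["Sales Rep - SE Chester", "Sales Rep", "Sales Manager", "Senior Sales Rep"]
vocab = build_vocabulary(titles)
# 'sales' occurs 4x and 'rep' 3x, so they survive; geographic noise like
# 'chester' appears once and is filtered out.
print(filter_title("Sr Sales Rep - SE Chester", vocab))  # 'sales rep'
```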
We chose to implement a broad and shallow hierarchical structure for our classification system. In our structure, we have Functions, Title Families, and Titles. A Title Family belongs to one or more Functions and has one or more Titles belonging to it. This nearly flat structure allows us to capture a wide variety of syntax in job titles. For example, the Title "Senior Account Executive, Inbound" falls in the "Account Executive" Title Family, which in turn, belongs to the "Sales" Function.
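As a sketch, the hierarchy can be modeled directly; the class names mirror the terminology above, but the wiring is illustrative rather than our actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TitleFamily:
    """A Title Family links one or more Functions to one or more Titles."""
    name: str
    functions: list                      # names of the Functions it belongs to
    titles: list = field(default_factory=list)

account_executive = TitleFamily(
    name="Account Executive",
    functions=["Sales"],
    titles=["Account Executive", "Senior Account Executive, Inbound"],
)

# A Title resolves first to its family, then to the family's function(s).
print(account_executive.functions[0])  # Sales
```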
In order to measure semantic similarity between Titles in different Title Families, we can look at the 3 facets of each Title. While "Junior Account Executive" and "Entry Level Sales Representative" will have different responsibilities, they will have similar seniority and equal functions. This allows us to retain a notion of semantic similarity in our representation.
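One way to operationalize this comparison is a weighted facet match. The facet values and weights below are illustrative assumptions, not our production scoring:

```python
def facet_similarity(a: dict, b: dict, weights=(1, 2, 1)) -> int:
    """Weighted agreement across the seniority, responsibility, and function facets."""
    keys = ("seniority", "responsibility", "function")
    return sum(w for key, w in zip(keys, weights) if a.get(key) == b.get(key))

junior_ae = {"seniority": "junior", "responsibility": "account executive", "function": "sales"}
entry_sales = {"seniority": "junior", "responsibility": "sales representative", "function": "sales"}

# Equal seniority and function, different responsibility -> partial similarity.
print(facet_similarity(junior_ae, entry_sales))  # 2 of a possible 4
```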
The review of the classification system relies heavily on having frequent discussions about the classification structure with key stakeholders. At this point, stakeholders may give feedback on how the system handles their input and whether it satisfies the aforementioned use cases. We reached this stage and revisited the planning and drafting stages numerous times before being happy with our system. Once we reached a consensus, we began testing more rigorously.
It is at this point that we check how we stack up against our metrics for success. At the time of publishing this article, we have achieved 84% coverage of our unique title base, meaning our classification system determines the unique title to be mappable to a standardized title. We then tested our system with the dataset of labeled n-grams. We started by using simple cosine distance to classify input n-grams into their standardized titles, with accuracy as our performance metric. With this methodology, we have achieved 88% accuracy on a testing set extracted from our dataset of labeled n-grams.
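The cosine baseline can be sketched with a simple bag-of-words representation. The standardized titles listed here are stand-ins for our real classes:

```python
import math
from collections import Counter

STANDARD_TITLES = ["Senior Sales Representative", "Customer Experience Analyst"]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(raw: str) -> str:
    """Return the standardized title whose bag of words is closest to the input."""
    vec = Counter(raw.lower().split())
    return max(STANDARD_TITLES, key=lambda t: cosine(vec, Counter(t.lower().split())))

print(classify("senior sales representative ii"))  # Senior Sales Representative
```

In practice the input would first pass through the controlled-vocabulary filter, so stray words like "ii" or geographic terms carry less weight.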
We believe there are currently two main areas where we can improve the most. First, our classifier will only be as good as the data used to train it, and we are looking into ways to improve labeling quality. Secondly, we can develop better representations of short text to feed our classifiers. Research using algorithms such as AverageWord2Vec, Word Mover's Distance, and Word Centroid Distance has improved the ability to capture similarities between short texts. We are actively testing and incorporating these algorithms in our work.
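For instance, the AverageWord2Vec idea represents a short text as the mean of its word vectors. A real system would load pretrained embeddings; the 2-D vectors below are invented purely for illustration:

```python
# Made-up 2-D "embeddings" standing in for real pretrained word vectors.
EMBED = {
    "sales":    [1.0, 0.0],
    "account":  [0.5, 0.5],
    "engineer": [0.0, 1.0],
}

def average_vector(text: str):
    """Mean of the word vectors for in-vocabulary words; None if none match."""
    vecs = [EMBED[w] for w in text.lower().split() if w in EMBED]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

print(average_vector("account sales"))  # [0.75, 0.25]
```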
One of the most important parts of building a classification system is designing for the future. We track every title sent to our standardization API and work to incorporate new n-grams on a weekly basis, either into currently existing classes or into new classes altogether. This requires revisiting our controlled vocabulary, our various methods of handling non-alphanumeric characters, and our underlying classes on a regular basis to keep our system operating at its maximum potential. We are investigating methods of automatically clustering short text to aid our specialists in deciding whether unseen n-grams should be classified into existing classes or new ones as our classification system quickly expands in size and complexity. Creating the tools and procedures to keep pace with this rapidly evolving system is pivotal to the success of the project, as with many projects in startup environments.
Here at Truework, we are inspired by the sheer quality and volume of our ground-truth data. Advances in standardization are only one of the ways Data Science at Truework enables the rest of the company to thrive and improve. Our classification system allows us to maintain the semantic and syntactic relationships between the job titles one can find in the world all around them. Standardization opens doors to better quality control of outbound reports, faster integrations, and a better understanding of our users and customers. This is only the beginning, and this project will continue to develop and evolve as we, and the problem itself, do as well. You can find documentation for our title standardization endpoint here.