Extracting skills from Resume without OCR!!!

Resume parsing is a method of structuring the content of a resume. A resume parser is a program that scans the document, analyzes it, and extracts information such as skills, work experience, contact information, achievements, education, certifications, and professional specializations. Today we will focus on extracting skills.

Most resume parsers available today are based on OCR or NLP techniques. OCR techniques require a lot of training data, and NLP techniques still can't guarantee 100% accuracy, so this remains a developing field with many open challenges. Now let's talk about what we are going to implement today.

Simple Excel-sheet-based approach

We will take a very simple approach: matching n-gram words in the text against the skills listed in an Excel sheet. Let's divide this into two sections: 1) extracting text from the PDF and 2) extracting skills from the text. Then we'll walk through the code and its explanation. As a sample, we will use the resume of Juan Jose Carin; the PDF link is available at the end of the article.
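The core idea can be sketched in a few lines: slide an n-gram window over the words of the text and keep every window that appears in the skill list. This toy function is illustrative only (it is not the article's final implementation, which adds spaCy tokenization and text cleaning):

```python
def match_ngrams(text, skills, max_n=3):
    """Return every 1- to max_n-gram of `text` that appears in `skills`."""
    words = text.lower().split()
    found = set()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text
        for i in range(len(words) - n + 1):
            gram = ' '.join(words[i:i + n])
            if gram in skills:
                found.add(gram)
    return found

skills = {'python', 'machine learning', 'sql'}
match_ngrams('Built machine learning pipelines in Python and SQL', skills)
# → {'python', 'machine learning', 'sql'} (a set, so order may vary)
```

The Excel sheet simply supplies the `skills` collection; everything else is plain string matching.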

Installing libraries and their dependencies

# run these commands in terminal 
pip install pdfminer.six nltk spacy==2.3.5 pyresparser
python -m spacy download en_core_web_sm

# for google colab or jupyter notebook
!pip install pdfminer.six nltk spacy==2.3.5 pyresparser
!python -m spacy download en_core_web_sm

import numpy as np
import pandas as pd
import spacy
import os
import re

Extracting text from pdf

We will be seeing two approaches for reading text from pdf that are defined in the documentation of the pdfminer library.

# Approach one
import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def extract_text_from_pdf(pdf_path):
    """Yield the text of each page of the PDF, one page at a time."""
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(
                resource_manager,
                fake_file_handle,
                laparams=LAParams()
            )
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)
            yield fake_file_handle.getvalue()
            converter.close()
            fake_file_handle.close()

# Concatenate the pages into a single string
text = ''
for page in extract_text_from_pdf('/content/resume_juanjosecarin.pdf'):
    text += ' ' + page
# Approach two
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('/content/resume_juanjosecarin.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

# Retrieve the extracted text from the in-memory buffer
text = output_string.getvalue()

Both approaches extract the text well, so you can use either one.
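Whichever approach you use, PDF extraction typically leaves stray newlines and repeated spaces in the text. A quick normalization pass (illustrative, not part of the article's code) can make the later skill matching more reliable:

```python
import re

def normalize(text):
    """Collapse all runs of whitespace (including newlines) to single spaces."""
    return re.sub(r'\s+', ' ', text).strip()

normalize('Data  Scientist\nPython,  SQL')
# → 'Data Scientist Python, SQL'
```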

Extracting skills from text

To extract skills from the text, I have written a custom n-gram matching function, shown below.

def extract_skills(resume_text):
    nlp = spacy.load('en_core_web_sm')
    nlp_text = nlp(resume_text)

    # The skill sheet stores one skill per column header; lowercase them
    # so that matching is case-insensitive
    data = pd.read_csv("/content/skill_list.csv")
    skills = [s.lower() for s in data.columns.values]
    skillset = []

    # 1) Whole sentences that are themselves a skill (e.g. list bullets)
    for sent in nlp_text.sents:
        a = re.sub(r'\n', '', str(sent))
        if a.lower() in skills:
            skillset.append(a)

    # 2) Single-token (1-gram) matches
    for token in nlp_text:
        if token.text.lower() in skills:
            skillset.append(token.text)

    # 3) Noun chunks, to catch multi-word skills such as "machine learning"
    for chunk in nlp_text.noun_chunks:
        if chunk.text.lower() in skills:
            skillset.append(chunk.text)

    # 4) Compare each word against the skills with spaces and hyphens
    #    stripped, so a multi-word or hyphenated skill in the sheet still
    #    matches its single-word spelling in the text
    for sent in nlp_text.sents:
        words = re.sub(r'\n', '', str(sent)).split(' ')
        for j in words:
            for k in skills:
                k = k.replace(" ", "").replace("-", "")
                if j.lower() == k:
                    skillset.append(k)

    # Deduplicate case-insensitively and return
    return [i.capitalize() for i in set(i.lower() for i in skillset)]

First, we load en_core_web_sm, which according to spaCy is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax, and entities. The next step is to read the contents of the Excel sheet, i.e. the skills we want to extract, into a list. We then loop over the sentences of the text (with some cleaning) to catch sentences that are themselves a skill, check the individual tokens for 1-gram matches, and use spaCy's noun chunks to catch multi-word skills. The final pass handles n-gram skills written as a single word: for each word in each sentence, we compare it against every skill with its spaces and hyphens stripped, so for example "scikit-learn" in the sheet would match "scikitlearn" in the text.
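The article never shows the format of skill_list.csv, but from the way extract_skills reads it (list(data.columns.values)), the sheet stores one skill per column header, with no data rows needed. A minimal sketch of building and reading such a sheet (skill names here are illustrative):

```python
import io
import pandas as pd

# One skill per column header; no data rows required
skills = ['Python', 'SQL', 'Machine Learning', 'Tableau']
buf = io.StringIO()  # stands in for skill_list.csv on disk
pd.DataFrame(columns=skills).to_csv(buf, index=False)

# Reading it back the same way extract_skills does
buf.seek(0)
cols = list(pd.read_csv(buf).columns.values)
# cols == ['Python', 'SQL', 'Machine Learning', 'Tableau']
```

To add or remove skills, you only edit the sheet; the extraction code never changes.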

Conclusion

We can see that the accuracy is good, and if one doesn't wish to use heavy models, this approach, with some modifications, can produce great results. Link to the sample resume and code.