NLP/NLU Series: Clustering LinkedIn Profiles

Overview:

This notebook shows how I built a simple model that relies on Sentence Transformers (BERT) to extract rich text features. The model itself is deliberately simple, since it's based on K-means, but there is plenty of room to make it more sophisticated. The algorithm here is a solid starting point for the task of grouping similar LinkedIn profiles together.

Here’s a brief rundown of the algorithm:

  1. Extract BERT embeddings for sentences or textual data.
  2. Concatenate them into a single vector per profile.
  3. Use t-SNE to find the number of dimensions that best explains the data.
  4. Reduce dimensionality using PCA.
  5. Find the optimal number of K-means clusters using the distortion metric.
  6. Fit the reduced data with the optimal number of clusters from the step above.
  7. Project the extracted clusters back onto the original data.
from utils import *
from modeling_utils import *
import re
from datetime import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from yellowbrick.cluster import KElbowVisualizer

import warnings
warnings.filterwarnings('ignore')
Utils, helper functions for visualizations:
def plot_missing_data(df):

    missing_data = df.isnull().sum() / len(df) * 100
    missing_data = missing_data[missing_data != 0]
    missing_data.sort_values(ascending=False, inplace=True)

    plt.figure(figsize=(6, 6))
    sns.barplot(y=missing_data.index, x=missing_data)
    plt.title('Percentage of Missing Data by Feature')
    plt.xlabel('Percentage Missing (%)')
    plt.ylabel('Features')
    plt.show()

def visualize_normalized_histogram(df, column, top_n=100, figsize=(6, 20)):

    value_counts = df[column].value_counts().nlargest(top_n)
    value_counts_normalized = (value_counts / len(df) * 100).sort_values(
        ascending=True)  

    # tab20 has only 20 distinct colors, so cycle through them
    colors = plt.get_cmap('tab20')(np.arange(top_n) % 20)[::-1]

    plt.figure(figsize=figsize)
    plt.barh(value_counts_normalized.index, value_counts_normalized.values, color=colors)
    plt.ylabel(column)
    plt.xlabel('Percentage')
    plt.title(f'Normalized Value Counts Histogram of {column} (Top {top_n})')
    plt.xticks(rotation=0)
    plt.show()

def visualize_top_15_category_histogram(data,
                                        category_column,
                                        cluster_column,
                                        top,
                                        title,
                                        width,
                                        height):
                                        
    top_n_categories = data[category_column].value_counts().nlargest(top).index.tolist()
    filtered_data = data[data[category_column].isin(top_n_categories)]

    fig, ax = plt.subplots(
        figsize=(width / 80, height / 80)) 
    sns.histplot(data=filtered_data,
                      x=category_column,
                      hue=cluster_column, 
                      multiple="stack",
                      ax=ax)

    ax.set_title(title)
    for label in ax.get_xticklabels():
        label.set_rotation(90)
        label.set_fontsize(10)
    plt.show()


def get_latest_dates(df):
    # rows with date_to == 0 are ongoing positions; sort them by date_from
    df['sort_key'] = np.where(df['date_to'] == 0, df['date_from'], df['date_to'])
    df = df.sort_values(['member_id', 'sort_key'], ascending=[True, False])
    latest_dates = df.groupby('member_id').first().reset_index()
    latest_dates = latest_dates.drop(columns=['sort_key'])

    return latest_dates
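A quick sanity check of the sort-key logic on a toy frame (my own illustrative data, not project data; rows with date_to == 0 are ongoing positions and sort by their start year):

toy = pd.DataFrame({
    "member_id": [1, 1, 2],
    "date_from": [2018, 2021, 2015],
    "date_to":   [2020, 0,    2019],
})
# member 1's ongoing 2021 role outranks the finished 2018-2020 one
get_latest_dates(toy)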
Utils, helper functions for preprocessing:
def transform_experience_dates(experience):
    def transform_date_format(date_value):
        try:
            if isinstance(date_value, int) or date_value.isdigit():
                return str(date_value)  # Return the integer or numeric string as is
            else:
                date_string = str(date_value)
                date_object = datetime.strptime(date_string, "%b-%y")
                return date_object.strftime("%Y-%m")  # Format with year and month only
        except (ValueError, AttributeError):
            # AttributeError covers non-string, non-int values such as NaN floats
            return None

    def extract_year(value):
        if isinstance(value, str):
            pattern = r'\b(\d{4})\b'  # Regular expression pattern to match a four-digit year
            match = re.search(pattern, value)
            if match:
                return str(match.group(1))
        return None

    experience['transformed_date_from'] = experience['date_from'].apply(transform_date_format)
    experience['transformed_date_to'] = experience['date_to'].apply(transform_date_format)

    experience.loc[experience['transformed_date_from'].isnull(), 'transformed_date_from'] = experience.loc[
        experience['transformed_date_from'].isnull(), 'date_from'].apply(extract_year)
    experience.loc[experience['transformed_date_to'].isnull(), 'transformed_date_to'] = experience.loc[
        experience['transformed_date_to'].isnull(), 'date_to'].apply(extract_year)

    experience['transformed_date_from'] = experience['transformed_date_from'].str.replace(r'-\d{2}$', '', regex=True)
    experience['transformed_date_to'] = experience['transformed_date_to'].str.replace(r'-\d{2}$', '', regex=True)

    return experience
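A toy example of what the transformation yields (hypothetical values; "%b-%y" strings reduce to years, numeric placeholders pass through unchanged, and free text falls back to the four-digit-year regex):

toy = pd.DataFrame({"date_from": ["Mar-19", 0], "date_to": ["Jun-21", "circa 2008"]})
# expected: transformed_date_from -> ["2019", "0"], transformed_date_to -> ["2021", "2008"]
transform_experience_dates(toy)[["transformed_date_from", "transformed_date_to"]]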
Utils, helper functions for modeling:
def find_optimal_dimensions_tsne(data, perplexity_range):
    dims = []
    scores = []

    # Barnes-Hut t-SNE supports at most three output dimensions
    max_dim = min(3, data.shape[1] - 1)

    for dim in range(1, max_dim + 1):
        if dim > len(perplexity_range):
            break

        # use the perplexity supplied for this output dimensionality
        tsne = TSNE(n_components=dim, perplexity=perplexity_range[dim - 1])
        tsne.fit_transform(data)

        dims.append(dim)
        scores.append(tsne.kl_divergence_)

    # Plot the KL divergence scores
    plt.plot(dims, scores, marker='o')
    plt.xlabel('Number of dimensions')
    plt.ylabel('KL Divergence Score')
    plt.title('t-SNE: KL Divergence')
    plt.show()

    optimal_dim_index = scores.index(min(scores))
    optimal_dimensions = dims[optimal_dim_index]

    return optimal_dimensions


def reduce_dimensionality_with_pca(data, components):
    pca = PCA(n_components=components)
    reduced_data = pca.fit_transform(data)
    return reduced_data
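Before committing to a component count, it can be worth checking how much variance the retained components keep. A standalone sketch on random stand-in data (not the project embeddings):

check = PCA(n_components=3).fit(np.random.rand(200, 10))
# fraction of total variance captured by the 3 retained components
print(check.explained_variance_ratio_.sum())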


def fit_kmeans_and_evaluate(data,
                            n_clusters=4,
                            n_init=100,
                            max_iter=400,
                            init='k-means++', 
                            random_state=42):
    data_copy = data.copy()

    kmeans_model = KMeans(n_clusters=n_clusters,
                          n_init=n_init,
                          max_iter=max_iter,
                          init=init,
                          random_state=random_state)
                          
    kmeans_model.fit(data_copy)

    silhouette = silhouette_score(data_copy, kmeans_model.labels_, metric='euclidean')
    print('KMeans Scaled Silhouette Score: {}'.format(silhouette))

    labels = kmeans_model.labels_
    clusters = pd.concat([data_copy, pd.DataFrame({'cluster_scaled': labels})], axis=1)

    return clusters

Basic Employee Features:

basic_features = pd.read_csv("Clean Data/basic_features.csv")
basic_features['member_id'] = basic_features['member_id'].astype(str)
basic_features.replace("none", np.nan, inplace=True)

Employees Education:

education = pd.read_csv("Clean Data/employees_education_cleaned.csv")
# transform member_id to string for ease of use
education["member_id"] = education["member_id"].astype(str)
education = education[education["member_id"].isin(basic_features["member_id"])]
plot_missing_data(education)

[Figure: percentage of missing data by feature in the education table]

education.drop(["activities_and_societies","description"],axis=1,inplace=True)
education[["date_from","date_to"]] = education[["date_from","date_to"]].fillna(0)
education[["date_from","date_to"]] = education[["date_from","date_to"]].astype(int)
education[["title","subtitle"]] = education[["title","subtitle"]].fillna("none")
# get the latest employee education obtained per member_id:
latest_education = get_latest_dates(education)
visualize_normalized_histogram(latest_education, 'title', top_n=100)

[Figure: normalized value counts of latest education titles (top 100)]

latest_education_drop_nan = latest_education.copy()
latest_education_drop_nan = latest_education_drop_nan[latest_education_drop_nan["subtitle"] != 'none']
# "none" values are a problem, as they might influence the segmentation algorithm:
visualize_normalized_histogram(latest_education_drop_nan, 'subtitle', top_n=100)

[Figure: normalized value counts of latest education subtitles, "none" removed (top 100)]

Employees Experience:

experience  = pd.read_csv("Clean Data/employees_experience_cleaned.csv")
# transform member_id to string for ease of use
experience["member_id"] = experience["member_id"].astype(str)
experience = experience[experience["member_id"].isin(basic_features["member_id"])]
plot_missing_data(experience)

[Figure: percentage of missing data by feature in the experience table]

experience.drop(["description","location","Years","Months","duration","company_id"],
                 axis=1,inplace=True,
                 errors="ignore")
                 
experience[["date_from","date_to"]] = experience[["date_from","date_to"]].fillna(0)
experience["title"] = experience["title"].fillna("none")
experience.drop_duplicates(inplace=True)
experience = transform_experience_dates(experience)
experience = experience[["member_id","title","transformed_date_from","transformed_date_to"]]
experience.rename(columns={'transformed_date_from': 'date_from',
                           'transformed_date_to': 'date_to'},inplace=True)
visualize_normalized_histogram(experience[experience["title"]!="none"], 'title', top_n=100)

[Figure: normalized value counts of experience titles (top 100)]

latest_experience = get_latest_dates(experience)
visualize_normalized_histogram(latest_experience, 'title', top_n=120)

[Figure: normalized value counts of latest experience titles (top 120)]

Basic Features:

basic_features.isnull().sum()
member_id                   0
title                      75
location                    2
industry                 1175
summary                  1399
recommendations_count       0
country                     0
connections_count           0
experience_count            0
latitude                    0
longitude                   0
months experience        1460
number of positions      1460
number of degrees         660
years of educations       660
dtype: int64
Remove columns with a high percentage of missing values, as they affect the clustering results:
plot_missing_data(basic_features)

[Figure: percentage of missing data by feature in basic_features]

basic_features["industry"] = basic_features["industry"].fillna("other")
basic_features["title"] = basic_features["title"].fillna("other")
basic_features["location"] = basic_features["location"].fillna("unknown")
basic_features[["number of degrees","years of educations"]] = basic_features[
                                                              ["number of degrees",
                                                              "years of educations"]
                                                              ].fillna("0")
basic_features.drop(["months experience",
                     "number of positions",
                     "summary"],
                      axis=1,
                      inplace=True)
visualize_normalized_histogram(basic_features[basic_features["title"]!="other"],
                                              "title",
                                               top_n=120)

[Figure: normalized value counts of member titles (top 120)]

visualize_normalized_histogram(basic_features[basic_features["industry"]!="other"],"industry",top_n=120)

[Figure: normalized value counts of industries (top 120)]

visualize_normalized_histogram(basic_features,"location",top_n=120)

[Figure: normalized value counts of locations (top 120)]

Merge employees basic features, the latest education and experience:
latest_experience.drop(["date_from","date_to"],axis=1,inplace=True,errors="ignore")
latest_experience.rename(columns={"title":"experience_title"}, inplace=True)
latest_experience.head(10)

    member_id   experience_title
0  1000769811   Sales Channel Advisor
1  1001027856   Product Development Assistant
2  1001731893   VP RD
3  1002107022   Digital Marketing Ecommerce Consultant
4  1002900696   Branch Manager
5  1003503234   Founding Shareholder
6  1004617047   RRH
7  1004931912   Senior Product Development Specialist
8  1005561303   Sales consultant
9   100559115   Marketing B2B Specialist
latest_education.drop(["date_from","date_to"],axis=1,inplace=True,errors="ignore")
latest_education.rename(columns={"title":"education_title","subtitle":"education_subtitle"}, inplace=True)
latest_education.head(10)

    member_id   education_title                          education_subtitle
0  1000769811   University of California Santa Barbara   BA Business Economics Philosophy Double Major
1  1001027856   British Academy of Interior Design       Postgraduate Diploma Interior Design
2  1001731893   The Academic College of TelAviv Yaffo    Computer Science
3  1002107022   Epping forest college                    2 A levels Computer studies
4  1002900696   bridge road adult education              OCN Psychology criminal Psychology Psychosocia...
5  1003503234   Lancaster University                     BSc Hons in Management
6  1004617047   Université Paris II Assas                Master II Droit et pratique des relations du t...
7  1004931912   University of Michigan                   Post doc Radiopharmaceutical Chemistry in Nucl...
8  1005561303   Rother Valley College                    Btec diploma in business finance business pas...
9   100559115   CONMEBOL                                 Certified Sports Managment
overall_features  = basic_features.merge(latest_education, on='member_id', how='outer').merge(latest_experience, on='member_id', how='outer')
Additional preprocessing:
string_columns = ["education_title",
                  "country",
                  "industry",
                  "location",
                  "title",
                  "education_subtitle",
                  "experience_title"]
overall_features[string_columns] = overall_features[string_columns].fillna('none')

numerical_cols = ["experience_count",
                  "connections_count",
                  "years of educations",
                  "number of degrees",
                  "recommendations_count",
                  "longitude",
                  "latitude"]
overall_features[numerical_cols] = overall_features[numerical_cols].fillna(0)
overall_features.isnull().sum()
member_id                0
title                    0
location                 0
industry                 0
recommendations_count    0
country                  0
connections_count        0
experience_count         0
latitude                 0
longitude                0
number of degrees        0
years of educations      0
education_title          0
education_subtitle       0
experience_title         0
dtype: int64
visualize_none_percentages(overall_features)

[Figure: percentage of "none" values per feature in overall_features]

Drop rows that contain “none” or “other”, as they affect the performance of clustering:
overall_features = overall_features[~overall_features.apply(lambda row: row.astype(str).str.contains('none')).any(axis=1)]
overall_features = overall_features[~overall_features.apply(lambda row: row.astype(str).str.contains('other')).any(axis=1)]
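One caveat with str.contains: it also matches substrings, so a legitimate value that merely contains "none" or "other" (e.g. "nonexecutive") would be dropped too. An exact-match variant avoids that; a small sketch (exact_mask and overall_features_exact are illustrative names):

# flag rows where any cell equals a sentinel value exactly:
sentinels = {"none", "other"}
exact_mask = overall_features.astype(str).isin(sentinels).any(axis=1)
overall_features_exact = overall_features[~exact_mask]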
overall_features[["title","industry","location","country","education_title","education_subtitle","experience_title"]].head(5).style.background_gradient()
   | title | industry | location | country | education_title | education_subtitle | experience_title
4  | I Help Professionals Make Career Business Breakthroughs Coaching Consulting Personal Branding Resumes LinkedIn Profile Thought Leadership Development | Information Technology & Services | Dallas, Texas, United States | United States | Harvard University | Bachelor of Arts BA Computer Science Focus on Artificial Technology Machine Learning and Education Techology | SVP Customer Success
6  | Sr Research Engineer at BAMF Health | Medical Device | Grand Rapids, Michigan, United States | United States | Grand Valley State University | Master of Science Engineering Biomedical Engineering | Image Processing Research Engineer
7  | Head of New Business at Cube Online | Events Services | Sydney, New South Wales, Australia | Australia | University of the West of England | BA Hons Business Studies | APAC Hunter Manager
8  | Veneer Sales Manager at Mundy Veneer Limited | Furniture | Taunton, England, United Kingdom | United Kingdom | Northumbria University | Masters Degree MSc Hons Business with Management | Project Coordinator
10 | ML Genomics Datadriven biology | Computer Software | Cambridge, Massachusetts, United States | United States | Technical University of Munich | Doctor of Philosophy PhD Computational Biology | Computational Biologist
Extract Sentence Embeddings:
### extract high-dimensional embeddings using Sentence Transformers (BERT):
title_embeddings = get_embeddings(overall_features,'title')
industry_embeddings = get_embeddings(overall_features,'industry')
location_embeddings = get_embeddings(overall_features,'location')
country_embeddings = get_embeddings(overall_features,'country')
education_title = get_embeddings(overall_features,'education_title')
education_subtitle = get_embeddings(overall_features,'education_subtitle')
experience_title = get_embeddings(overall_features,'experience_title')
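get_embeddings is imported from utils; a minimal version might look like the sketch below (assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint, which are my assumptions rather than what utils necessarily uses):

from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def get_embeddings(df, column):
    # encode each cell of the column into a fixed-size sentence embedding
    return _model.encode(df[column].astype(str).tolist())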
Merge with simple features:
merged_embeddings = np.concatenate((
    title_embeddings,
    industry_embeddings,
    location_embeddings,
    country_embeddings,
    education_title,
    education_subtitle,
    experience_title
), axis=1)

additional_numerical_features = overall_features[[
    'recommendations_count',
    'connections_count',
    'experience_count',
    'latitude',
    'longitude',
    'member_id'
]].values

simple_features = basic_features[['recommendations_count',
                                  'connections_count',
                                  'experience_count',
                                  'latitude',
                                  'longitude']]

final_data = np.concatenate((merged_embeddings, additional_numerical_features), axis=1)
final_data = pd.DataFrame(final_data)


# keep an ordered list of member_ids to use later for explanations:
members_ids = final_data.iloc[:, -1].tolist()

# drop member_id, as it's not used in modeling the data and would
# lead to misleading results:
final_data = final_data.drop(final_data.columns[-1], axis=1)
Find Optimal Number of Components:
find_optimal_dimensions_tsne(merged_embeddings, [5,10,15,20,25,30,35,40,45,50])

[Figure: t-SNE KL divergence score versus number of dimensions]

Reduce Embedding Dimensionality:
reduced_merged_embeddings = reduce_dimensionality_with_pca(merged_embeddings,3)
reduced_merged_embeddings = pd.DataFrame(reduced_merged_embeddings)
reduced_merged_embeddings

              0         1         2
0     -0.309824 -0.357302  0.296566
1     -0.611905 -0.678734 -0.044584
2      0.103592  0.197899  0.296579
3      0.664791 -0.113497  0.114943
4     -0.656240 -0.552984  0.348519
...         ...       ...       ...
1140   0.779930 -0.071203  0.002138
1141   0.859704 -0.109584 -0.098612
1142  -0.502135  0.747044  0.863029
1143   0.561427 -0.077284 -0.185717
1144  -0.542131  0.734136 -0.657090

1145 rows × 3 columns

Scale Simple Numerical Features:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(simple_features)
scaled_data = pd.DataFrame(scaled_data)
Merge with Reduced Embedding Vectors:
all_features = pd.concat([reduced_merged_embeddings, scaled_data], axis=1)
Find the optimal number of clusters K using the elbow of the distortion score:
model = KMeans(n_init=10)
visualizer = KElbowVisualizer(model, k=(2,10))
visualizer.fit(reduced_merged_embeddings)
visualizer.show()

[Figure: distortion score elbow for K-means clustering]

<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
merged_embeddings_clusters = fit_kmeans_and_evaluate(reduced_merged_embeddings,
                                                     4,
                                                     n_init=100,
                                                     max_iter=100000,
                                                     init='k-means++',
                                                     random_state=412)
KMeans Scaled Silhouette Score: 0.5201332569122314
# rename extracted clusters:
merged_embeddings_clusters = merged_embeddings_clusters.rename(columns={0:"component 1",
                                                                        1:"component 2",
                                                                        2:"component 3"})
merged_embeddings_clusters["member_id"] = members_ids
overall_features["member_id"] = overall_features["member_id"].astype(int)
merged_embeddings_clusters["member_id"] = merged_embeddings_clusters["member_id"].astype(int)
overall_results = overall_features.merge(merged_embeddings_clusters,on='member_id')
overall_results["cluster_scaled_string"] = overall_results["cluster_scaled"].astype(str)
visualize_top_15_category_histogram(overall_results,
                                    category_column="industry",
                                    cluster_column="cluster_scaled_string",
                                    top=80,
                                    title="Clusters Labels Relative to Industry",
                                    width=1000,
                                    height=800
                                    )

[Figure: cluster labels relative to industry]

visualize_top_15_category_histogram(overall_results,
                                    category_column="location",
                                    cluster_column="cluster_scaled_string",
                                    top=50,
                                    title="Cluster Labels Relative to Location",
                                    width=1000,
                                    height=800
                                    )

[Figure: cluster labels relative to location]

visualize_top_15_category_histogram(overall_results,
                                    category_column="title",
                                    cluster_column="cluster_scaled_string",
                                    top=30,
                                    title="Cluster Labels Relative to title",
                                    width=1000,
                                    height=800
                                    )

[Figure: cluster labels relative to title]

visualize_top_15_category_histogram(overall_results,
                                    category_column="country",
                                    cluster_column="cluster_scaled_string",
                                    top=30,
                                    title="Cluster Labels Relative to Country",
                                    width=1000,
                                    height=800
                                    )

[Figure: cluster labels relative to country]

visualize_top_15_category_histogram(overall_results,
                                    category_column="education_title",
                                    cluster_column="cluster_scaled_string",
                                    top=30,
                                    title="Cluster Labels Relative to Country",
                                    width=1000,
                                    height=800
                                    )

[Figure: cluster labels relative to education title]

visualize_top_15_category_histogram(overall_results,
                                    category_column="education_subtitle",
                                    cluster_column="cluster_scaled_string",
                                    top=30,
                                    title="Cluster Labels Relative to Education",
                                    width=1000,
                                    height=800
                                    )

[Figure: cluster labels relative to education subtitle]

visualize_top_15_category_histogram(overall_results,
                                    category_column="experience_title",
                                    cluster_column="cluster_scaled_string",
                                    top=100,
                                    title="Cluster Labels Relative to Latest Experience Title",
                                    width=1200,
                                    height=800
                                    )

[Figure: cluster labels relative to latest experience title]

overall_results.to_csv("Clean Data/overall_results.csv",index=False)

Observations:

  • Concatenating the large, high-dimensional text features worked well with K-means. With a distortion score of 152 and a silhouette score of 0.5201, this is a reasonable starting point.
  • The clusters could likely be improved by incorporating more information, such as recommendations, education, and job history.
  • The model is fairly good at grouping employees from similar industries; cluster 2, for example, groups retail and furniture profiles together.
  • Geographically, the model also performs well: it groups locations from the same country, such as Israel with its districts and cities in cluster 0, and US cities in cluster 1. This kind of structure suggests the data would suit hierarchy-based models like HDBSCAN (see the sketch after this list).
  • One caveat: the model tends to lump missing values and the “none”/“other” categories together, mostly because they look alike once transformed into vectors. To avoid this, I left them out.
  • Keep in mind that K-means is simple and can be thrown off by outliers; a model that is more robust to outliers would be a worthwhile next step.
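As a concrete starting point for the last two points, here is a sketch of running HDBSCAN on the same reduced embeddings (assuming the hdbscan package; min_cluster_size is illustrative):

import hdbscan

# density-based and hierarchy-aware; outliers get the label -1 instead of
# being forced into a cluster, which addresses the K-means caveat above
clusterer = hdbscan.HDBSCAN(min_cluster_size=25)
hdbscan_labels = clusterer.fit_predict(reduced_merged_embeddings)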