Intelligent career planning via stochastic subsampling reinforcement learning
15 min readThis section discusses the reward function in SSRL and provides theoretical details of the search algorithm for the optimal career path. We summarize important notations and their definitions in the table shown in the supplementary materials.
Reward function
Now, we discuss the formulation of our reward function with three major components including company rating, periodic extent of suffering, and staying probability.
Company Rating. The company rating is formulated as a linear combination of company featurebased score and position featurebased score.
First, we introduce the company featurebased score, which is used to evaluate the overall “quality” of a company. The following four factors are considered.

(1)
Reputation: The reputation of a company is considered a representative factor of its business value and social impression, hence will significantly affect the job stability of employees. By investigating a sample containing 593 large publicly traded companies from 32 countries, Soleimani et al.^{45} found a positive impact of company reputation on its stock performance and the employee salary. Similar findings can also be found in^{46}. It has been found that people are more willing to work in companies with high reputation^{47}. In this paper, we quantify three levels of company reputation. For the first (highest) level, Fortune500 companies are rated with the highest reputation, and their reputation score is set to 1. Non Fortune500 companies with more than 10,000 employees are placed to the second level, with a reputation score (frac23). For the rest of the companies, we assign (frac13) as the reputation score. We also conducted experiments to evaluate the stability of the overall performance based on linear reputation. Related results are included in the supplementary materials (Fig. S.2).

(2)
Popularity: Popularity represents the overall social impact of a company. Based on experiments on an artificial market, Salganik et al.^{48} suggested that popularity is positively related to the quality of a company, and people are more willing to work in popular companies. Company popularity can be easily quantified based on talent move record. The frequency of talent incoming transfer indicates the popularity of a company. In our work, we normalize the total number of incoming transferring records to values in [0, 1], by dividing each by the maximum of the records.

(3)
Average Staying Duration: Employee stability is essential to business success^{49}, and the average staying duration of a company also represents the overall job satisfaction of its employees^{50}. We compute the average staying duration of employees for each company and then normalize them to [0, 1] based on the maximum value.

(4)
Smooth Transfer Rate: The smooth transfer rate measures how likely a jobhopping can be made. Given the dynamic market, a smooth job hopping indicates less risk hence is preferred by most job seekers^{6}. Considering the labor market shifts toward information and knowledge based work, talented workers are the intangible asset to a company^{37}. From the perspective of employers, to keep the competitive advantage, companies should accept suitable employees as soon as possible. To measure the smooth transfer rate, we have the following settings. For those companies with more than 100 transfer records, we calculate it as the ratio of the number of transfer records without waiting time (no time gap between the old and the new jobs) over the total amount of job transfers. For companies with less than 100 transfer records, which occupy only 3% of the full sample, we introduce a penalty term (0.8) to weaken the smooth transfer rate, as the number may be overestimated due to a small sample size.
Importantly, considering the fast changing labor market, company ratings may vary over time^{5,6}. The requirement from users is also changing over time. Assuming that (t’) denotes the time people start to work at the company (C_i), based on the above four features indicated by (f_l) where l is the feature index from 1 to 4, the company featurebased score of company (C_i) is defined as:
$$beginaligned N_C_i_CF^t’=100cdot fracsum _l=1^4 theta _l^t’ f_l,C_i^t’min _j in mathscrC sum _l=1^4 theta _l^t’ f_l,j^t’max _j in mathscrC sum _l=1^4 theta _l^t’ f_l,j^t’min _j in mathscrC sum _l=1^4 theta _l^t’ f_l,j^t’, endaligned$$
(8)
where (mathscrC) denotes the company list and (theta _l) denotes the weight of features. Users can express his/her preference to each company featurebased by specifying (theta _l). Note that (N_C_i_CF^t’) ranges from 0 to 100 based on the formulation.
On the other hand, we estimate the position featurebased score as follows. The job position contributes to one’s company rating in terms of the personjob fit, which is found positively correlated to job satisfaction^{51}. From the perspective of employers, job seekers are also encouraged to be matched to a backgroundfitting position^{37}. The position for the next company is related to the employee’s current experience. In this paper, the new job position is predicted by a job predictor based on the records in our dataset. The job predictor is determined by counting all the position transfer records for the current position and normalizing the top three most frequently selected positions. If the job position is the same as before, then people will get full credit for this position. If the position type is changed, we further evaluate the new work environment (i.e., whether employees get enough team support during the learning/training period).
As expert is instrumental when people face new difficulties in the team^{52} and the support from team is vital for overcoming the difficulty^{53}, we assume that if people can get enough support from expert team members, the new job is desirable even if it is a new position type. The number of experts in the given position should be positively related to the number of the positions in the company. Thus, to evaluate if people can obtain enough support from the team, we measure if there are sufficient number of coworkers at the same job position in the company. If the job is among the top3 major position types in the company, we assume that people are more likely to receive sufficient support; otherwise, insufficient support is assumed. To differentiate the above cases quantitatively, we have the following settings. Assuming that the previous position is (J_i1) and the current position is (J_i), the position featurebased score for company (C_i) can be defined as follows:
$$beginaligned N_C_i_PF= {left beginarrayll 100, & J_i=J_i1\ 80, & J_i ne J_i1; J_i text is among the top3 position types in C_i \ 60, & text otherwise endarrayright. endaligned$$
(9)
Please note that this is a simplified formulation we use to assess the overall personjob fit. More complicated formulation can be adopted in real cases.
Given company featurebased score (N_C_i_CF^t’) and position featurebased score (N_C_i_PF), we define the company rating of (C_i) as their linear combination:
$$beginaligned N_C_i=beta _1cdot left( s_1cdot N_C_i_CF^t’+s_2cdot N_C_i_PFright) , endaligned$$
(10)
where (s_1) and (s_2) are the weights of the two feature scores with (s_1 + s_2=1). Here (beta _1) is used to describe a negative effect of downtrend career moves on the company rate. People usually seek opportunities to work in better companies with improved work environment and potentials^{54}. It has also been found that people would have a decreased job performance with their new employers if the work environment did not improve^{37}. Thus, we have the following settings: (beta _1<1) if (N_C_i le N_C_i1); (beta _1 = 1) if (N_C_i > N_C_i1).
Periodic Extent of Suffering. Human capital transferring is not easy, and sometimes will lead to a job performance reduction^{37}. We refer to this as the suffering period after job hopping. According to a survey presented by Morse and Weiss^{55}, people are more willing to work in the same type of companies as their current one when seeking the next job. Evidence was also found that firmspecific skills plays an important role in employees’ performance^{37}. It is common that companies in the same business sector have similar positions (e.g., JP Morgan Chase and Bank of America). Thus, we estimate the extent of suffering by the similarity of the current and potentially new companies. A higher level of similarity is supposed to result in less suffering in the new company.
Let j be the order index of a given position type list, the percentage of the (j^th) position type in company (C_i) is computed as:
$$beginaligned Per(C_i,j)=fracbig M_C_i, endaligned$$
(11)
where (PT(C_i,k)) is the position type of the (k^th) position in company (C_i); (M_C_i) is the total number of position records in company (C_i). Then, we compute the cosine similarity between companies (C_i) and (C_i1) as follows:
$$beginaligned sim(C_i,C_i1)=fracsum _j=1^n_p left[ Perleft( C_i,jright) Perleft( C_i1,jright) right] sqrtsum _j=1^n_p [Perleft( C_i,jright) ]^2sqrtsum _j=1^n_p [Perleft( C_i1,jright) ]^2, endaligned$$
(12)
where the total number of position types (n_p = 26) in our data.
Estimator for Staying Probability and Duration. The staying probability not only helps estimate the duration, but also serves as an important component in our reward function. The following equation offers a simple way to estimate the staying probability for a given company:
$$beginaligned Pr(t)=1fracd(t)n(t), endaligned$$
(13)
where d(t) denotes the number of people left the company before time t; and n(t) represents the total number of employees in the company and (n(t) ge d(t)).
However, such estimation may be inaccurate due to data noise and incomplete records from existing employees. For a current employee in a company, we can only define his/her future situation as unknown or uncertain. To obtain a more reliable staying duration, we estimate the staying probability for such samples via survival analysis. Our problem can be considered as a rightcensoring condition, given that only the starting time of each job is known. Thus, we apply the Kaplan–Meier (KM) estimator from survival analysis. Specifically, we define staying at a company as “survive”, and leaving the company as “die”. Let d(x) denotes the number of “survivers” from ([x, x+Delta x)) and n(x) denotes the number of individuals at risk just prior to a given time x; i.e., the number of individuals in the sample who neither left nor were censored prior to time x. Then, the KM estimator for the staying probability can be written as:
$$beginaligned Pr(t)=KM(t)= prod _x le t left( 1fracd(x)n(x)right) , endaligned$$
(14)
where (x le t) represents all grid points such that (x+Delta x le t). See technical details of KM estimator in reference^{56}.
The staying probability can be used to estimate the staying duration of employees. It has been found in the literature that people tend to stay in similar type of companies in their career life^{57}. As the estimated staying probability represents a mainstream pattern, we have the following design to differentiate mainstream followers from others. Given a current company, we define the mainstream selection for the next employer as the top100 most selected companies in the data. Also, if a person chooses her new job from the mainstream list, the remaining companies on the list continue as the mainstream choices for the next jobhopping. Suppose a person is working at her (i^th) company, and (Psi (C_i)) is the set of top100 jobhopping selections for people who worked at company (C_i). The mainstream choices for the ((i+1)^th) jobhopping is defined as (Psi (C_i) cup Psi (C_i1) cup Psi (C_i2) cdots cup Psi (C_0)). Importantly, this design enables us to partially simulate the age and learning experience effects, which have been found to be important factors of career mobility^{58}. In the shortterm view, we assume that there will be more useful experience gain for the mainstream followers. Off the mainstream may indicate lower survival rate due to irrelevant experience. On the other hand, in the longterm view, if people try different types of companies/positions and received “panelties” in their early career stage, there will be a broader range of “mainstream” companies that they can survive in the later career stages.
Therefore, one’s staying duration at company (C_i) can be estimated according to (beta _2 Pr(C_i, t)), where (beta _2) is a penalty term, which is set to 1 if (C_i in Psi (C_i) cup Psi (C_i1) cup Psi (C_i2) cdots cup Psi (C_0)), or 0.8 otherwise. The staying duration can be easily determined when the model concludes the “leaving state” for a current company (C_i) based on (beta _2 Pr(C_i, t)) at time t.
Reward Formulation: The Final Form. To find the optimal path (P^*), we need to define a proper reward function R to guide the exploration and determine the optimal path. We define the reward as the accumulative score considering the company rating, the periodic extent of suffering, and the staying probability. Thus, the total reward for staying at company (C_i) is defined as:
$$beginaligned S_C_i= & int _0^t_1 N_C_icdot Pr(C_i, t) left( sim(C_i,C_i1)1right) dtnonumber \&+int _0^D_C_i N_C_icdot Pr(C_i, t) dt, endaligned$$
(15)
where (t_1) is the suffering period; (N_C_i) is the company rating; (D_C_i) means the duration at company (C_i); (Pr(C_i,t)) denotes the staying probability for company at time t; and (sim(C_i,C_i1)) is the similarity between two companies.
Equation (15) evaluates the reward for staying at company (C_i). The first component (int _0^t_1 N_C_icdot Pr(C_i, t) left( sim(C_i,C_i1)1right) dt) measures the weakening effect during the suffering period at a new company. The second component (int _0^D_C_i N_C_icdot Pr(C_i, t) dt) estimates the accumulative reward with decayed staying probability. To ease the computing, we divide the total time into small intervals (Delta t), and then Eq. (15) can be reformulated as:
$$beginaligned S_C_i&= Rleft( theta _l f_l,C_i1, J_i1, C_i, J_i, D_iright)\ &= N_C_ileft( left( sim(C_i, C_i1)1right) sum _k=1^t_1/Delta t Pr(C_i,k Delta t) +sum _k=1^D_C_i/Delta t Pr(C_i,k Delta t)right) Delta t endaligned$$
(16)
Theoretical analysis
In the following, we provide further analysis on our SSRL framework, focusing on the properties of the proposed CSR problem, our exploration strategy, the speedup of convergence, and their theoretical supports.
Fundamental Property. Given the starting state (C_0), assuming that we have found the optimal path (P^*) generated by the optimal policy (pi ^*), then we should have the following property.
Property 1
(Upper Bound of Policies) Given finite career time, if (P^*) is the optimal career path starting with (C_0), we should have (sum _j in P_i S_j le sum _j in P^* S_j), for any path (P_i) with the same initial state and career time length.
All proofs can be found in the supplementary information. According to our problem setting, the unpredictable size of the company set (mathscrC) may result in huge computing challenges, while our solution is to achieve the global optimum with fixedsized subsets (mathscrC_sub), leading to the following lemma as a special case of Property 1.
Lemma 1
(Upper Bound of Local Policy) Given an initial company (C_0), for any optimal paths generated by (pi ^*_mathscrC_sub) the optimal policy on (mathscrC_sub subseteq mathscrC), their accumulative rewards cannot exceed that of (P^*) generated by the optimal policy (pi ^*_mathscrC).
Lemma 1 demonstrates a challenging optimization task in our work, which is to achieve the global optima based on local exploration. That is, our SSRL needs to explore the global optima based on local policy (pi ^*_mathscrC_sub). The following proposition defines the sufficient and necessary condition of this task.
Proposition 1
(Boundary Condition) Suppose that we define “target companies” as those appear on the globally optimal career path. The global optima can be achieved by a local policy (pi ^*_mathscrC_sub), if and only if the target companies are included in the corresponding company subset (mathscrC_sub).
Proposition 1 indicates that it is possible to locate the global optimum by exploring the best locally optimal policy, instead of exploring the whole company pool. Our exploration method is designed to handle global optimization task by stochastically exploring local optima. Please note that the target companies cannot be discovered directly, while Proposition 1 guarantees that the target companies are included in the final company subset if the global optima is achieved.
Exploration strategy. Assuming that there are m companies in a company subset, the number of candidate subsets equals to (fracN!m!(Nm)!). Given a large total number of companies N, it is still expensive to find the best path generated by the local policy, if we have to explore all candidate subsets. As we aims to find the best locally optimal path, we have the following proposition.
Proposition 2
(Transformation Condition) With a large number of iterations, if the quality of the optimal path cannot be further improved based on randomly generated subsets, then the current optimal path is the global optima.
According to Proposition 2, as long as the policy is continuously optimized in a stochastic process, we will obtain the best local optima eventually. Thus, we need to make sure the exploration process will converge with a time limit. As we introduce several random events in our model (e.g., duration estimator, position predictor), the result for current optimal policy might be worse than the preserved one due to the uncertainty in the path generation. To accelerate the convergence, we develop a cooldown strategy based on the Boltzmann distribution [see Eq. (5)].
Cooldown strategies have been developed with simulated annealing techniques and are found efficient to handle sequential recommendation tasks^{22,28}. In this paper, the cooldown strategy is developed under a more complicated RLbased framework to speed up the career path exploration. Theoretically, the convergence is guaranteed, referring to our Proposition 3 as follows.
Proposition 3
(Convergence analysis) The cooldown strategy guarantees the convergence under the uncertain scenario in career path recommendation.
Baseline methods
Five baselines were implemented in this work, including JBMUL, IGM, MGM, TTD, and PDQN. We summarize their advantages and disadvantages in Table 3.

JBMUL. We follow the idea of reference^{59} to locate the best result at each time step. For selecting the next state, we calculate the accumulative reward of each company based on the current state. Then the model selects the maximum accumulative reward as the next state and continue.

IGM. This method aims to find the result which is better than the previous state. For the current state, we calculate the accumulative reward of each company. Comparing the result with previous state’s reward, as long as it is better than or equal to the previous one, we will place the company on the current state and continue to the next state.

MGM. This method is designed for finding the optimal solution based on the average accumulative reward of accepted companies. For the current state, as long as the calculated reward is better than or equal to the average reward of accepted companies, we will accept this allocation and continue to the next state.

TTD. As mentioned in reference^{60}, TTD is a traditional offpolicy temporal difference learning. After a fixed number of iterations, TTD will generate the policy based on the initial state.

PDQN. This baseline evaluates the implementation of our exploration strategy in deep RL. We regularize the action based on our exploration strategy (i.e., subset generation) and update the policy by the deep RL. We deploy a simple neural network with one hidden layer, the input and output size is set to 20, and the size of the hidden layer is 40.
Experimental settings
We set the time length of career path to 20 years and the precision of time interval (Delta t) is a quarter, the suffer time is set to 1 year. The default setting for the optional information in the input (e.g. working duration, work history) is none. For RL method, the discount rate (eta = 0.9), and the step size is set to 0.01. The total number of iterations is set to 1,000,000 steps; and if the total work time is more than 20 years, we will restart at the initial state. For SSRL and PDQN, we set the initial temperature T to 1, and the decay rate (Gamma = 0.99). The fixed length of the subset (mathcalC_sub) is set to 20, and we compare the result every 100 iterations. For each method, we simulate the career path on different numbers of processors and average the path score under the same grading criteria.
The major setting of the four scenarios in our experiments are related to the weight of the companyrelated features, including reputation, popularity, duration, position change rate and smooth transfer rate.

Scenario 1. General case (no specific user preference). This case shows the general case of our recommendation in which the user has no preference for the company features. The weights for all the features are 0.25.

Scenario 2. Personalized case (reputation preferred). This case indicates the path for those users who care more about the company reputation. The weight for the reputation is 0.6 and others are 0.1.

Scenario 3. Personalized case (timevarying preference). This case reflects the dynamical requirements from user at different career stage. The weights for the first half of career path are [0.1, 0.1, 0.1, 0.7] while the rest are [0.1, 0.7, 0.1, 0.1].

Scenario 4. Personalized case (specific company preferred). This case simulates the situation that user has a rough plan for his/her career. We set the plan as being transferred to the Bank of Boston at the third career stage for 4 years.
Experimental environment
All experiments were conducted on a highperformance computer (HPC) cluster with 164 compute nodes, each with two Intel Xeon E52683v3 20core processors and 128 GB DDR4 Memory.