Snapback
by
Data Cleaning and Profiling (D599 Task1)
The objective of this project is to profile and clean the dataset of a large multinational tech company to optimize it for future analysis. In today’s highly competitive tech industry, attracting and retaining top talent is a critical business priority. This will enable the tech business to design more effective employee retention strategies.
Part I: Data Profiling
· Initial rows/records: 10,199
· Initial columns/variables: 16
Variables Data Type
1 Employee number Quantitative/Discrete
2 Age Quantitative/Discrete
3 Tenure Quantitative/Discrete
4 Turnover Qualitative/Nominal
5 HourlyRate Quantitative/Continuous
6 HoursWeekly Quantitative/Continuous
7 CompensationType Qualitative/Nominal
8 AnnualSalary Quantitative/Continuous
9 DrivingCommute Quantitative/Discrete
10 JobRole Qualitative/Nominal
11 Gender Qualitative/Nominal
12 MaritalStatus Qualitative/Nominal
13 NumCompaniesPreviouslyWorked Quantitative/Discrete
14 AnnualProfessionalDevHrs Quantitative/Discrete
15 PayCheckMethod Qualitative/Nominal
16 TextMessageOptin Qualitative/Nominal
print(df.head())
EmployeeNumber Age Tenure Turnover HourlyRate HoursWeekly
0 1 28 6 Yes $24.37 40
1 2 33 2 Yes $24.37 40
2 3 22 1 No $22.52 40
3 4 23 1 No $22.52 40
4 5 40 6 No $88.77 40
CompensationType AnnualSalary DrivingCommuterDistance
0 Salary 50689.6 89
1 Salary 50689.6 89
2 Salary 46841.6 35
3 Salary 46841.6 35
4 Salary 284641.6 12
JobRoleArea Gender MaritalStatus \ 0 Research Female Married 1 Research Female Married 2 Information_Technology Female Single 3 Information_Technology Female Single 4 Sales Prefer Not to Answer Single
NumCompaniesPreviouslyWorked AnnualProfessionalDevHrs PaycheckMethod
0 3.0 7.0 Mail Check
1 6.0 7.0 Mail Check
2 1.0 8.0 Mailed Check
3 3.0 NaN Mailed Check
4 7.0 NaN Mail Check
TextMessageOptIn
0 Yes
1 Yes
2 Yes
3 Yes
4 Yes
Part II: Data Cleaning and Plan
B. Data Cleaning with Python
Data Quality Inspection and Findings
Find Duplicate Rows
In [138]:
df2=df.copy()
print(“\nDuplicate rows:”, df.duplicated().sum()) #99 rows x 16 columns print(“\nDuplicate rows preview:”) print(df2[df2.duplicated()].head())
df2=df2.drop_duplicates() print(df2) #10100, 16 To check for duplicates: o df.duplicated returns a boolean series for each row to determine how many duplicate rows exist o df.drop_duplicates() removes the duplicates to a new dataframe Duplicate Entries Findings: o Found and deleted 99 duplicate rows Duplicate rows: 99
Duplicate rows preview:
EmployeeNumber Age Tenure Turnover HourlyRate HoursWeekly
10100 1 28 6 Yes $24.37 40
10101 2 33 2 Yes $24.37 40
10102 3 22 1 No $22.52 40
10103 4 23 1 No $22.52 40
10104 5 40 6 No $88.77 40
CompensationType AnnualSalary DrivingCommuterDistance \ 10100 Salary 50689.6 89 10101 Salary 50689.6 89 10102 Salary 46841.6 35 10103 Salary 46841.6 35 10104 Salary 284641.6 12
JobRoleArea Gender MaritalStatus \ 10100 Research Female Married 10101 Research Female Married 10102 Information_Technology Female Single 10103 Information_Technology Female Single 10104 Sales Prefer Not to Answer Single
NumCompaniesPreviouslyWorked AnnualProfessionalDevHrs PaycheckMethod \ 10100 3.0 7.0 Mail Check 10101 6.0 7.0 Mail Check 10102 1.0 8.0 Mailed Check 10103 3.0 NaN Mailed Check 10104 7.0 NaN Mail Check
TextMessageOptIn 10100 Yes 10101 Yes 10102 Yes 10103 Yes 10104 Yes
EmployeeNumber Age Tenure Turnover HourlyRate HoursWeekly \ 0 1 28 6 Yes $24.37 40 1 2 33 2 Yes $24.37 40 2 3 22 1 No $22.52 40 3 4 23 1 No $22.52 40 4 5 40 6 No $88.77 40 ... ... ... ... ... ... ... 10095 10096 50 15 Yes $61.78 40 10096 10097 33 9 Yes $23.28 40 10097 10098 31 9 Yes $28.25 40 10098 10099 50 12 No $32.22 40 10099 10100 59 14 No $44.59 40
CompensationType AnnualSalary DrivingCommuterDistance \ 0 Salary 50689.6 89 1 Salary 50689.6 89 2 Salary 46841.6 35 3 Salary 46841.6 35 4 Salary 284641.6 12 ... ... ... ... 10095 Salary 128502.4 6 10096 Salary 48422.4 -10 10097 Salary 37960.0 68 10098 Salary 67017.6 -13 10099 Salary 92747.2 94
JobRoleArea Gender MaritalStatus \ 0 Research Female Married 1 Research Female Married 2 Information_Technology Female Single 3 Information_Technology Female Single 4 Sales Prefer Not to Answer Single ... ... ... ... 10095 Laboratory Male Divorced 10096 Marketing Male Single 10097 Laboratory Male Married 10098 Research Female Married 10099 Research Male Married
NumCompaniesPreviouslyWorked AnnualProfessionalDevHrs PaycheckMethod \ 0 3.0 7.0 Mail Check 1 6.0 7.0 Mail Check 2 1.0 8.0 Mailed Check 3 3.0 NaN Mailed Check 4 7.0 NaN Mail Check ... ... ... ... 10095 1.0 8.0 Mail Check 10096 1.0 20.0 Direct_Deposit 10097 2.0 21.0 DirectDeposit 10098 5.0 20.0 Direct_Deposit 10099 NaN 10.0 DirectDeposit
TextMessageOptIn 0 Yes 1 Yes 2 Yes 3 Yes 4 Yes ... ... 10095 Yes 10096 Yes 10097 NaN 10098 Yes 10099 NaN
[10100 rows x 16 columns] Inconsistent Entries
for col in df2.columns: unique_vals = df2[col].nunique() print(f”{col}: {unique_vals} unique values”) To check for inconsistent entries for qualitative variables: EmployeeNumber: 10100 unique values Age: 41 unique values Tenure: 20 unique values Turnover: 2 unique values HourlyRate : 5244 unique values HoursWeekly: 1 unique values CompensationType: 1 unique values AnnualSalary: 5538 unique values DrivingCommuterDistance: 120 unique values JobRoleArea: 12 unique values Gender: 3 unique values MaritalStatus: 3 unique values NumCompaniesPreviouslyWorked: 9 unique values AnnualProfessionalDevHrs: 21 unique values PaycheckMethod: 7 unique values TextMessageOptIn: 2 unique values
for col in df.select_dtypes(include=’object’).columns: print(f”\nColumn: {col}”) print(df[col].value_counts(dropna=False).head(10)) # top 10 categories Inconsistent Entries Findings:
Column: PaycheckMethod PaycheckMethod Mail Check 4986 Mailed Check 2441 DirectDeposit 992 Direct_Deposit 958 Mail_Check 547 Direct Deposit 226 MailedCheck 49 Name: count, dtype: int64 #Qualitative
df3=df2.copy() df3[‘JobRoleArea’] = df3[‘JobRoleArea’].astype(str).str.strip().str.title()
mapping = { ‘Information Technology’: ‘Info_Tech’, ‘Information_technology’: ‘Info_Tech’, ‘Informationtechnology’: ‘Info_Tech’, ‘Human Resources’: ‘Human_Resources’, ‘HumanResources’: ‘Human_Resources’ }
df3[‘JobRole’] = df3[‘JobRoleArea’].map(mapping).fillna(df3[‘JobRoleArea’]) df3[‘JobRole’].value_counts(dropna=False)
#Quantitative
df3[‘PaycheckMethod’] = df3[‘PaycheckMethod’].astype(str).str.strip().str.title()
mapping = { ‘Mail Check’: ‘Mailed_Check’, ‘Mailed Check’: ‘Mailed_Check’, ‘Mail_Check’: ‘Mailed_Check’, ‘Mailedcheck’: ‘Mailed_Check’, ‘Directdeposit’: ‘Direct_Deposit’, ‘Direct Deposit’: ‘Direct_Deposit’ }
df3[‘PayMethod’] = df3[‘PaycheckMethod’].map(mapping).fillna(df3[‘PaycheckMethod’]) df3[‘PayMethod’].value_counts(dropna=False)
#drop old columns df3.drop([‘PaycheckMethod’, ‘JobRoleArea’], axis=1, inplace=True) df3.info #10100x16
AnnualProfessionalDevHrs TextMessageOptIn JobRole \ 0 7.0 Yes Research 1 7.0 Yes Research 2 8.0 Yes Information_Technology 3 NaN Yes Information_Technology 4 NaN Yes Sales ... ... ... ... 10095 8.0 Yes Laboratory 10096 20.0 Yes Marketing 10097 21.0 NaN Laboratory 10098 20.0 Yes Research 10099 10.0 NaN Research
PayMethod 0 Mailed_Check 1 Mailed_Check 2 Mailed_Check 3 Mailed_Check 4 Mailed_Check ... ... 10095 Mailed_Check 10096 Direct_Deposit 10097 Direct_Deposit 10098 Direct_Deposit 10099 Direct_Deposit
[10100 rows x 16 columns]> Quality Issue Findings Inconsistent Entries• The column ‘JobRoleArea’ had multiple variations for entries ‘information technology’ and ‘human resources’ The column ‘PayCheckMethod’ had multiple variations for ‘Mailed_Check’ and ‘Direct Deposit’Do the same steps for ‘PayCheckMethod’ as ‘JobRoleArea’ to correct inconsistent entries • Create a mapping dictionary to correct spacing, white space, and capitalization inconsistencies• Rename column and drop old column Formatting Errors To check for formatting errors: o Df.describe returns negative ‘AnnualSalary’ and ‘DrivingCommuterDistance’ o Check for numeric columns stored as text, currency format, code consistency, and length checks Formatting Errors Findings: · Negative values were found for ‘Annual Salary’ and ‘DrivingCommuterDistance’ · ‘AnnualSalary’ was not formatted as currency
df3.columns = df3.columns.str.strip()
df3[df3.select_dtypes(include=’number’).columns] = df3.select_dtypes(include=’number’).abs()
for col in df3.select_dtypes(include=’object’).columns: df3[col] = df3[col].str.strip().str.title()
print(df3.value_counts(dropna=False))
df3[‘AnnualSalary’] = (df3[‘AnnualSalary’].astype(float) )
df3[‘AnnualSalary’] = df3[‘AnnualSalary’].apply(lambda x: “${:,.2f}”.format(x))
Out [143]:
EmployeeNumber Age Tenure Turnover HourlyRate HoursWeekly CompensationType AnnualSalary DrivingCommuterDistance Gender MaritalStatus NumCompaniesPreviouslyWorked AnnualProfessionalDevHrs TextMessageOptIn JobRole PayMethod
1 28 6 Yes $24.37 40 Salary 50689.6 89 Female Married 3.0 7.0 Yes Research Mailed_Check 1
6738 37 7 No $32.09 40 Salary 66747.2 52 Male Married 1.0 21.0 Yes Info_Tech Mailed_Check 1
6731 28 5 No $23.33 40 Salary 48526.4 11 Female Divorced 1.0 24.0 NaN Info_Tech Mailed_Check 1
6732 52 15 Yes $93.22 40 Salary 333897.6 53 Female Divorced NaN NaN Yes Manufacturing Mailed_Check 1
6733 58 7 Yes $94.53 40 Salary 336622.4 92 Male Married 6.0 18.0 Yes Manufacturing Mailed_Check 1
..
3367 48 19 No $92.51 40 Salary 332420.8 12 Female Single 7.0 12.0 NaN Research Mailed_Check 1
3368 28 4 Yes $23.80 40 Salary 49504.0 71 Female Married 3.0 10.0 NaN Sales Mailed_Check 1
3369 55 17 Yes $43.84 40 Salary 91287.2 44 Male Single 7.0 17.0 NaN Marketing Mailed_Check 1
3370 45 15 Yes $71.11 40 Salary 147908.8 12 Male Single 4.0 18.0 No Manufacturing Direct_Deposit 1
10100 59 14 No $44.59 40 Salary 92747.2 94 Male Married NaN 10.0 NaN Research Direct_Deposit 1
Name: count, Length: 10100, dtype: int64
for col in df3.select_dtypes(include=’object’).columns: lengths = df3[col].str.len() print(f”{col} min length: {lengths.min()}, max length: {lengths.max()}”) Out [145]: Turnover min length: 2, max length: 3 HourlyRate min length: 6, max length: 6 CompensationType min length: 6, max length: 6 AnnualSalary min length: 9, max length: 11 Gender min length: 4, max length: 20 MaritalStatus min length: 6, max length: 8 TextMessageOptIn min length: 2.0, max length: 3.0 JobRole min length: 5, max length: 22 PayMethod min length: 12, max length: 14 Formatting Errors Inspection:
df3.columns = df3.columns.str.strip()
df3[df.select_dtypes(include=’number’).columns] = df.select_dtypes(include=’number’).abs()
for col in df3.select_dtypes(include=’object’).columns: df3[col] = df3[col].str.strip().str.title()
print(df3.value_counts(dropna=False)) B1 Missing Values In [146]: import numpy as np
print(“Total missing values:”, df3.isna().sum().sum())
print(“Missing values per column:\n”, df3.isna().sum())
numeric_means = df3.select_dtypes(include=[np.number]).mean().round().astype(int)
df3.fillna(value=numeric_means, inplace=True)
for col in df3.select_dtypes(include=[np.number]).columns: df3[col] = df3[col].astype(int)
print(“Missing values after fill:\n”, df3.isna().sum()) print(“Data types after conversion:\n”, df3.dtypes)
missing_pct = df3.isna().mean().round(4) * 100 print(“\nMissing value percentage per column:”) print(missing_pct) Out [146]: Total missing values: 4868 Missing values per column: EmployeeNumber 0 Age 0 Tenure 0 Turnover 0 HourlyRate 0 HoursWeekly 0 CompensationType 0 AnnualSalary 0 DrivingCommuterDistance 0 Gender 0 MaritalStatus 0 NumCompaniesPreviouslyWorked 663 AnnualProfessionalDevHrs 1947 TextMessageOptIn 2258 JobRole 0 PayMethod 0 dtype: int64 Missing values after fill: EmployeeNumber 0 Age 0 Tenure 0 Turnover 0 HourlyRate 0 HoursWeekly 0 CompensationType 0 AnnualSalary 0 DrivingCommuterDistance 0 Gender 0 MaritalStatus 0 NumCompaniesPreviouslyWorked 0 AnnualProfessionalDevHrs 0 TextMessageOptIn 2258 JobRole 0 PayMethod 0 dtype: int64 Data types after conversion: EmployeeNumber int64 Age int64 Tenure int64 Turnover object HourlyRate object HoursWeekly int64 CompensationType object AnnualSalary object DrivingCommuterDistance int64 Gender object MaritalStatus object NumCompaniesPreviouslyWorked int64 AnnualProfessionalDevHrs int64 TextMessageOptIn object JobRole object PayMethod object dtype: object
Missing value percentage per column: EmployeeNumber 0.00 Age 0.00 Tenure 0.00 Turnover 0.00 HourlyRate 0.00 HoursWeekly 0.00 CompensationType 0.00 AnnualSalary 0.00 DrivingCommuterDistance 0.00 Gender 0.00 MaritalStatus 0.00 NumCompaniesPreviouslyWorked 0.00 AnnualProfessionalDevHrs 0.00 TextMessageOptIn 22.36 JobRole 0.00 PayMethod 0.00 dtype: float64 In [147]: df3.replace(“”, np.nan, inplace=True) # turn empty strings into NaN df3.fillna(‘NS’, inplace=True) # then replace NaN with 0 (or another value) Outliers To check for outliers: o Assign a variable to numeric columns o Calculate Q1, Q3 and IQR o Define upper and lower bounds o Flag outliers o Combine both outliers In [148]: numeric_cols = df3.select_dtypes(include=’number’).columns q1 = df3[numeric_cols].quantile(0.25) q3 = df3[numeric_cols].quantile(0.75) iqr = q3 - q1 lower_limit = q1 - 1.5 * iqr upper_limit = q3 + 1.5 * iqr
low_outliers = (df3[numeric_cols] < lower_limit).any(axis=1) high_outliers = (df3[numeric_cols] > upper_limit).any(axis=1) outliers = df3[low_outliers | high_outliers]
total_rows = df3.shape[0] num_outliers = outliers.shape[0] percent_outliers = (num_outliers / total_rows) * 100
outlier_counts = ((df3[numeric_cols] < lower_limit) | (df3[numeric_cols] > upper_limit)).sum()
summary_df = pd.DataFrame({ “Q1”: q1.abs(), “Q3”: q3.abs(), “IQR”: iqr.abs(), “Lower Limit”: lower_limit.abs(), “Upper Limit”: upper_limit.abs(), “Outlier Count”: outlier_counts.abs() })
print(f”\n— Outlier Summary —”) print(f”Total rows: {total_rows} | Outlier rows: {num_outliers} ({percent_outliers:.2f}%)\n”) print(summary_df.sort_values(“Outlier Count”, ascending=False)) Out [148]: — Outlier Summary — Total rows: 10100 | Outlier rows: 222 (2.20%)
Q1 Q3 IQR Lower Limit \ DrivingCommuterDistance 15.00 71.00 56.0 69.0 EmployeeNumber 2525.75 7575.25 5049.5 5048.5 Age 37.00 53.00 16.0 13.0 Tenure 5.00 13.00 8.0 7.0 HoursWeekly 40.00 40.00 0.0 40.0 NumCompaniesPreviouslyWorked 2.00 6.00 4.0 4.0 AnnualProfessionalDevHrs 11.00 19.00 8.0 1.0
Upper Limit Outlier Count DrivingCommuterDistance 155.0 222 EmployeeNumber 15149.5 0 Age 77.0 0 Tenure 25.0 0 HoursWeekly 40.0 0 NumCompaniesPreviouslyWorked 12.0 0 AnnualProfessionalDevHrs 31.0 0 Out [149]: In [150]: import seaborn as sns import matplotlib.pyplot as plt commuter = df3
column_name = df3[‘DrivingCommuterDistance’] df_commuter=pd.DataFrame(commuter,columns=column_name)
sns.boxplot(data=df3, x= column_name) plt.show() This image has an empty alt attribute; its file name is image.png out [150]: In [151]: import numpy as np import matplotlib.pyplot as plt
outlier_indices = np.where((df3[‘DrivingCommuterDistance’] >= 155) & (df3[‘DrivingCommuterDistance’] <= 69))
no_outliers = df3.drop(outlier_indices[0])
fig, ax_no_outliers = plt.subplots(figsize=(6, 4))
ax_no_outliers.scatter(no_outliers[‘DrivingCommuterDistance’], no_outliers[‘DrivingCommuterDistance’])
ax_no_outliers.set_xlabel(‘(Lower_Distance)’)
ax_no_outliers.set_ylabel(‘(Higher_Distance) )’)
plt.show()
This image has an empty alt attribute; its file name is 1*UvvldSPG90bYRC9D-9JK3Q.png
Outliers Findings:
· Q1 — First Quartile/25th Percentile:
o 25% of employees are younger than 37 years, earn less than $63,440, and commute less than 13 miles.
· Q3 — Third Quartile/75th Percentile:
o 75% of employees are younger than 53 years, earn less than $153,717, and commute less than 71 miles.
· IQR — Interquartile Range:
o The central salary range spans $90,277.20
· Lower Limit:
o Salaries below $71,975.80 are potential outliers
· Upper Limit:
o Salaries above $289,133.00 are potential outliers
Save new cleaned csv file
In [152]:
df3.to_csv(‘Employee_Turnover_Cleaned10.csv’, index=False)
import os
print(os.getcwd())
Out [152]:
/Users/meredithsmith/PycharmProjects/PythonProject/PythonProject/PythonProject44
In [153]:print(df3.head)
Out [153]:
<bound method NDFrame.head of EmployeeNumber Age Tenure Turnover HourlyRate HoursWeekly
0 1 28 6 Yes $24.37 40
1 2 33 2 Yes $24.37 40
2 3 22 1 No $22.52 40
3 4 23 1 No $22.52 40
4 5 40 6 No $88.77 40
… … … … … … …
10095 10096 50 15 Yes $61.78 40
10096 10097 33 9 Yes $23.28 40
10097 10098 31 9 Yes $28.25 40
10098 10099 50 12 No $32.22 40
10099 10100 59 14 No $44.59 40
CompensationType AnnualSalary DrivingCommuterDistance \ 0 Salary $50,689.60 89 1 Salary $50,689.60 89 2 Salary $46,841.60 35 3 Salary $46,841.60 35 4 Salary $284,641.60 12 ... ... ... ... 10095 Salary $128,502.40 6 10096 Salary $48,422.40 10 10097 Salary $37,960.00 68 10098 Salary $67,017.60 13 10099 Salary $92,747.20 94
Gender MaritalStatus NumCompaniesPreviouslyWorked \ 0 Female Married 3 1 Female Married 6 2 Female Single 1 3 Female Single 3 4 Prefer Not To Answer Single 7 ... ... ... ... 10095 Male Divorced 1 10096 Male Single 1 10097 Male Married 2 10098 Female Married 5 10099 Male Married 4
AnnualProfessionalDevHrs TextMessageOptIn JobRole \ 0 7 Yes Research 1 7 Yes Research 2 8 Yes Information_Technology 3 15 Yes Information_Technology 4 15 Yes Sales ... ... ... ... 10095 8 Yes Laboratory 10096 20 Yes Marketing 10097 21 NS Laboratory 10098 20 Yes Research 10099 10 NS Research
PayMethod 0 Mailed_Check 1 Mailed_Check 2 Mailed_Check 3 Mailed_Check 4 Mailed_Check ... ... 10095 Mailed_Check 10096 Direct_Deposit 10097 Direct_Deposit 10098 Direct_Deposit 10099 Direct_Deposit
[10100 rows x 16 columns]> C. Data Cleaning Techniques Used to Correct Data Quality Issues