
Census income data set - category classification

Updated: Nov 3, 2021



Description :


The data was extracted by Barry Becker from the 1994 census database. The dataset contains 14 attributes, 8 categorical and 6 continuous, with information about age, education, nationality, marital status, relationship status, occupation, work classification, gender, race, working hours per week, capital loss and capital gain. The target variable is income level: whether a person earns more than 50 thousand dollars per year, to be predicted from the given set of attributes.
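As a rough sketch, the 8 categorical and 6 continuous attributes can be written down as plain lists. The names follow the column overview on this page; `missing_columns` is a hypothetical helper, and the exact spelling in a particular CSV export may differ (this is an assumption to check against your file).

```python
# Column names as listed in the dataset overview; a given CSV export
# may spell them differently, so treat these lists as an assumption.
CATEGORICAL = ["workclass", "education", "marital-status", "occupation",
               "relationship", "race", "sex", "native-country"]
CONTINUOUS = ["age", "fnlwgt", "education-num", "capital-gain",
              "capital-loss", "hours-per-week"]
TARGET = "income"


def missing_columns(columns):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    expected = CATEGORICAL + CONTINUOUS + [TARGET]
    return [name for name in expected if name not in present]


print(len(CATEGORICAL), "categorical and", len(CONTINUOUS), "continuous attributes")
```

Once the data is loaded into a DataFrame, `missing_columns(df.columns)` gives a quick check that the file matches this layout.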


Recommended Model :


Suitable algorithms include decision tree classifiers, random forests, SVMs, and logistic regression.
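A minimal sketch of how one of these models might be wired up with scikit-learn, assuming that library is installed. The `build_model` helper and the six-row `toy` frame are made up for illustration only; they are not real census rows and say nothing about expected accuracy.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_model(categorical, continuous):
    """One-hot encode categorical columns, pass continuous ones through,
    then fit a random forest on the result."""
    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", "passthrough", continuous),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ])


# Tiny made-up sample, only to show the call pattern.
toy = pd.DataFrame({
    "age": [25, 47, 38, 52, 29, 61],
    "hours-per-week": [40, 50, 45, 60, 35, 20],
    "workclass": ["Private", "Self-emp-inc", "Private",
                  "Federal-gov", "Private", "Local-gov"],
    "income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K"],
})
X = toy[["age", "hours-per-week", "workclass"]]
y = (toy["income"] == ">50K").astype(int)  # 1 means income above 50K

model = build_model(categorical=["workclass"],
                    continuous=["age", "hours-per-week"])
model.fit(X, y)
predictions = model.predict(X)
```

The `clf` step can be swapped for `DecisionTreeClassifier`, `SVC`, or `LogisticRegression` to try the other recommended algorithms; on the real data, evaluate on a held-out split rather than on the training rows.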


Recommended Projects :


Determine whether a person makes over 50,000 a year (income prediction).


Dataset link



Overview of data


Detailed overview of dataset

  • Records in the dataset = 32561 ROWS

  • Columns in the dataset = 15 COLUMNS (14 attributes plus the income target)

  1. age: Age of the person (continuous).

  2. workclass: Working sector of the person (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked).

  3. fnlwgt: Final weight (continuous). The weights on the Current Population Survey (CPS) files are controlled to independent estimates of the civilian noninstitutional population of the US, which are prepared monthly by the Population Division at the Census Bureau.

  4. education: Qualification (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool).

  5. education-num: Education level encoded as a number (continuous).

  6. marital-status: Marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse).

  7. occupation: Occupation of the person (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspect, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces).

  8. relationship: Relationship status (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried).

  9. race: Race of the person (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black).

  10. sex: Gender of the person (Female, Male).

  11. capital-gain: Capital gain of the person per year (continuous).

  12. capital-loss: Capital loss of the person per year (continuous).

  13. hours-per-week: Work hours per week (continuous).

  14. native-country: Native country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands).

  • Target variable:

  • income: Whether the person earns >50K or <=50K per year.
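For modelling, the two income labels are usually turned into 0 and 1. The sketch below is one way to do that; `encode_income` is a hypothetical helper, and the stray spaces and trailing dots it tolerates are an assumption about how some copies of the file format their labels.

```python
import pandas as pd


def encode_income(income):
    """Map income labels to 1 (>50K) and 0 (<=50K).

    Surrounding whitespace and a trailing dot are removed first, since
    some copies of the data are assumed to carry them.
    """
    cleaned = income.astype(str).str.strip().str.rstrip(".")
    return cleaned.map({">50K": 1, "<=50K": 0})


# Made-up labels that mimic the formatting variants handled above.
sample = pd.Series([" <=50K", ">50K", "<=50K.", " >50K "])
encoded = encode_income(sample)
print(encoded.tolist())
```

Any label outside the two expected values maps to NaN, which makes unexpected entries easy to spot with `encoded.isna().sum()`.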


EDA[Code]


Dataset


# Load the data
import pandas as pd

# Forward slashes work on Windows, macOS, and Linux alike
file_loc = "data/adult.csv"
census_data = pd.read_csv(file_loc)
census_data.head()


Total Number of Rows and Columns in the dataset


shape=census_data.shape
print("Total records in the dataset :", shape[0])
print("Total columns in the dataset :", shape[1])


Check the details of dataset


# Data information
census_data.info()


Check the missing values in the dataset.


# Check the missing values in each column
census_data.isna().sum()
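In many copies of this dataset, unknown values are stored as the text "?" rather than as empty cells, so `isna()` can report zero missing values even when some are unknown. The sketch below counts such placeholders per column; `count_placeholders` is a hypothetical helper and the small `toy` frame is made up for illustration.

```python
import pandas as pd


def count_placeholders(df, placeholder="?"):
    """Count cells per column that equal `placeholder`, ignoring
    surrounding whitespace."""
    as_text = df.astype(str).apply(lambda col: col.str.strip())
    return (as_text == placeholder).sum()


# Made-up rows, only to demonstrate the check.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "?", " ?"],
    "occupation": ["Sales", "?", "Craft-repair"],
})
counts = count_placeholders(toy)
print(counts)
```

If the placeholder turns out to be present, passing `na_values="?"` to `pd.read_csv` is one way to have pandas treat it as missing at load time.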

Statistical information


# Statistical information about the dataset
census_data.describe()
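By default `describe()` summarizes only the numeric columns. A small sketch of getting the equivalent summary (count, unique, top, freq) for the categorical columns follows; the `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows, only to demonstrate the call.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "Private", "State-gov"],
})

# Excluding numeric dtypes leaves the text columns, for which pandas
# reports count, number of unique values, most frequent value, and its frequency.
categorical_summary = toy.describe(exclude=[np.number])
print(categorical_summary)
```

The same call on the loaded census data would summarize columns such as workclass, education, and occupation.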


Data Visualization :


Correlation

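import seaborn as sns
import matplotlib.pyplot as plt

# Correlation between the numeric columns only; the categorical columns
# hold strings, which recent pandas versions refuse to correlate
corr = census_data.select_dtypes(include="number").corr()
corr.style.background_gradient(cmap='coolwarm')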


Count plot of income

sns.set_style("whitegrid")
plt.figure(figsize = (8,5))
sns.countplot(x='income', data=census_data) 
plt.show()
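The count plot shows class balance visually; the same information can be read as numbers. The sketch below is one way to do it, with `class_balance` as a hypothetical helper and a made-up four-row sample that does not reflect the real class ratio.

```python
import pandas as pd


def class_balance(labels):
    """Return the share of each class as a fraction of all rows."""
    return labels.value_counts(normalize=True)


# Made-up labels, only to demonstrate the call.
sample = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"])
balance = class_balance(sample)
print(balance)
```

On the loaded data this would be called as `class_balance(census_data["income"])`; a strongly uneven split is a reason to look at metrics beyond plain accuracy.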


Count plot of gender


sns.countplot(x='sex', data=census_data) 
plt.show()

Count plot of Workclass


plt.figure(figsize = (18,10))
sns.countplot(x='workclass', data=census_data) 
plt.show()





If you need implementation for any of the topics mentioned above or assignment help on any of its variants, feel free to contact us.
