UK: +44 748 007-0908, USA: +1 917 810-5386 [email protected]

Tayko Software Cataloger

Tayko.csv is the dataset should be used
Background
Tayko is a software catalog firm that sells games and educational software. It started out as a software manufacturer and later added third-party titles to its offerings. It has recently put together a revised collection of items in a new catalog, which it is preparing to roll out in a mailing.
In addition to its own software titles, Tayko’s customer list is a key asset. In an attempt to expand its customer base, it has recently joined a consortium of catalog firms that specialize in computer and software products. The consortium affords members the opportunity to mail catalogs to names drawn from a pooled list of customers. Members supply their own customer lists to the pool, and can “withdraw” an equivalent number of names each quarter. Members are allowed to do predictive modeling on the records in the pool so they can do a better job of selecting names from the pool.
The Mailing Experiment
Tayko has supplied its customer list of 200,000 names to the pool, which totals over 5,000,000 names, so it is now entitled to draw 200,000 names for a mailing. Tayko would like to select the names that have the best chance of performing well, so it conducts a test—it draws 20,000 names from the pool and does a test mailing of the new catalog.
For ease of presentation, the dataset for this case includes just 1000 purchasers and 1000 nonpurchasers, an apparent response rate of 0.5.
Data
There are two outcome variables in this case. Purchase indicates whether or not a prospect responded to the test mailing and purchased something. Spending indicates, for those who made a purchase, how much they spent. The overall procedure in this case will be to develop two models. One will be used to classify records as purchase or no purchase. The second will be used for those cases that are classified as purchase and will predict the amount they will spend.
Table 1 provides a description of the variables available.

Questions:

  1. Develop two models for classifying a customer as a purchaser/nonpurchaser
    a. Partition the data randomly into training set (60%), validation set (40%). (random_State=1)
    b. Use logistic regression for classification.
    i. Predictor variables are decided based on your judgement call and dependent variable is “Purchase” (Hint: think should include the variable ‘Spending’ or not in your model)
    ii. Fit the logistic regression model and display logistic results in a dataframe using valid data
    iii. Calculate the accuracy score of your logistic regression classification model using valid data and briefly explain how you think of your model
    c. Use kNN for classification.
    i. Use the same set of predictors and outcome as the logistic regression
    ii. Normalization and use the normalized data to find the best k
  2. Hint: Since there is NO new classification data point as in class example, the codes for scaler in normalization can be rearranged in a simpler way.
    Codes for scaler in normalization:
    outcome = ‘put the dependent variable name here’
    predictors = [‘put the independent variable names here’]
    scaler = preprocessing.StandardScaler()
    scaler.fit(trainData[predictors])
  3. Give the normalized predictors of train data and valid data to different names than train_X and valid_X, since you already used it in the logistic regression. If you continue to use the same names, kNN will find the best k based on the logistic regression train and valid data, which is mismatch.
    iii. Calculate the accuracy score of your kNN classification model based on the best k and briefly explain how you think of your model
    d. Between logistic regression and kNN, which model you choose for classification and why?
  4. Develop a model for predicting spending among the purchasers.
    a. Create subsets of the training and validation sets for only purchasers’ records by filtering for Purchase value is 1.
    i. Hint: PID= list(put the dataframe name here.loc[put the dataframe name here.Purchase == 1].sequence_number)
    ii. # == means equal to in Python
    b. Develop a model for predicting spending with the filtered datasets using linear regression
    i. Step 1: develop new train data and valid data that only has the purchase =1 records (hint: put the new train dataset name here = trainData.loc[trainData.Purchase == 1])
    ii. Step 2: have new X and y that represents your predictors and dependent variable (hint: the name of X, y must be different from what you used in logistic regression and kNN, otherwise mismatch happens)
    iii. Step 3: give the value of dependent variable to outcome (outcome = ‘put your dependent variable name here’), so as to the new train data’s and valid data’s predictors and outcome (put the new train_X name here= put the new train dataset name in i. Step 1here[predictors], put the new train_y name here= put the new train dataset name in i. Step 1here[outcome])
    iv. Step 4: fit in the linear regression model and show the train results (or valid results) in dataframe
    Deliverables:
  5. A code file (Jupyter Notebook file) that addresses all questions above

Ready to Score Higher Grades?