Problem statement
Your task is to prepare a report presenting and interpreting the results of statistical analyses as described in the tasks below. The tasks include descriptive statistics, graphs, regression models and diagnostics, and interpretation.
CHDS Gestation Data
The data set for this exercise is a subset of the data collected as part of the Child Health and Development Studies, conducted between 1960 and 1967 in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The study examined birth weight and gestation, and their links with smoking. The data set you will examine is an extract containing live male births from 1961‐1962.
Data cleaning has been undertaken to combine categories, code factor variables, re‐scale weight from lbs to g/kg and to remove missing values. The results should be interpreted with this in mind.
List of variables:
id: identification number
gestation: length of gestation (in days)
age: mother’s age in years at termination of pregnancy
birth_wt: birth weight (in grams)
moth_wt: mother’s weight (in kgs)
parity3: total number of previous pregnancies (0=None, 1=One, 2=Two, 3=Three or more)
currentsmoke: current smoking status (0=No, 1=Yes)
preterm: preterm birth, born at <259 days gestation (0=Term, 1=Preterm)
low_bwt: Babies weighing <2500 grams at birth (0 = No, 1 = Yes)
parity3_f: factor variable for parity3 (0=None, 1=One, 2=Two, 3=Three or more)
currentsmoke_f: factor variable for currentsmoke (No, Yes)
preterm_f: factor variable for preterm (Term, Preterm)
low_bwt_f: factor variable for low_bwt (No, Yes)
Please read the instructions carefully for each task, and take care to use the correct variables in the analyses.
Import data
Copy the following code into an R script file to load the data set and code factor variables.
# Load libraries
library(dplyr)
library(lbm)
library(ResourceSelection)
# Set your working directory to the location of the data file on your computer
dir <- "C:/MYPATH/ONMYCOMPUTER"
# Read data file
gest <- read.csv(paste0(dir, "/CHD_Gestation_subset_2022.csv"),
                 stringsAsFactors = FALSE)
# Code factor variables (levels are given explicitly so that labels cannot
# shift if a category happens to be absent from the data)
gest <- gest %>%
  mutate(
    parity3_f = factor(parity3, levels = 0:3,
                       labels = c("None", "One", "Two", "Three or more")),
    currentsmoke_f = factor(currentsmoke, levels = 0:1, labels = c("No", "Yes")),
    preterm_f = factor(preterm, levels = 0:1, labels = c("Term", "Preterm")),
    low_bwt_f = factor(low_bwt, levels = 0:1, labels = c("No", "Yes"))
  )
1 Summary statistics (200‐250 words)
a) Prepare a table of participant characteristics for the variables listed below, stratified by the binary variable currentsmoke_f, which indicates whether the mother is a current smoker.
• Mother’s Age in years (age)
• Mother’s weight in kgs (moth_wt)
• Previous pregnancies (parity3_f)
• Gestation length in days (gestation)
• Birth weight in grams (birth_wt)
• Preterm birth (<259 days) (preterm_f)
• Low birth weight (<2500 grams) (low_bwt_f)
b) Provide a brief summary for each of the variables describing the differences or similarities based on current smoking status. Use only the data provided, and do not run any statistical tests.
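One possible way to assemble the table in part a) is sketched below in base R. The data frame is simulated purely so that the snippet runs on its own (it is not the CHDS extract), and the helper name cat_summary, together with the choice of mean (SD) and n (%) as summaries, are assumptions rather than requirements of the task.

```r
# Simulated stand-in data (NOT the CHDS extract); column names follow the variable list
set.seed(2022)
n <- 300
gest <- data.frame(
  age          = round(rnorm(n, 27, 5)),
  moth_wt      = round(rnorm(n, 58, 9), 1),
  parity3      = sample(0:3, n, replace = TRUE),
  gestation    = round(rnorm(n, 279, 14)),
  currentsmoke = rbinom(n, 1, 0.4)
)
gest$birth_wt <- round(3400 - 250 * gest$currentsmoke + rnorm(n, sd = 450))
gest$preterm  <- as.integer(gest$gestation < 259)
gest$low_bwt  <- as.integer(gest$birth_wt < 2500)
gest$parity3_f <- factor(gest$parity3, levels = 0:3,
                         labels = c("None", "One", "Two", "Three or more"))
gest$currentsmoke_f <- factor(gest$currentsmoke, levels = 0:1, labels = c("No", "Yes"))
gest$preterm_f <- factor(gest$preterm, levels = 0:1, labels = c("Term", "Preterm"))
gest$low_bwt_f <- factor(gest$low_bwt, levels = 0:1, labels = c("No", "Yes"))

# Continuous variables: mean (SD) within each smoking group
cont_vars <- c("age", "moth_wt", "gestation", "birth_wt")
cont_tab <- t(sapply(cont_vars, function(v) {
  tapply(gest[[v]], gest$currentsmoke_f,
         function(x) sprintf("%.1f (%.1f)", mean(x), sd(x)))
}))

# Categorical variables: n (column %) within each smoking group
# (cat_summary is a hypothetical helper written for this sketch)
cat_summary <- function(v) {
  counts <- table(gest[[v]], gest$currentsmoke_f)
  pct <- prop.table(counts, margin = 2) * 100
  out <- matrix(sprintf("%d (%.1f%%)", as.vector(counts), as.vector(pct)),
                nrow = nrow(counts), dimnames = dimnames(counts))
  rownames(out) <- paste(v, rownames(out), sep = ": ")
  out
}
cat_tab <- do.call(rbind, lapply(c("parity3_f", "preterm_f", "low_bwt_f"), cat_summary))

# Stack both parts into one table with a column per smoking group
table1 <- rbind(cont_tab, cat_tab)
print(table1, quote = FALSE)
```

With the real data, only the simulation lines would be dropped; medians with interquartile ranges may be a better choice than means for skewed variables.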
2 Bivariate relationships
a) Present a scatter plot to display the association between birth weight (birth_wt) and mother’s weight (moth_wt).
b) Calculate the correlation coefficient for the association, and assess the strength of the relationship between the variables.
c) Present a boxplot to display the distribution of birth weight (birth_wt) by current smoking status (currentsmoke_f).
d) Write a brief summary of the associations you observe from parts a) to c) (150‐250 words).
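The three outputs asked for in parts a) to c) could be produced along the following lines. This is a minimal sketch using base R graphics and a Pearson correlation, both of which are assumptions (the task does not name a plotting system or a correlation method), and the data are simulated so that the snippet is runnable without the CSV file.

```r
# Simulated stand-in data (NOT the CHDS extract)
set.seed(2022)
n <- 300
gest <- data.frame(
  moth_wt      = round(rnorm(n, 58, 9), 1),
  currentsmoke = rbinom(n, 1, 0.4)
)
gest$birth_wt <- round(2800 + 10 * gest$moth_wt - 250 * gest$currentsmoke +
                         rnorm(n, sd = 450))
gest$currentsmoke_f <- factor(gest$currentsmoke, levels = 0:1, labels = c("No", "Yes"))

# a) Scatter plot of birth weight against mother's weight
plot(birth_wt ~ moth_wt, data = gest,
     xlab = "Mother's weight (kg)", ylab = "Birth weight (g)",
     main = "Birth weight by mother's weight")

# b) Pearson correlation with its 95% CI and a test of H0: correlation = 0
ct <- cor.test(gest$moth_wt, gest$birth_wt, method = "pearson")
round(c(r = unname(ct$estimate),
        lower = ct$conf.int[1], upper = ct$conf.int[2],
        p_value = ct$p.value), 3)

# c) Boxplot of birth weight by current smoking status
boxplot(birth_wt ~ currentsmoke_f, data = gest,
        xlab = "Current smoker", ylab = "Birth weight (g)",
        main = "Birth weight by smoking status")
```

If the scatter plot suggests a non-linear or outlier-driven pattern, a rank-based alternative (method = "spearman") may be worth reporting alongside or instead.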
3 Univariable associations with birth weight (200‐400 words)
We commence the modelling stage of analysis by fitting univariable regression models to assess the associations between the outcome of interest and key exposure variables. Our outcome of interest for this task is birth weight in grams, and our primary exposure is current smoking status. We are also interested in assessing whether there is any confounding of this association by mother’s weight and preterm birth.
Variables of interest:
Outcome:
‐ Birth weight (birth_wt)
Exposure:
‐ Current smoking status (currentsmoke_f)
Confounders:
‐ Mother’s weight (moth_wt)
‐ Preterm birth (preterm_f)
a) Fit separate univariable (unadjusted) regression models to estimate the association of the outcome birth weight with each of the exposure and confounder variables.
Present the results in a table (combined with the results from Task 4 below). Ensure that you report regression/beta coefficients, 95% confidence intervals and associated p‐values. Prepare the table with a caption, labelled rows and columns and appropriately rounded numbers.
b) Write a summary, interpreting beta coefficients and 95% confidence intervals for current smoking status, preterm birth, and mother’s weight. You should comment on the direction and magnitude of the estimated effect and provide an interpretation of the p‐values. It is not necessary to interpret the intercepts from the models.
Please refer to the descriptions at the start of the assignment for information about each variable to ensure your interpretation is appropriate for the scale of measurement.
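As an illustration of part a), the separate unadjusted models can be fitted in a loop and their slope estimates collected into one table. The sketch below assumes ordinary linear regression via lm(), uses simulated data in place of the CHDS extract, and the object names uni and uni_tab are made up for the example.

```r
# Simulated stand-in data (NOT the CHDS extract)
set.seed(2022)
n <- 500
gest <- data.frame(
  moth_wt      = round(rnorm(n, 58, 9), 1),
  currentsmoke = rbinom(n, 1, 0.4),
  preterm      = rbinom(n, 1, 0.08)
)
gest$birth_wt <- round(2900 + 10 * gest$moth_wt - 250 * gest$currentsmoke -
                         600 * gest$preterm + rnorm(n, sd = 400))
gest$currentsmoke_f <- factor(gest$currentsmoke, levels = 0:1, labels = c("No", "Yes"))
gest$preterm_f <- factor(gest$preterm, levels = 0:1, labels = c("Term", "Preterm"))

# Fit one unadjusted linear model per predictor and keep the slope row only
predictors <- c("currentsmoke_f", "moth_wt", "preterm_f")
uni <- lapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "birth_wt"), data = gest)
  est <- cbind(beta = coef(fit),
               confint(fit, level = 0.95),
               p_value = summary(fit)$coefficients[, "Pr(>|t|)"])
  est[-1, , drop = FALSE]  # drop the intercept
})
uni_tab <- do.call(rbind, uni)
round(uni_tab, 3)
```

Each beta is on the scale of the predictor: grams of birth weight per additional kilogram for moth_wt, and the difference in mean birth weight (in grams) relative to the reference category for the two factors.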
4 Multivariable (adjusted) associations with birth weight (100‐200 words)
In epidemiology/biostatistics we are usually interested in describing the true association between a key exposure and an outcome. To estimate the “true” association we adjust our models for any confounders by adding them to the model to create a multivariable model. We continue the model fitting process with our outcome, birth weight in grams, and primary exposure, current smoking status.
a) Fit the multivariable model including all the variables from Task 3. We include mother’s weight and preterm birth to determine whether there is any evidence that these variables confound the association of birth weight in grams and smoking status.
Present the results for the multivariable regression model in the table prepared for Task 3. Ensure that you report beta coefficients, 95% confidence intervals and the associated p‐values.
b) Write a summary paragraph, interpreting the beta coefficient and 95% confidence interval for the primary exposure variable, current smoking status. You should comment on the direction and magnitude of the estimated effect and provide an interpretation of the p‐value.
Note: you are not required to interpret the estimates for mother’s weight or preterm birth in the multivariable model.
c) Summarise the evidence for a confounding effect of mother’s weight and/or preterm birth on the association of birth weight and current smoking status. (2‐3 sentences).
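A rough way to look at confounding in part c) is to compare the smoking coefficient before and after adjustment. The sketch below does this on simulated data; the percentage change in the estimate is a commonly used heuristic and is an assumption here, since the task does not prescribe a particular criterion.

```r
# Simulated stand-in data (NOT the CHDS extract)
set.seed(2022)
n <- 500
gest <- data.frame(
  moth_wt      = round(rnorm(n, 58, 9), 1),
  currentsmoke = rbinom(n, 1, 0.4),
  preterm      = rbinom(n, 1, 0.08)
)
gest$birth_wt <- round(2900 + 10 * gest$moth_wt - 250 * gest$currentsmoke -
                         600 * gest$preterm + rnorm(n, sd = 400))
gest$currentsmoke_f <- factor(gest$currentsmoke, levels = 0:1, labels = c("No", "Yes"))
gest$preterm_f <- factor(gest$preterm, levels = 0:1, labels = c("Term", "Preterm"))

# Unadjusted (crude) and adjusted linear models for birth weight
crude <- lm(birth_wt ~ currentsmoke_f, data = gest)
adj   <- lm(birth_wt ~ currentsmoke_f + moth_wt + preterm_f, data = gest)

# Adjusted estimates with 95% CIs and p-values (intercept dropped)
adj_tab <- cbind(beta = coef(adj), confint(adj),
                 p_value = summary(adj)$coefficients[, "Pr(>|t|)"])[-1, ]
round(adj_tab, 3)

# Relative change in the smoking coefficient after adjustment
b_crude <- unname(coef(crude)["currentsmoke_fYes"])
b_adj   <- unname(coef(adj)["currentsmoke_fYes"])
pct_change <- 100 * (b_adj - b_crude) / b_crude
round(c(crude = b_crude, adjusted = b_adj, pct_change = pct_change), 1)
```

A small change in the smoking coefficient suggests little confounding by the added variables, whereas a sizeable shift suggests that at least one of them is associated with both smoking and birth weight.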
5 Regression diagnostics (100‐200 words)
It is important to assess the fit of regression models using a range of regression diagnostics. Complete the following tasks for the multivariable regression model from Task 4.
a) Prepare a plot of residuals vs fitted values and present as a formatted figure, including a caption.
b) Prepare a Q‐Q plot and present as a formatted figure, including a caption.
c) Comment on the model fit, and whether these diagnostics are sufficient to judge the model fit.
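The two diagnostic figures can be drawn directly from the fitted lm object. The following is a sketch under the assumption that base R's built-in diagnostic plots are acceptable; the data are simulated and the model object adj stands in for the multivariable model from Task 4.

```r
# Simulated stand-in data (NOT the CHDS extract)
set.seed(2022)
n <- 500
gest <- data.frame(
  moth_wt      = round(rnorm(n, 58, 9), 1),
  currentsmoke = rbinom(n, 1, 0.4),
  preterm      = rbinom(n, 1, 0.08)
)
gest$birth_wt <- round(2900 + 10 * gest$moth_wt - 250 * gest$currentsmoke -
                         600 * gest$preterm + rnorm(n, sd = 400))
gest$currentsmoke_f <- factor(gest$currentsmoke, levels = 0:1, labels = c("No", "Yes"))
gest$preterm_f <- factor(gest$preterm, levels = 0:1, labels = c("Term", "Preterm"))

adj <- lm(birth_wt ~ currentsmoke_f + moth_wt + preterm_f, data = gest)

# a) Residuals vs fitted values: look for curvature or a funnel shape
plot(adj, which = 1)

# b) Normal Q-Q plot of standardised residuals: look for departures from the line
plot(adj, which = 2)

# Optional numeric companions to the plots
res <- resid(adj)
summary(res)
```

These two plots address linearity, constant variance and normality of residuals; they do not by themselves reveal influential observations or collinearity, which is relevant to part c).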
6 Univariable associations with low birth weight coded as a binary variable (150‐300 words)
We now move our attention to the binary outcome variable for low birth weight. Babies weighing less than 2500 grams are classified as low birth weight, and have been coded as 1 in the numeric variable low_bwt, and represented as “Yes” in the derived factor variable low_bwt_f. Babies weighing 2500 grams or more are coded as 0 in low_bwt and represented as “No” in the factor variable low_bwt_f. Although either variable can be used in the analyses, we recommend using the factor variable low_bwt_f since the category labels will help ensure correct interpretation of results.
We continue using the same primary exposure, current smoking status (currentsmoke_f), and confounding variables, mother’s weight (moth_wt) and preterm birth (preterm_f) that were used in Tasks 3 and 4.
Note: this is a cohort study not a case‐control study.
a) Fit a univariable (unadjusted) regression model to estimate the risk of low birth weight (low_bwt_f) associated with current smoking (currentsmoke_f).
b) Fit separate univariable regression models to estimate the risk of low birth weight (low_bwt_f) associated with the potential confounders, mother’s weight (moth_wt) and preterm birth (preterm_f).
c) Present the results in a table (combined with the results from Task 7 below). Ensure that you report the estimated relative risk, 95% confidence intervals and associated p‐value for the exposure variable and confounders. Prepare the table with a caption, labelled rows and columns and appropriately rounded numbers.
d) Write a summary paragraph, interpreting the relative risk and 95% confidence interval for current smoking status, mother’s weight and preterm birth. You should comment on the direction and magnitude of the estimated effect and provide an interpretation of the p‐values.
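Because this is a cohort study, relative risks can be estimated directly, for instance with a log‐binomial model. The sketch below uses base R's glm() with a log link as a stand‐in; the assignment's script loads the lbm package, whose interface is not shown in the text, so it is not used here. The data are simulated, the helpers fit_rr and rr_row are hypothetical, and Wald intervals from confint.default() are an assumption.

```r
# Simulated stand-in data (NOT the CHDS extract); the outcome is drawn from a
# log-linear risk model so that the log-binomial fit is well behaved
set.seed(2022)
n <- 3000
gest <- data.frame(
  moth_wt      = round(rnorm(n, 58, 9), 1),
  currentsmoke = rbinom(n, 1, 0.4),
  preterm      = rbinom(n, 1, 0.08)
)
p_low <- exp(-4 + 0.7 * gest$currentsmoke + 1.5 * gest$preterm -
               0.02 * (gest$moth_wt - 58))
gest$low_bwt <- rbinom(n, 1, p_low)
gest$currentsmoke_f <- factor(gest$currentsmoke, levels = 0:1, labels = c("No", "Yes"))
gest$preterm_f <- factor(gest$preterm, levels = 0:1, labels = c("Term", "Preterm"))
gest$low_bwt_f <- factor(gest$low_bwt, levels = 0:1, labels = c("No", "Yes"))

# Hypothetical helper: log-binomial model with starting values inside the
# valid region (intercept = log of overall risk, slopes = 0)
fit_rr <- function(rhs) {
  f <- reformulate(rhs, response = "low_bwt_f")
  k <- ncol(model.matrix(f, data = gest)) - 1
  glm(f, family = binomial(link = "log"), data = gest,
      start = c(log(mean(gest$low_bwt)), rep(0, k)))
}

# Hypothetical helper: relative risk, Wald 95% CI and p-value (intercept dropped)
rr_row <- function(fit) {
  est <- exp(cbind(RR = coef(fit), confint.default(fit)))
  out <- cbind(est, p_value = summary(fit)$coefficients[, "Pr(>|z|)"])
  out[-1, , drop = FALSE]
}

uni_rr <- do.call(rbind, lapply(c("currentsmoke_f", "moth_wt", "preterm_f"),
                                function(p) rr_row(fit_rr(p))))
round(uni_rr, 3)
```

A relative risk above 1 indicates a higher risk of low birth weight than in the reference group (or per additional kilogram for moth_wt), and below 1 a lower risk. Log‐binomial models can fail to converge on real data, in which case a Poisson model with robust standard errors is a commonly used alternative.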
7 Multivariable associations with low birth weight (100‐200 words)
The next task is to fit a multivariable model to estimate the risk of low birth weight associated with current smoking, after adjusting for potential confounders.
a) Fit the multivariable model including all the variables from Task 6. We include mother’s weight and preterm birth to determine whether there is any evidence that these variables confound the association of low birth weight with current smoking.
b) Present the results for the multivariable regression model in the table prepared for Task 6. Ensure that you report the relative risk, 95% confidence interval and associated p‐value for each variable.
c) Write a summary, interpreting the relative risk and 95% confidence interval for the primary exposure variable: current smoking status. You should comment on the direction and magnitude of the estimated effect and provide an interpretation of the p‐value.
Note: you are not required to interpret the relative risk estimates for the confounders, preterm birth and mother’s weight.
d) Summarise the evidence for a confounding effect of preterm birth and/or mother’s weight on the association of low birth weight with current smoking status. (2‐3 sentences).
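For the adjusted analysis, the crude and adjusted relative risks for smoking can be placed side by side. As before, this is only a sketch: it assumes a log‐binomial model fitted with base R's glm() rather than the lbm package loaded in the assignment's script, uses Wald confidence intervals, and runs on simulated data.

```r
# Simulated stand-in data (NOT the CHDS extract)
set.seed(2022)
n <- 3000
gest <- data.frame(
  moth_wt      = round(rnorm(n, 58, 9), 1),
  currentsmoke = rbinom(n, 1, 0.4),
  preterm      = rbinom(n, 1, 0.08)
)
p_low <- exp(-4 + 0.7 * gest$currentsmoke + 1.5 * gest$preterm -
               0.02 * (gest$moth_wt - 58))
gest$low_bwt <- rbinom(n, 1, p_low)
gest$currentsmoke_f <- factor(gest$currentsmoke, levels = 0:1, labels = c("No", "Yes"))
gest$preterm_f <- factor(gest$preterm, levels = 0:1, labels = c("Term", "Preterm"))
gest$low_bwt_f <- factor(gest$low_bwt, levels = 0:1, labels = c("No", "Yes"))

# Starting intercept: log of the overall risk (keeps fitted risks below 1)
start0 <- log(mean(gest$low_bwt))

crude <- glm(low_bwt_f ~ currentsmoke_f,
             family = binomial(link = "log"), data = gest,
             start = c(start0, 0))
adj <- glm(low_bwt_f ~ currentsmoke_f + moth_wt + preterm_f,
           family = binomial(link = "log"), data = gest,
           start = c(start0, 0, 0, 0))

# Adjusted relative risks with Wald 95% CIs and p-values (intercept dropped)
rr_adj <- cbind(RR = exp(coef(adj)),
                exp(confint.default(adj)),
                p_value = summary(adj)$coefficients[, "Pr(>|z|)"])[-1, ]
round(rr_adj, 3)

# Crude vs adjusted relative risk for current smoking
rr_compare <- c(crude    = unname(exp(coef(crude)["currentsmoke_fYes"])),
                adjusted = unname(rr_adj["currentsmoke_fYes", "RR"]))
round(rr_compare, 3)
```

If the adjusted relative risk is close to the crude one, there is little evidence that mother's weight or preterm birth confounds the smoking association; a marked shift points the other way.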
8 Goodness of fit (2‐3 sentences)
Assess the model fit for the multivariable model using the Hosmer‐Lemeshow test. Report the results (chi‐squared value, df, p‐value) and provide a brief interpretation of the test.
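The test can be run with hoslem.test() from the ResourceSelection package, which the assignment's script loads. The sketch below applies it to a multivariable model fitted on simulated data; the choice of g = 10 groups is the function's default, and fitting the model with glm() and a log link is an assumption carried over for illustration.

```r
# Simulated stand-in data (NOT the CHDS extract)
set.seed(2022)
n <- 3000
gest <- data.frame(
  moth_wt      = round(rnorm(n, 58, 9), 1),
  currentsmoke = rbinom(n, 1, 0.4),
  preterm      = rbinom(n, 1, 0.08)
)
p_low <- exp(-4 + 0.7 * gest$currentsmoke + 1.5 * gest$preterm -
               0.02 * (gest$moth_wt - 58))
gest$low_bwt <- rbinom(n, 1, p_low)
gest$currentsmoke_f <- factor(gest$currentsmoke, levels = 0:1, labels = c("No", "Yes"))
gest$preterm_f <- factor(gest$preterm, levels = 0:1, labels = c("Term", "Preterm"))
gest$low_bwt_f <- factor(gest$low_bwt, levels = 0:1, labels = c("No", "Yes"))

adj <- glm(low_bwt_f ~ currentsmoke_f + moth_wt + preterm_f,
           family = binomial(link = "log"), data = gest,
           start = c(log(mean(gest$low_bwt)), 0, 0, 0))

# Hosmer-Lemeshow test: observed 0/1 outcome vs fitted risks, in 10 groups
hl <- NULL
if (requireNamespace("ResourceSelection", quietly = TRUE)) {
  hl <- ResourceSelection::hoslem.test(x = gest$low_bwt, y = fitted(adj), g = 10)
  print(c(chi_squared = unname(hl$statistic),
          df = unname(hl$parameter),
          p_value = hl$p.value))
}
```

A large p‐value gives no evidence of a discrepancy between observed and predicted risks across the groups, while a small p‐value suggests poor calibration. The result can be sensitive to the number of groups chosen.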