|
Unnamed: 0,ID,Question,Level,Count,Status,Answer |
|
1,V1,What is the difference between a sample and population?,Very Easy,0,Not Asked, |
|
2,V2,Why do we collect or analyse a sample for inferential statistics? Why not study the whole population?,Very Easy,0,Not Asked, |
|
3,V3,,Very Easy,0,Not Asked, |
|
4,V4,Define numerical variable and categorical variable. Provide examples.,Very Easy,0,Not Asked, |
|
5,V5,,Very Easy,0,Not Asked, |
|
6,V6,,Very Easy,0,Not Asked, |
|
7,V7,,Very Easy,0,Not Asked, |
|
8,V8,What is null hypothesis and alternative hypothesis? Give an example.,Very Easy,0,Not Asked, |
|
9,V9,,Very Easy,0,Not Asked, |
|
10,V10,,Very Easy,0,Not Asked, |
|
11,V11,,Very Easy,0,Not Asked, |
|
12,V12,The slope parameter $ \hat{\beta} $ of a simple linear regression is +2. Interpret this number.,Very Easy,0,Not Asked, |
|
13,V13,The slope parameter $ \hat{\beta} $ of a simple linear regression is -2. Interpret this number.,Very Easy,0,Not Asked, |
|
14,E1,What are the major types of data statisticians work with?,Easy,0,Not Asked, |
|
15,E2,What are some of the challenges or issues in sampling of data? List out any three.,Easy,0,Not Asked, |
|
16,E3,,Easy,0,Not Asked, |
|
17,E4,What does standard deviation in a dataset tell us? What does it mean when standard deviation is small or large?,Easy,0,Not Asked, |
|
18,E5,Provide examples for discrete and continuous data.,Easy,0,Not Asked, |
|
19,E6,List out one difference between discrete and continuous variables. Provide an example of each.,Easy,0,Not Asked, |
|
20,E7,,Easy,0,Not Asked, |
|
21,E8,,Easy,0,Not Asked, |
|
22,E9,Explain any symmetric distribution - either discrete or continuous.,Easy,0,Not Asked, |
|
23,E10,,Easy,0,Not Asked, |
|
24,E11,List out properties of the normal distribution. Mention any three.,Easy,0,Not Asked, |
|
25,E12,Properties of continuous random variables. Mention any three.,Easy,0,Not Asked, |
|
26,E13,What is the area under the curve for a continuous distribution?,Easy,0,Not Asked,The total area under the curve of a continuous probability distribution equals 1. The area over a given range of values is the probability that the random variable falls within that range. |
|
27,E14,How do the mean and standard deviation affect the shape of the normal distribution?,Easy,0,Not Asked, |
|
28,E15,,Easy,0,Not Asked, |
|
29,E16,What is an outlier and what are the ways of dealing with them?,Easy,0,Not Asked, |
|
30,E17,,Easy,0,Not Asked, |
|
31,E18,What is the meaning of null-hypothesis? What is the objective of hypothesis testing?,Easy,0,Not Asked, |
|
32,E19,What are the steps in hypothesis testing?,Easy,0,Not Asked,The steps in hypothesis testing typically include: A) Formulating null and alternative hypotheses. B) Choosing the significance level (alpha). C) Collecting data and calculating a test statistic. D) Comparing the test statistic to a critical value or calculating a p-value. E) Making a decision to either reject or fail to reject the null hypothesis. F) Drawing conclusions based on the decision. |
|
33,E20,What is the z-score in the standard normal distribution? What does it measure?,Easy,0,Not Asked,The z-score in the standard normal distribution measures how many standard deviations a data point is from the mean. It provides information about the relative position of a data point within a normal distribution. |
|
34,E21,,Easy,0,Not Asked, |
|
35,E22,Write down an example for a null and two-sided alternative hypothesis.,Easy,0,Not Asked,Example of a null and two-sided alternative hypothesis: Null Hypothesis (H0): The mean test scores of Group A and Group B are equal. Alternative Hypothesis (H1 or Ha): The mean test scores of Group A and Group B are not equal. |
|
36,E23,Write down an example for a null and one-sided alternative hypothesis.,Easy,0,Not Asked,Example of a null and one-sided alternative hypothesis: Null Hypothesis (H0): The new treatment has no effect on the recovery time. Alternative Hypothesis (H1 or Ha): The new treatment decreases the recovery time. |
|
37,E24,State the null and alternative hypothesis when testing variances from two independent populations.,Easy,0,Not Asked,Null and alternative hypotheses when testing variances from two independent populations: Null Hypothesis (H0): The variances of Population 1 and Population 2 are equal ($\sigma_1^2 = \sigma_2^2$). Alternative Hypothesis (H1 or Ha): The variances of Population 1 and Population 2 are not equal ($\sigma_1^2 \neq \sigma_2^2$). |
|
38,E25,State the null and alternative hypothesis when testing for difference of means from two independent populations. Either One-tail or two-tail.,Easy,0,Not Asked,Null and alternative hypotheses when testing for the difference of means from two independent populations can be either one-tailed or two-tailed: A) Two-Tailed: Null Hypothesis (H0): The means of Population 1 and Population 2 are equal ($\mu_1 = \mu_2$). Alternative Hypothesis (H1 or Ha): The means of Population 1 and Population 2 are not equal ($\mu_1 \neq \mu_2$). B) One-Tailed (Left): Null Hypothesis (H0): The mean of Population 1 is greater than or equal to the mean of Population 2 ($\mu_1 \geq \mu_2$). Alternative Hypothesis (H1 or Ha): The mean of Population 1 is less than the mean of Population 2 ($\mu_1 < \mu_2$). |
|
39,E26,Both the equal-variances and unequal variances techniques require that populations be normally distributed. How can you check if this requirement is satisfied?,Easy,0,Not Asked, |
|
40,E27,What is a confidence interval?,Easy,0,Not Asked, |
|
41,E28,,Easy,0,Not Asked, |
|
42,E29,What is the significance level $ \alpha $ (alpha)? What is the relationship between $ \alpha $ and the confidence level?,Easy,0,Not Asked, |
|
43,E30,,Easy,0,Not Asked,The choice of significance level $ \alpha $ depends on the specific research goals and the acceptable level of risk for making Type I errors (false positives). Common choices are $ \alpha = 0.05 $ (5% significance) and $ \alpha = 0.01 $ (1% significance). The decision depends on the trade-off between the desire for strong evidence (lower $ \alpha $) and the potential for false positives. |
|
44,E31,,Easy,0,Not Asked, |
|
45,E32,,Easy,0,Not Asked, |
|
46,E33,,Easy,0,Not Asked,The Student's t-statistic or t-distribution is used when dealing with small sample sizes (typically less than 30) or when the population variance is unknown. It is used for hypothesis testing and confidence interval estimation when the population variance is not known and must be estimated from the sample data. |
|
47,E34,"For a study on variance, the alternative hypothesis is $ H_1: \sigma^2 < 1$. What is the null hypothesis for this problem?",Easy,0,Not Asked,"For a study on variance with the alternative hypothesis $ H_1: \sigma^2 < 1$, the null hypothesis would be $H_0: \sigma^2 \geq 1$." |
|
48,E35,What are confidence intervals used for? How can you increase or decrease the width of a confidence interval?,Easy,0,Not Asked,"Confidence intervals provide a range of plausible values for an unknown population parameter, which quantifies the uncertainty of an estimate. A higher confidence level (e.g., 99% instead of 95%) widens the interval, and a lower confidence level (e.g., 90% instead of 95%) narrows it. For a fixed confidence level, a larger sample size narrows the interval and a smaller sample size widens it." |
|
49,E36,Why do we use OLS linear regression models?,Easy,0,Not Asked,"Ordinary Least Squares (OLS) linear regression models are used to model and understand the relationship between dependent and independent variables in a dataset. They are valuable for predicting outcomes, identifying relationships, and assessing the impact of independent variables on the dependent variable." |
|
50,E37,"An analysis relates the age of used cars (in years) to their price (in USD), using data on a specific type of car. In a linear regression of the price on age, the slope parameter $\hat{\beta}$ is -700. Interpret the coefficient.",Easy,0,Not Asked,"A slope parameter $\hat{\beta}$ of -700 in the regression of price on age means that, on average, the price of a car is 700 USD lower for each additional year of age." |
|
51,E38,"An analysis relates the size of apartments (in square meters) to their price (in USD), using data from one city. In a linear regression of the price on size, the slope parameter $\hat{\beta}$ is +600. Interpret the coefficient.",Easy,0,Not Asked,"A slope parameter $\hat{\beta}$ of +600 in the regression of price on size means that, on average, the price of an apartment is 600 USD higher for each additional square meter of size." |
|
52,E39,"In our linear regression model output, an R-squared ($R^2$) is reported. What does it mean and what do we use it for?",Easy,0,Not Asked,"The R-squared ($R^2$) in linear regression measures the proportion of the variance in the dependent variable that is explained by the independent variables. It indicates the goodness of fit of the model, with higher values indicating a better fit." |
|
53,E40,What are dummy or indicator variables? Provide examples.,Easy,0,Not Asked,"Dummy or indicator variables are used to represent categorical data in regression analysis. They are binary variables (0 or 1) that indicate the presence or absence of a category. For example, in a regression model for car types, you might have a dummy variable ""SUV"" taking the value 1 if it's an SUV and 0 otherwise. |
|
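To include nominal independent variables in regression analysis, you can create dummy variables and include them in the model. For a variable with k categories, create k - 1 dummy variables and treat the omitted category as the reference; including all k dummies together with an intercept makes them perfectly collinear. For example, if you have a Color variable with categories Red, Blue, and Green, you would create two dummy variables (e.g., RedDummy and BlueDummy, with Green as the reference category) and use them in the regression. |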
|
The correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. A positive correlation means that as one variable increases, the other tends to increase, while a negative correlation means that as one variable increases, the other tends to decrease. |
|
In the linear regression model, why do we require variation in values of X? What happens if there is no variation in values of X? In linear regression, we require variation in the values of the independent variable (X) to estimate the relationship with the dependent variable (Y). If there is no variation in the values of X (e.g., all X values are the same), the slope cannot be estimated, because the OLS slope formula divides by the sample variance of X, which is then zero. |
|
An outlier is an extreme data point that deviates significantly from the majority of the data. Outliers can be a problem in analysis because they can distort the regression model and lead to incorrect conclusions. Dealing with outliers can involve removing them, transforming the data, or using robust statistical techniques. |
|
In experimental design - what is a treatment group, and what is a control group? In experimental design, a treatment group is a group of subjects or items that are exposed to a specific treatment or intervention being studied. A control group is a group that is treated identically to the treatment group but does not receive the experimental treatment. The control group serves as a baseline for comparison to assess the impact of the treatment. |
|
What is the empirical rule, and when can it be helpful? The empirical rule, also known as the 68-95-99.7 rule, states that in a normal distribution, approximately 68% of the data falls within one standard deviation of the mean, about 95% falls within two standard deviations, and approximately 99.7% falls within three standard deviations. It can be helpful for quickly understanding the distribution of data and assessing the likelihood of values falling within certain ranges. |
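The percentages in the empirical rule can be checked numerically. The following is a minimal sketch, assuming a standard normal variable Z and using only the Python standard library; the helper name normal_within is hypothetical and introduced here for illustration.

```python
import math

def normal_within(k):
    # P(|Z| <= k) for a standard normal Z, computed from the error function:
    # P(|Z| <= k) = erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2))

# Coverage within 1, 2 and 3 standard deviations of the mean
coverage = {k: normal_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f'within {k} standard deviation(s): {p:.4f}')
```

The three printed probabilities should round to roughly 68%, 95% and 99.7%, matching the rule.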
|
The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution. It is important in statistics because it allows us to make inferences about population parameters using the properties of the normal distribution, even when the population itself may not be normally distributed. |
|
The Law of Large Numbers (LLN) states that as the sample size increases, the sample mean approaches the true population mean. In other words, with a sufficiently large sample, the sample mean becomes a more accurate estimate of the population mean. |
|
A sampling distribution is the distribution of a statistic (e.g., sample mean or sample proportion) computed from multiple random samples drawn from the same population. It provides information about the variability of the statistic and allows for statistical inference. |
|
Interpret the formula for z-score; $z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}$. What are the numerator and denominator measuring? What does the z-score tell us about the sample mean, say if the value of z=1.5 or z=2? The formula $z = (\bar{x} - \mu) / (\sigma / \sqrt{n})$ calculates how many standard errors the sample mean $\bar{x}$ lies from the population mean $\mu$, where $\sigma$ is the population standard deviation and $n$ is the sample size. The numerator measures the difference between the sample mean and the population mean, while the denominator is the standard error of the sample mean. A z-score of 1.5 or 2 indicates that the sample mean is 1.5 or 2 standard errors above the population mean, respectively. |
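The z-score formula above can be worked through with a small example. This is a minimal sketch with made-up numbers (a sample mean of 103, a population mean of 100, a population standard deviation of 10 and a sample size of 25); the function name z_score is hypothetical.

```python
import math

def z_score(sample_mean, pop_mean, pop_sd, n):
    # Denominator: standard error of the sample mean, sigma / sqrt(n)
    standard_error = pop_sd / math.sqrt(n)
    # Numerator: distance between the sample mean and the population mean
    return (sample_mean - pop_mean) / standard_error

# Illustrative values only: standard error = 10 / sqrt(25) = 2
z = z_score(sample_mean=103.0, pop_mean=100.0, pop_sd=10.0, n=25)
print(z)
```

With these numbers the sample mean lies 3 units above the population mean and the standard error is 2, so the sample mean is 1.5 standard errors above the population mean.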
|
In a study comparing two independent samples using a difference of means test, you could investigate whether a new teaching method improves students' test scores compared to a traditional teaching method. The hypothesis could be: Null Hypothesis (H0): The mean test scores of students taught using the new method are equal to the mean test scores of students taught using the traditional method. Alternative Hypothesis (H1 or Ha): The mean test scores of students taught using the new method are not equal to the mean test scores of students taught using the traditional method." |
|
65,M7,Can you give an example of when to run a paired t-test and when to run an independent t-test?,Moderate,0,Not Asked,"You would run a paired t-test when you have two related groups or measurements, such as before and after measurements on the same individuals. An independent t-test is used when you are comparing the means of two separate and unrelated groups, like comparing test scores between two different schools." |
|
66,M8,"Explain Chi-square goodness of fit tests application, provide an example.",Moderate,0,Not Asked,"Chi-square goodness of fit tests are used to determine if observed categorical data fits an expected distribution. For example, you might use it to test whether the observed distribution of eye colors in a population matches the expected distribution based on a genetic model." |
|
67,M9,Explain Type I and Type II errors. Provide an example.,Moderate,0,Not Asked,"Type I error occurs when you reject a true null hypothesis (false positive), while Type II error occurs when you fail to reject a false null hypothesis (false negative). An example of Type I error is convicting an innocent person (rejecting the null hypothesis of innocence), and an example of Type II error is failing to convict a guilty person (failing to reject the null hypothesis of innocence)." |
|
68,M10,"In regression diagnostics, what does the following figure tell us: Histogram of residuals.",Moderate,0,Not Asked,"A histogram of residuals in regression diagnostics tells us about the distribution of the differences between observed and predicted values (residuals). It helps assess whether the residuals are approximately normally distributed, which is an assumption of linear regression." |
|
69,M11,"In regression diagnostics, what does the following figure tell us: Plot of residuals versus predicted values of y.",Moderate,0,Not Asked,"A plot of residuals versus predicted values of y in regression diagnostics helps identify patterns or trends in the residuals. It is used to check for heteroscedasticity, which is when the variability of the residuals changes across different levels of the independent variable." |
|
70,M12,Explain Heteroskedasticity.,Moderate,0,Not Asked,"Heteroskedasticity is a statistical term that describes the situation where the variability of the residuals in a regression model is not constant across different levels of the independent variable(s). In other words, the spread or dispersion of the residuals changes as the values of the predictor(s) change." |
|
71,M13,What is multicollinearity? How can we address this issue?,Moderate,0,Not Asked,"Multicollinearity refers to a situation in which two or more independent variables in a regression model are highly correlated with each other. It can be addressed by removing one of the correlated variables, using dimensionality reduction techniques, or combining the correlated variables into a composite variable." |
|
72,M14,How do you interpret regression coefficients in a logistic regression?,Moderate,0,Not Asked,"In logistic regression, regression coefficients represent the change in the log-odds of the dependent variable for a one-unit change in the independent variable. They indicate how the probability of the binary outcome changes as the independent variable(s) change." |
|
73,M15,"Provide an example of experimental design; what is the outcome of interest, explain treatment and control groups in the study.",Moderate,0,Not Asked,"An example of experimental design might involve testing a new drug's effectiveness for reducing blood pressure. The outcome of interest is blood pressure reduction, with the treatment group receiving the new drug, and the control group receiving a placebo. The study aims to compare the drug's effect on blood pressure between the two groups." |
|
74,M16,What is the difference between a correlation matrix and a linear regression model?,Moderate,0,Not Asked,"A correlation matrix shows the pairwise correlations between variables, indicating the strength and direction of linear relationships. A linear regression model, on the other hand, examines the relationship between a dependent variable and one or more independent variables, providing coefficients that represent the effect of the independent variables on the dependent variable." |
|
75,M17,Why can we not use linear regression models when the dependent variable is binary (0/1) or a choice variable?,Moderate,0,Not Asked,"Linear regression models are not well suited to dependent variables that are binary (0/1) or choice variables because predicted values can fall outside the 0 to 1 range and the assumptions of constant variance and normally distributed residuals do not hold. Instead, logistic regression models are typically used for binary outcomes." |
|
76,M18,Can we assume that the regression model with the highest number of predictors is the best model? Why or why not.,Moderate,0,Not Asked,"We cannot assume that the regression model with the highest number of predictors is the best model because including more predictors can lead to overfitting, where the model performs well on the training data but poorly on new data. The choice of the best model should consider model fit, simplicity, and predictive performance on independent data." |
|
77,M19,What is the advantage of using multiple regression model as compared to a simple regression model?,Moderate,0,Not Asked,"The advantage of using a multiple regression model compared to a simple regression model is that it allows for the consideration of multiple independent variables simultaneously, capturing the potential joint effects of these variables on the dependent variable. This can lead to a more comprehensive understanding of the relationships in the data and improved predictive accuracy." |
|
|