[ ]: # Initialize Otter
import otter
grader = otter.Notebook("ps4.ipynb")

1 Econ 140 – Problem Set 4

Before getting started on the assignment, run the cell at the very top that imports otter and the cell below, which imports the packages we need.

Important: As mentioned in Problem Set 0, if you leave this notebook alone for a while and come back, DataHub will “forget” which code cells you have run in order to save memory, and you may need to restart your kernel and run all of the cells from the top. That includes this code cell that imports packages. If you get “not defined” errors, it is because you did not run an earlier code cell that you needed to run. It might be this cell or the otter cell above.

[4]: import numpy as np

import pandas as pd

import statsmodels.api as sm

1.1 Problem 1. Efficient Markets Hypothesis

Does the stock market efficiently use information in valuing stocks? The Efficient Markets Hypothesis (“EMH”), developed by Nobel Prize winner Eugene Fama, maintains that current stock prices fully reflect all available information. An implication of this hypothesis is that returns in the current period should not be systematically related to information known in earlier periods. Otherwise, we could use this information to predict stock returns, thus violating EMH.

As an analyst at an investment management company, you have been tasked with examining the validity of the EMH. You obtained a dataset of 142 randomly selected firms listed on the New York Stock Exchange, consisting of the following four variables:

| Variable | Description |
|---|---|
| return | Total return from holding a firm’s stock over a one-year period, from January 2014 to December 2014. Note that an annual return such as 31.4% is entered in the dataset as 31.4. |
| dkr | A firm’s debt to capital ratio in 2013. |
| lnetincome | Natural log of the net income for a firm in 2013. |
| lsalary | Natural log of the total compensation for a firm’s CEO in 2013. |

Using these data, you estimated the following two regressions.

Regression 1

Regression 2

Question 1.a. Based on the results for the two OLS regressions, what is the sign of the correlation

between dkr and lnetincome? Alternatively, is there not enough information to determine the sign

of the correlation?

Type your answer here, replacing this text.

Question 1.b. Interpret the coefficient on lnetincome in Regression 2.

Type your answer here, replacing this text.

Now suppose you added another variable to the regression, and obtained the following regression

results.

Regression 3

Question 1.c. Suppose that you use Regression 3 to examine whether EMH holds. What are the

null and alternative hypotheses?

Type your answer here, replacing this text.

Question 1.d. Carry out the test in part (c) at the 5% level. Do you reject or fail to reject the

null hypothesis?

Type your answer here, replacing this text.

Question 1.e. Interpret the result you obtained in part (d), in light of your task of examining the

validity of EMH.

Type your answer here, replacing this text.

Question 1.f. Provide (at least) two reasons why there might be imperfect multicollinearity

present in Regression 3.

Type your answer here, replacing this text.

Question 1.g. Which of the following statements is true based on a comparison of Regression 2 and Regression 3?

– (i) dkr and lnetincome are highly correlated.
– (ii) dkr and lsalary are highly correlated.
– (iii) lnetincome and lsalary are highly correlated.
– (iv) All of the above.
– (v) None of the above.

Type your answer here, replacing this text.

Question 1.h. The sample of 142 stocks includes only companies that were traded on the NYSE

as of the end of 2013. A company that went out of business, for instance, before the end of that

year could not enter the sample. How would this sampling affect the estimated coefficient relative

to the population regression?

Type your answer here, replacing this text.

1.2 Problem 2. Airlines and Antitrust

Antitrust authorities have long been concerned that airline carriers may exercise their market power

by charging higher fares. The greatest concern arises when one airline runs the vast majority of

flights in and out of an airport. Usually this happens when an airline designates an airport as

a national or regional “hub” of their operations. The dataset airfares.csv consists of average

fares and other characteristics of popular U.S. origin-destination pairs (e.g., Boston-Chicago) for

the year 2000.

| Variable | Description | Units |
|---|---|---|
| lfare | logarithm of the average fare on the route | log of fare in 2000 dollars |
| dist | distance of the route | thousands of miles |
| passen | average number of passengers per day | thousands of passengers |
| concen | market share of biggest airline carrier on the route, measured in terms of passengers carried | fraction (e.g., 0.55 = 55% market share) |
| origin | city of origin of flight | |
| destin | city of destination of flight | |

[ ]: af = pd.read_csv("airfares.csv")

af.head()

Question 2.a. Regress lfare on dist, passen and concen, with robust standard errors. Make

sure the cell below (and all regression questions in this assignment) shows your regression results

like you’ve done in previous assignments, otherwise we cannot give credit. This assignment will be

a little less guided. Make sure to use different variable names for each separate coding part to avoid

unexpected errors from reusing variables. Refer to previous assignments if you need a refresher on

how we performed different regressions. Don’t forget to add a constant to your regressions.

Question 2.b. What is the interpretation of the coefficient on passen?

Type your answer here, replacing this text.

Question 2.c. Based on your OLS estimates, and assuming the OLS assumptions hold, what is the partial

effect of the market share of the largest carrier on air fares? Is your answer consistent with the

hypothesis that firms use their market power to charge higher prices?

Type your answer here, replacing this text.

Question 2.d. How would you test whether market power is used the same way on more popular

and less popular routes? Write down the model and the hypothesis, carry out the estimation and

the test.

This question is for your code, the next is for your explanation.

Question 2.e. Explain.

Type your answer here, replacing this text.

Question 2.f. We need to question whether the results of the regression in part (d) are revealing

a causal relationship between concentration and airfares. In particular, we are concerned whether

our estimation results on U.S. data are valid for other markets, such as Europe and Asia. Give one

reason why the results would not be “externally valid” if applied to the airline industry in one of

these other two regions.

Type your answer here, replacing this text.

Question 2.g. We are also aware of several potential threats to “internal validity” of the results.

For each one of the five main internal validity threats, describe one possibility that could plausibly

lead to that particular threat.

Type your answer here, replacing this text.

1.3 Problem 3. World Health Organization

The World Health Organization (“WHO”) collects data which assesses the health care outcomes

of the populations in 191 countries across the globe, as well as exploring potential explanations for

those outcomes. These data are published in the annual “World Health Report.” The file who.csv

contains five years (1993-1997) of these data. The variables in the panel of countries include:

| Variable | Description |
|---|---|
| comp | composite measure of health care attainment |
| dale | disability-adjusted life expectancy |
| year | 1993, 1994, 1995, 1996, 1997 |
| hexp | per capita health expenditure |
| hc3 | educational attainment (tertiary schooling) |
| country | number assigned to country |
| oecd | dummy indicator for an OECD member country |
| gini | Gini coefficient for income inequality |
| geff | World Bank measure of government effectiveness |
| voice | World Bank measure of democratization of the political process |
| tropics | dummy indicator of tropical location |
| popden | population density (people per square mile) |
| pubthe | proportion of health expenditure paid by public authorities |
| gdpc | normalized per-capita GDP |

[5]: who = pd.read_csv("who.csv")

who.head()

Question 3.a. Create a new variable for the dataset that is the square of educational attainment

(hc3). Then regress life expectancy (dale) on health expenditures (hexp), the educational attainment

in the country (hc3), and its square (the variable you created). For now, select rows from

1997 and use only these rows in the regression. Use robust standard errors and don’t forget to

add a constant term. Comment on whether you think the relationship between life expectancy and

education is linear or quadratic and why you came to that conclusion.

This question is for your code, the next is for your explanation.
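The two data-preparation steps can be sketched on a tiny made-up frame (not who.csv). The column name hc3^2 is chosen to match the name the starter code in Question 3.h selects; the values are invented.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the variable table.
toy_who = pd.DataFrame({"year": [1996, 1997, 1997], "hc3": [2.0, 3.0, 4.0]})

toy_who["hc3^2"] = toy_who["hc3"] ** 2           # squared term as a new column
toy_who_1997 = toy_who[toy_who["year"] == 1997]  # keep only the 1997 rows
print(toy_who_1997)
```

Creating the squared column before filtering means the full-panel frame also has it, which the later fixed-effects questions rely on.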

Question 3.b. Explain.

Type your answer here, replacing this text.

Question 3.c. To the specification in part (a), add the additional control variables: gini, tropics,

popden, pubthe, gdpc, voice, and geff. Test whether these additional regressors are jointly

significant (we do the F-test for you in this part, you just have to interpret it). What effect does

inclusion of these additional controls have on the coefficients of the other included regressors?

This question is for your code, the next is for your explanation.

[7]: # This is the code for your regression.

# We give you starter code for this one so that we know what the variable name is

# for the regression results, which we use in the code cell below.

model_3b = …

results_3b = …

results_3b.summary()

[8]: # Please don’t change this cell, just run it.

# This is how you do an F-test. Notice that we do .f_test on the results of the

# unrestricted model, and then we give the names of the variables we want to

# test inside quotation marks.

results_3b.f_test("gini, tropics, popden, pubthe, gdpc, voice, geff").summary()

Question 3.d. Explain.

Type your answer here, replacing this text.

Question 3.e. Return to the simpler regression specification in part (a). We want to see if the

determinants of life expectancy are different for rich and poor countries. Use membership in the

“Organisation for Economic Co-operation and Development” (oecd) as the indicator of a rich country.

The OECD had 30 member countries during this time period. Perform a test of the hypothesis

that all three of the coefficients in the population regression are equal for OECD and non-OECD

countries.

Hint: You will need to create three new variables.

This question is for your code, the next is for your explanation.

[52]: # This extra code cell may be helpful

…

Question 3.f. Explain.

Type your answer here, replacing this text.

Question 3.g. Give an example of a time-invariant variable that would result in different life

expectancy across countries.

Type your answer here, replacing this text.

Question 3.h. Estimate the regression having a fixed effect for each country in the sample. We

have defined the endogenous and exogenous variables for you, you just have to fill in the rest.

Notice how we converted the country variable to a set of dummy variables for each country. You

can ignore the coefficients for every country variable. What change took place in the coefficients

on the education variables? Explain why you think there was a change in these coefficients.

This question is for your code, the next is for your explanation.

[49]: # .get_dummies transforms a categorical variable into a dataframe of dummy variables,
# one for each category. The prefix and prefix_sep part just makes sure the variable
# names are strings and not integers.
countries = pd.get_dummies(who['country'], prefix='', prefix_sep='')

# This just joins the dummy dataframe with the original
who_country = who[['dale', 'hexp', 'hc3', 'hc3^2']].join(countries)
y_3h = who_country['dale']

# Here we drop country 191, since otherwise there would be perfect collinearity in
# the columns. We also have to drop dale since that’s the endogenous variable we
# regress on.
X_3h = sm.add_constant(who_country.drop(columns=['dale', '191']))

model_3h = sm.OLS(…, …)

results_3h = model_3h.fit(…)

results_3h.summary()

Question 3.i. Explain.

Type your answer here, replacing this text.

Question 3.j. Give an example of an entity-invariant variable, which is excluded from the estimated regression model in part (a), that would result in variation in life expectancy over time.

Type your answer here, replacing this text.

Question 3.k. Perform a regression with time fixed effects. Are the results consistent with your

reasoning about the entity-invariant variables? The procedure for this question will be similar to

3.h. Drop the dummy variable for 1993 for this question.

This question is for your code, the next is for your explanation.

Question 3.l. Explain.

Type your answer here, replacing this text.

Question 3.m. Perform a test that all time fixed effects are jointly equal to zero. Remember that

we excluded 1993. What is the result of your test?

This question is for your code, the next is for your explanation.

Question 3.n. Explain.

Type your answer here, replacing this text.