In this assignment, you will practise working with files, building and using dictionaries, designing functions using the Function Design Recipe, reading documentation, and writing unit tests.
A commonly-held belief is that an individual's health is largely influenced by the choices they make. However, there is lots of evidence that health is affected by systemic factors.
Health researchers often study the relationships between an individual's health outcomes and factors related to their physical environment, social and economic situations, and geographic location. Studies such as this one investigate how a particular health outcome (living with hypertension) is tied to a systemic factor (the income level of a country).
In this assignment, you will write code to assist with analysing data on the relationship between hypertension (also known as high blood pressure) and income levels in Toronto neighbourhoods. The data you will work with is real, although we have simplified it somewhat to make this assignment clearer for you.
The data analysis that your code will do will include some statistical analysis that we have not talked about in the course. You do NOT need to understand the underlying statistics to complete this assignment. The code you write will do some simple mathematical operations, like adding up some numbers, or finding ratios using division. We will use Pearson correlation for the more advanced analysis and you will use existing functions that we have imported for you.
You will need to take a look at the examples of these functions in order to figure out what arguments you need to pass to them, and what types of data they return, but you do not need to understand how they work in any detail.
Correlation is a single coefficient expressing the tendency of one set of data to grow linearly, in the same or opposite direction, with another set of data. It is computed by comparing whether points that have been paired between the two sets are similarly greater or less than their set's respective averages.
For example, if we wanted to check whether age is correlated with height for students in the class, we would have two sets of data: birth dates (which we could express as, say, number of weeks old for finer granularity) and heights.
Numbers from each set are ordered in the same way so that each height value corresponds to the age value for the same student. What is nice about the correlation metric we are using is that it is normalised to be between -1 and 1, with these values giving us a nice human interpretation. A value of 1 means that the points make a straight line that rises. In our example, this means that for some increase in age, we have a consistent increase in height. Similarly, a value of -1 is the same relationship but with a flip of direction, where older students would be shorter than younger ones. Finally, a value of 0 would say that there is no consistent increase or decrease in height for a change in age. We will use this to investigate whether low income rates and hypertension rates tend to increase or decrease together.
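To make the description above concrete, here is a minimal sketch of how such a coefficient can be computed by hand. It is for illustration only: in the assignment you will use the functions we have imported for you, and the function name `pearson_correlation` and the age and height values below are made up.

```python
from math import sqrt


def pearson_correlation(xs: list[float], ys: list[float]) -> float:
    """Return the Pearson correlation coefficient of the paired values in
    xs and ys.

    Assumes xs and ys have the same length (at least 2) and that neither
    list contains the same value repeated throughout.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Compare each paired point with its own set's average.
    paired = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Dividing by the spreads normalises the result to lie between -1 and 1.
    return paired / (spread_x * spread_y)


# Hypothetical data: ages in weeks and heights in cm for four students,
# listed in the same student order.
ages = [900.0, 950.0, 1000.0, 1050.0]
taller_with_age = [150.0, 155.0, 160.0, 165.0]
shorter_with_age = [165.0, 160.0, 155.0, 150.0]

rising = pearson_correlation(ages, taller_with_age)     # very close to 1
falling = pearson_correlation(ages, shorter_with_age)   # very close to -1
print(rising, falling)
```

Because of floating-point rounding, the printed values may differ from 1 and -1 in the last few decimal places.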
If you are a statistics person, keep in mind that the learning goals of the assignment are about writing code using what we've learned in the course, not about doing a proper statistical analysis.
This assignment uses data files related to one of the two variables of interest (i.e., hypertension data or income data). The files are CSV (comma separated values) files, where each column in a line is separated by a comma. You can assume there are no commas anywhere else in the files, other than to separate columns, and that any file given is in the correct format. The two file types are described below.
The first row in a neighbourhood hypertension file contains header information, and the remaining rows each contain data relating to hypertension prevalence in a particular Toronto neighbourhood.
Here is a description of the different columns of the dataset. Notice the use of constants and carefully study the starter file constants.py.
Column index | Description |
---|---|
HT_ID_COL | An ID that uniquely identifies each neighbourhood. |
HT_NBH_NAME_COL | The name of the neighbourhood. Neighbourhood names are unique. |
HT_20_44_COL | The number of people aged 20 to 44 with hypertension in the neighbourhood. |
NBH_20_44_COL | The total number of people aged 20 to 44 in the neighbourhood. |
HT_45_64_COL | The number of people aged 45 to 64 with hypertension in the neighbourhood. |
NBH_45_64_COL | The total number of people aged 45 to 64 in the neighbourhood. |
HT_65_UP_COL | The number of people aged 65 and older with hypertension in the neighbourhood. |
NBH_65_UP_COL | The total number of people aged 65 and older in the neighbourhood. |
The first row in a neighbourhood income data file contains header information, and the remaining rows each contain data about low income status.
Here is a description of the different columns of the dataset. Notice the use of constants and carefully study the starter file constants.py.
Column index | Description |
---|---|
LI_ID_COL | An ID that uniquely identifies each neighbourhood. |
LI_NBH_NAME_COL | The name of the neighbourhood. Neighbourhood names are unique. |
POP_COL | The total population in the neighbourhood. |
LI_POP_COL | The number of people in the neighbourhood with low income status. |
Neighbourhood names and ids are the same between our hypertension data files and our low income data files. However, the total population of a neighbourhood can be different between the two data files, as they were collected at different times.
CityData Type
The code you will write for this assignment will build and then use a dictionary that contains hypertension and low income data about neighbourhoods in a city. This section describes the format of that dictionary.
CityData dictionary
Each key in a CityData dictionary is a string representing the name of a neighbourhood. As is necessary for dictionary keys, all neighbourhood names will be unique.
The values in a CityData dictionary are dictionaries containing information about a neighbourhood. These neighbourhood data dictionaries contain specific keys that label a neighbourhood's data.
A dictionary that is a value in a dictionary of type CityData has the following key/value pairs. Notice the use of constants and carefully study the starter file constants.py.
Key | (Type) Value |
---|---|
ID | (int) The id number of this neighbourhood. |
TOTAL | (int) The total population of this neighbourhood, as given in the low income data file. |
LOW_INCOME | (int) The number of people in this neighbourhood who are classified as low income. |
HT | (list[int]) A list of the hypertension data of this neighbourhood. This list will have length exactly 6, and the values will be the numbers from columns HT_20_44_COL, NBH_20_44_COL, HT_45_64_COL, NBH_45_64_COL, HT_65_UP_COL, and NBH_65_UP_COL, stored at indices HT_20_44_IDX, NBH_20_44_IDX, HT_45_64_IDX, NBH_45_64_IDX, HT_65_UP_IDX, and NBH_65_UP_IDX of the list, respectively. See the section above on neighbourhood hypertension data files. |
CityData dictionary
The following is an example of a CityData dictionary. We have also provided this dictionary for you to use in your docstring examples and other testing in the starter code file. Note that we have formatted the dictionary below for easier reading; you will not see this formatting in your code.
{'West Humber-Clairville': {
    'id': 1,
    'hypertension': [703, 13291, 3741, 9663, 3959, 5176],
    'total': 33230,
    'low_income': 5950},
 'Mount Olive-Silverstone-Jamestown': {
    'id': 2,
    'hypertension': [789, 12906, 3578, 8815, 2927, 3902],
    'total': 32940,
    'low_income': 9690},
 'Thistletown-Beaumond Heights': {
    'id': 3,
    'hypertension': [220, 3631, 1047, 2829, 1349, 1767],
    'total': 10365,
    'low_income': 2005},
 'Rexdale-Kipling': {
    'id': 4,
    'hypertension': [201, 3669, 1134, 3229, 1393, 1854],
    'total': 10540,
    'low_income': 2140},
 'Elms-Old Rexdale': {
    'id': 5,
    'hypertension': [176, 3353, 1040, 2842, 948, 1322],
    'total': 9460,
    'low_income': 2315}}
The sample CityData dictionary above represents hypertension and low income data for five neighbourhoods: West Humber-Clairville, Mount Olive-Silverstone-Jamestown, Thistletown-Beaumond Heights, Rexdale-Kipling, and Elms-Old Rexdale.
Let's take a closer look at the data for Elms-Old Rexdale. This neighbourhood is represented by the key/value pair where the key is 'Elms-Old Rexdale'. The id of this neighbourhood is 5. The hypertension data for this neighbourhood is as follows: 3353 people are between the ages of 20 and 44, 176 of whom have hypertension. There are 2842 people between the ages of 45 and 64, 1040 of whom have hypertension, and there are 1322 people aged 65 and up, 948 of whom have hypertension. The low income data for this neighbourhood is that 2315 people are classified as low income, from a total population of 9460 people.
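The walkthrough of Elms-Old Rexdale can be mirrored in code. The sketch below is illustrative only: the constant values are assumptions inferred from the sample dictionary and from the order of the hypertension list, and the real definitions are the ones in constants.py.

```python
# Assumed constant values; check constants.py for the real definitions.
ID = 'id'
HT = 'hypertension'
TOTAL = 'total'
LOW_INCOME = 'low_income'
HT_20_44_IDX, NBH_20_44_IDX = 0, 1
HT_45_64_IDX, NBH_45_64_IDX = 2, 3
HT_65_UP_IDX, NBH_65_UP_IDX = 4, 5

# One entry copied from the sample CityData dictionary.
city_data = {
    'Elms-Old Rexdale': {
        'id': 5,
        'hypertension': [176, 3353, 1040, 2842, 948, 1322],
        'total': 9460,
        'low_income': 2315,
    }
}

nbh = city_data['Elms-Old Rexdale']
print(nbh[ID])                                         # 5
print(nbh[HT][HT_20_44_IDX], nbh[HT][NBH_20_44_IDX])   # 176 3353
print(nbh[HT][HT_45_64_IDX], nbh[HT][NBH_45_64_IDX])   # 1040 2842
print(nbh[HT][HT_65_UP_IDX], nbh[HT][NBH_65_UP_IDX])   # 948 1322
print(nbh[LOW_INCOME], nbh[TOTAL])                     # 2315 9460
```

Using the named constants rather than literal keys and indices keeps the code readable and means it still works if the values in constants.py differ from the ones assumed here.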
Note that the totals do not match between the low income and the hypertension data — this is because the low income data was collected before the hypertension data, and the size of the neighbourhoods changed. For the purposes of this assignment, we will assume the collection of these two datasets is close enough in time to compare them to each other. You do not need to do anything about these differing totals, other than to make sure you are using the correct total when computing rates, as described later.
This section describes the process of age standardisation that we will use in this assignment to perform a more accurate analysis. Note that we have given you a function that computes the age standardised rate from the raw rate (described in Task 3). This section is for your information only; we have already implemented this for you.
Our dataset will let us calculate the rate of hypertension in each Toronto neighbourhood. One complicating factor is that different neighbourhoods have different age demographics. For example, the Henry Farm neighbourhood has a significantly lower proportion of 65+ residents than Hillcrest Village. Because people aged 65+ have a higher overall rate of hypertension, this demographic difference alone would cause us to expect a difference in the overall hypertension rate between these neighbourhoods.
Because we care about the impact of low income status on hypertension rates, we want to remove the impact of different age demographics between the neighbourhoods. To do so, we will use a process called age standardisation to calculate an adjusted hypertension rate that ignores differences in ages. This process involves the following steps for each neighbourhood:
Age Group | Population |
---|---|
20-44 | 11,199,830 |
45-64 | 5,365,865 |
65+ | 3,169,970 |
Total (20+) | 19,735,665 |
Summing across the three age groups gives 2,239,966 + 1,609,760 + 2,092,180 = 5,941,906.
Dividing by the Total (20+) population from the table gives 5,941,906 / 19,735,665 x 100, or approximately 30%. This percentage is the age standardised rate for the neighbourhood.
If you are interested, you can read more about age standardised rates here.
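The arithmetic above can be sketched in Python. This is not the function provided in the starter code: the name `age_standardised_rate` is hypothetical, and the example rates of 20%, 30%, and 66% are assumed because, when applied to the populations in the table, they reproduce the three numbers in the sum to the nearest whole person.

```python
# Population figures copied from the table above, keyed by age group.
TABLE_POPULATION = {
    '20-44': 11_199_830,
    '45-64': 5_365_865,
    '65+': 3_169_970,
}


def age_standardised_rate(rates: dict[str, float]) -> float:
    """Return the age standardised rate, as a percentage, given the
    hypertension rate (a fraction between 0 and 1) for each age group.
    """
    # Apply each age group's rate to the table population for that group.
    expected_cases = sum(
        rates[group] * TABLE_POPULATION[group] for group in TABLE_POPULATION
    )
    # Divide by the Total (20+) population and express as a percentage.
    return expected_cases / sum(TABLE_POPULATION.values()) * 100


# Assumed example rates for one neighbourhood.
example_rates = {'20-44': 0.20, '45-64': 0.30, '65+': 0.66}
rate = age_standardised_rate(example_rates)
print(round(rate, 1))  # 30.1
```

Because every neighbourhood's rates are applied to the same table populations, the resulting percentages can be compared without differences in age demographics getting in the way.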
In the starter code file a3.py, follow the Function Design Recipe to complete the functions described below.
You will need helper functions (i.e., functions you define yourself to be called in other functions) for some of the required functions, but likely not for all of them. Helper functions also require complete docstrings with doctests. We strongly recommend you also follow any suggestions about helper functions in the table below; we give you these hints to make your programming task easier.
Some indicators that you should consider writing a new helper function, or using something you've already written as a helper are:
For each of the functions below, other than the file reading functions in Task 1, write at least two examples in the docstring. You can use the provided SAMPLE_DATA dictionary, and you should also create another small CityData dictionary for examples and testing. If your helper function takes an open file as an argument, you do NOT need to write any examples in that function's docstring. Otherwise, for any helper functions you add, write at least two examples in the docstring.
Your functions should not mutate their arguments, unless the description says that is what they do.
Assume the following about the data:
CityData dictionary.
The starter code contains constants in the file constants.py that you should use in your solution for the list indices and key identifiers for the CityData dictionary as well as the column numbers for the input files. You may add other constants if you wish, but DO NOT place them in the file constants.py: instead put them in the a3.py file.
In this task, you will write functions that read in files and build the dictionary of neighbourhood data. You will write two functions — one that adds hypertension data to a dictionary, and one that adds low income data. You will almost certainly also need to define one or more helper functions to help you solve this task.
These functions will be used to build a CityData dictionary. However, the dictionary that is passed to the functions may not yet contain all of the data.
To illustrate this, we have provided two small data files. After passing the same dictionary to both functions with each of those small files, the dictionary should be a CityData dictionary that contains the same information as the provided SAMPLE_DATA dictionary. Using the small hypertension file and an empty dictionary as arguments to get_hypertension_data, the result should be that the dictionary now contains the hypertension data as in SAMPLE_DATA, but not the low income data.
{'West Humber-Clairville': {'id': 1, 'hypertension': [703, 13291, 3741, 9663, 3959, 5176]},
 'Mount Olive-Silverstone-Jamestown': {'id': 2, 'hypertension': [789, 12906, 3578, 8815, 2927, 3902]},
 'Thistletown-Beaumond Heights': {'id': 3, 'hypertension': [220, 3631, 1047, 2829, 1349, 1767]},
 'Rexdale-Kipling': {'id': 4, 'hypertension': [201, 3669, 1134, 3229, 1393, 1854]},
 'Elms-Old Rexdale': {'id': 5, 'hypertension': [176, 3353, 1040, 2842, 948, 1322]}}
Similarly, using the small low income file and an empty dictionary as arguments to get_low_income_data, the result should be that the dictionary now contains the low income data as in SAMPLE_DATA, but not the hypertension data.
{'West Humber-Clairville': {'id': 1, 'total': 33230, 'low_income': 5950},
 'Mount Olive-Silverstone-Jamestown': {'id': 2, 'total': 32940, 'low_income': 9690},
 'Thistletown-Beaumond Heights': {'id': 3, 'total': 10365, 'low_income': 2005},
 'Rexdale-Kipling': {'id': 4, 'total': 10540, 'low_income': 2140},
 'Elms-Old Rexdale': {'id': 5, 'total': 9460, 'low_income': 2315}}
A complete CityData dictionary will have been passed to both functions. See the sample usage at the end of the starter code file for an example of how both functions are used to build a CityData dictionary.
Note: While this is the first task, it is not necessarily the easiest. If you are stuck while working on this task, we suggest moving on to other tasks and coming back to this later.
Recall that TextIO as the parameter type means the file is already open.
Function name: (Parameter types) -> Return type | Full Description (paraphrase to get a proper docstring description) |
---|---|
get_hypertension_data :(dict, TextIO) -> None | The first parameter is a dictionary representing hypertension and/or low income data for a neighbourhood and the second parameter is a hypertension data file that is open for reading. This function should modify the dictionary so that it contains the hypertension data in the file. If a neighbourhood with data in the file is already in the dictionary then its hypertension data should be updated. Otherwise it should be added to the dictionary with its hypertension data. After this function is called, the dictionary should contain key/value pairs whose keys are the names of every neighbourhood in the hypertension data file, and whose values are dictionaries which contain at least the keys |
get_low_income_data :(dict, TextIO) -> None | The first parameter is a dictionary representing hypertension and/or low income data for a neighbourhood and the second parameter is a low income data file that is open for reading. This function should modify the dictionary so that it contains the low income data in the file. If a neighbourhood with data in the file is already in the dictionary then its low income data should be updated. Otherwise it should be added to the dictionary with its low income data. After this function is called, the dictionary should contain key/value pairs whose keys are the names of every neighbourhood in the low income data file, and whose values are dictionaries which contain at least the keys |
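
As a rough illustration of the update-or-add behaviour described in the table, here is a minimal sketch for the hypertension file. It rests on several assumptions: the column positions and key values below are guesses (use the constants in constants.py instead), the six count columns are assumed to be adjacent, the function name `add_hypertension_data` is hypothetical, and the header text in the sample file is a placeholder. An `io.StringIO` object stands in for a file that is already open for reading.

```python
from io import StringIO
from typing import TextIO

# Assumed values; the real ones are defined in constants.py.
ID = 'id'
HT = 'hypertension'
HT_ID_COL = 0
HT_NBH_NAME_COL = 1
HT_20_44_COL = 2    # first of the six count columns
NBH_65_UP_COL = 7   # last of the six count columns


def add_hypertension_data(data: dict, ht_file: TextIO) -> None:
    """Modify data so that it contains the hypertension data in ht_file,
    updating neighbourhoods already in data and adding any that are not.
    """
    ht_file.readline()  # skip the header row
    for line in ht_file:
        columns = line.strip().split(',')
        name = columns[HT_NBH_NAME_COL]
        if name not in data:
            data[name] = {}
        data[name][ID] = int(columns[HT_ID_COL])
        counts = columns[HT_20_44_COL:NBH_65_UP_COL + 1]
        data[name][HT] = [int(value) for value in counts]


# A made-up file with one data row taken from the sample data.
sample_file = StringIO(
    'id,name,ht_20_44,nbh_20_44,ht_45_64,nbh_45_64,ht_65_up,nbh_65_up\n'
    '5,Elms-Old Rexdale,176,3353,1040,2842,948,1322\n'
)

# The dictionary already holds low income data for this neighbourhood.
data = {'Elms-Old Rexdale': {'id': 5, 'total': 9460, 'low_income': 2315}}
add_hypertension_data(data, sample_file)
print(data)
```

Note that the existing low income keys are left in place, which is why the order in which the two functions are called does not matter.
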
Function name: (Parameter types) -> Return type | Full Description (paraphrase to get a proper docstring description) |
---|---|
get_bigger_neighbourhood :(CityData, str, str) -> str | The first parameter is a Assume that the two neighbourhood names are different. If a name is not in the dictionary, assume it has a population of 0. If the two neighbourhoods are the same size, return the first name (i.e., the leftmost one in the parameters list, not alphabetically). |
get_high_hypertension_rate :(CityData, float) -> list[tuple[str, float]] | The first parameter is a Compute the overall hypertension rate for a neighbourhood by dividing the total number of people with hypertension by the total number of adults in the neighbourhood. You may assume that no neighbourhood has 0 population. If this function was called with the provided |
get_ht_to_low_income_ratios :(CityData) -> dict[str, float] | The parameter is a For the denominators for each rate, use the total number of people as given in the corresponding data file. That is, for calculating the low income rate, use the total population in the neighbourhood from the low-income data file; and for the hypertension rate, use the sum of the total people in all three age groups in the hypertension data. You may assume that no neighbourhood has 0 population. For example, if this function was called with the provided You will find that writing a helper function would be useful here. |
calculate_ht_rates_by_age_group :(CityData, str) -> tuple[float, float, float] | The first parameter is a For example, consider the neighbourhood with the name You may assume that no neighbourhood has a 0 population. Notice that this function is used as a helper in the |
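
The two kinds of rate mentioned in the table use different denominators, which the following sketch tries to make explicit. The helper names `overall_ht_rate` and `low_income_rate` are hypothetical, and the constant values are assumptions; use the ones in constants.py.

```python
# Assumed constant values; check constants.py for the real definitions.
HT = 'hypertension'
TOTAL = 'total'
LOW_INCOME = 'low_income'
HT_20_44_IDX, NBH_20_44_IDX = 0, 1
HT_45_64_IDX, NBH_45_64_IDX = 2, 3
HT_65_UP_IDX, NBH_65_UP_IDX = 4, 5


def overall_ht_rate(nbh_data: dict) -> float:
    """Return the number of people with hypertension divided by the number
    of adults, both taken from the hypertension data.
    """
    ht = nbh_data[HT]
    with_ht = ht[HT_20_44_IDX] + ht[HT_45_64_IDX] + ht[HT_65_UP_IDX]
    adults = ht[NBH_20_44_IDX] + ht[NBH_45_64_IDX] + ht[NBH_65_UP_IDX]
    return with_ht / adults


def low_income_rate(nbh_data: dict) -> float:
    """Return the number of low income people divided by the total
    population, both taken from the low income data.
    """
    return nbh_data[LOW_INCOME] / nbh_data[TOTAL]


elms = {'id': 5, 'hypertension': [176, 3353, 1040, 2842, 948, 1322],
        'total': 9460, 'low_income': 2315}
print(overall_ht_rate(elms))   # 2164 / 7517, roughly 0.288
print(low_income_rate(elms))   # 2315 / 9460, roughly 0.245
```
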
Function name: (Parameter types) -> Return type | Full Description (paraphrase to get a proper docstring description) |
---|---|
get_correlation :(CityData) -> float | The parameter for this function is a To complete this function, you will need to use the You will need to use the provided function |
Function name: (Parameter types) -> Return type | Full Description (paraphrase to get a proper docstring description) |
---|---|
order_by_ht_rate :(CityData) -> list[str] | The parameter is a Assume every neighbourhood has a unique hypertension rate; i.e., that there are no ties. For example, if this function is called with the There are multiple ways to solve this problem. You may choose to solve this problem by writing your own sorting code, but you do not have to do this. You can also use |
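
If you choose not to write your own sorting code, one possible approach is Python's built-in `sorted` with a `key` function. The sketch below is not the required solution: it assumes the raw overall hypertension rate and a lowest-to-highest ordering purely for illustration, and the names `raw_ht_rate` and `names_by_raw_rate` are hypothetical.

```python
HT = 'hypertension'  # assumed key value; see constants.py

# Hypertension lists copied from the sample data.
sample = {
    'West Humber-Clairville': {HT: [703, 13291, 3741, 9663, 3959, 5176]},
    'Mount Olive-Silverstone-Jamestown': {HT: [789, 12906, 3578, 8815, 2927, 3902]},
    'Thistletown-Beaumond Heights': {HT: [220, 3631, 1047, 2829, 1349, 1767]},
    'Rexdale-Kipling': {HT: [201, 3669, 1134, 3229, 1393, 1854]},
    'Elms-Old Rexdale': {HT: [176, 3353, 1040, 2842, 948, 1322]},
}


def raw_ht_rate(city_data: dict, name: str) -> float:
    """Return the raw overall hypertension rate for the neighbourhood name."""
    ht = city_data[name][HT]
    # Even indices hold counts with hypertension; odd indices hold totals.
    return sum(ht[0::2]) / sum(ht[1::2])


def names_by_raw_rate(city_data: dict) -> list[str]:
    """Return the neighbourhood names ordered from lowest to highest raw rate."""
    return sorted(city_data, key=lambda name: raw_ht_rate(city_data, name))


ordered = names_by_raw_rate(sample)
print(ordered)
```

`sorted` returns a new list and leaves the dictionary untouched, so the argument is not mutated.
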
unittest)
Write and submit a unittest file for the get_bigger_neighbourhood function. We have provided starter code in the test_a3.py file. We have included one test that you can use as a template to write your other test methods. For each test method, include a brief docstring description specifying what is being tested. Do not write examples in the docstrings. Your set of tests should all pass on correct code, and your tests should be thorough enough that at least one of them will fail on a buggy version of the function. There is no required number of tests; we will mark your tests by running them on the correct code as well as several buggy versions.
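As a rough idea of what such test methods can look like, here is a minimal sketch. It is not the provided template: the stand-in implementation of `get_bigger_neighbourhood` exists only so that the sketch runs on its own (in test_a3.py you would test the function from a3.py), and it assumes that size means the value stored under the `'total'` key.

```python
import unittest

TOTAL = 'total'  # assumed key value; see constants.py


def get_bigger_neighbourhood(city_data: dict, nbh1: str, nbh2: str) -> str:
    """Stand-in implementation, included only so the tests below can run."""
    size1 = city_data[nbh1][TOTAL] if nbh1 in city_data else 0
    size2 = city_data[nbh2][TOTAL] if nbh2 in city_data else 0
    return nbh1 if size1 >= size2 else nbh2


# A small made-up dictionary holding only the data these tests need.
SMALL_DATA = {
    'Big': {TOTAL: 300},
    'Small': {TOTAL: 100},
    'Also Small': {TOTAL: 100},
}


class TestGetBiggerNeighbourhood(unittest.TestCase):

    def test_first_bigger(self):
        """Test that the first name is returned when it is bigger."""
        actual = get_bigger_neighbourhood(SMALL_DATA, 'Big', 'Small')
        self.assertEqual(actual, 'Big')

    def test_second_bigger(self):
        """Test that the second name is returned when it is bigger."""
        actual = get_bigger_neighbourhood(SMALL_DATA, 'Small', 'Big')
        self.assertEqual(actual, 'Big')

    def test_same_size(self):
        """Test that the first name is returned when the sizes are equal."""
        actual = get_bigger_neighbourhood(SMALL_DATA, 'Small', 'Also Small')
        self.assertEqual(actual, 'Small')

    def test_name_not_in_data(self):
        """Test that a name missing from the data is treated as size 0."""
        actual = get_bigger_neighbourhood(SMALL_DATA, 'Missing', 'Small')
        self.assertEqual(actual, 'Small')
```

Each method targets one behaviour, so a buggy version that, for example, always returns the first name would fail `test_second_bigger` while passing the others.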
Download a3.zip, which contains starter code (a3.py and test_a3.py), the checker (a3_checker.py together with the helper file checker.py and folder pyta), and two sizes of each type of data file.
These are the aspects of your work that will be marked for Assignment 3:
The very last thing you do before submitting should be to run the checker program one last time.
Otherwise, you could make a small error in your final changes before submitting that causes your code to receive zero for correctness.
Submit a3.py and test_a3.py on MarkUs by following the instructions on the course website. Remember that spelling of filenames, including case, counts: your file must be named exactly as above.