library(reticulate)
#py_install("pandas")
#py_install("polars")
#py_install("seaborn")Final Data Preparation
Introduction
Data for this project was processed from the original download in .gz format. This script makes final changes related to formatting the data for the table, including setting adding metadata columns.
Preliminaries
Load reticulated so I can work in Python while in RStudio:
Load the Python packages:
#import polars as pl
import pandas as pd
import numpy as np
from great_tables import GT, md, html, style, locLoad the distance matrix. The _pd suffix is a reminder that I’m working in Pandas. Eventually I hope to port as much of these final tasks to Polars.
dist_mat_pd = pd.read_csv("distance_matrix.csv")Finalizing dataframe
Additional meta-data
The dataframe produced from the previous script is a square matrix with two columns of meta-data appended on the far left. (sup_pop = super population, pop = population)
dist_mat_pd.head() sup_pop pop ACB ASW ... GIH ITU PJL STU
0 AFR ACB 0.000006 4.319538 ... 13.771366 13.447868 13.840908 13.605249
1 AFR ASW 4.319538 0.000000 ... 11.767316 11.552621 11.581944 11.916708
2 AFR ESN 3.248675 5.605990 ... 15.177879 14.626126 15.380010 14.665675
3 AFR GWD 3.705924 5.986606 ... 14.915595 14.495067 14.798274 14.546724
4 AFR LWK 3.790766 5.945784 ... 14.968611 14.537910 15.213237 14.698901
[5 rows x 28 columns]
The super populations are already in the table, but here they are in a list for reference:
sup_pop_orig = ["AFR","AFR","AFR","AFR","AFR","AFR","AFR",
"AMR","AMR","AMR","AMR",
"EAS","EAS","EAS","EAS","EAS",
"EUR","EUR","EUR","EUR","EUR",
"SAS","SAS","SAS","SAS","SAS"]I’ll create a column labeling super populations in the margin. This is a workaround b/c I’m having issues with row stubs.
sup_pop_alt = ["","","","AFR","","","",
"","AMR","","",
"","EAS","EAS","","",
"","","EUR","","",
"","","SAS","",""]Expanded background information about each population is shown below. I’ll edit this by hand to make further columns.
# full population details - for notes columns
pop_details = ['African Caribbean in Barbados', #
'African Ancestry in SW USA',
'Esan in Nigeria',
'Gambian in Western Division – Mandinka',
'Luhya in Webuye, Kenya',
'Mende in Sierra Leone',
'Yoruba in Ibadan, Nigeria',
'Colombian in Medellín, Colombia',
'Mexican Ancestry in L.A. CA, USA',
'Peruvian in Lima Peru',
'Puerto Rican in Puerto Rico',
'Chinese Dai in Xishuangbanna, China', #Chin
'Han Chinese in Beijing, China',
'Han Chinese - Southern, China',
'Japanese in Tokyo, Japan',
'Kinh in Ho Chi Minh City, Vietnam',
'Utah Residents (CEPH) with Northern and Western European ancestry',
'Finnish in Finland',
'British from England and Scotland',
'Iberian Populations in Spain',
'Toscani in Italia',
'Bengali in Bangladesh',
'Gujarati Indians in Houston, Texas, USA',
'Indian Telugu in the U.K.',
'Punjabi in Lahore, Pakistan [PJL]',
'Sri Lankan Tamil in the UK']Based on the information from the full population details, I assigned approximate ethnic, cultural, or similar groups that share demographic or cultural connections. These are approximate (I’m not a social scientist) and based only on information provided by the 1000 Genomes project. Feel free to contact me at brouwern at gmail.com with any suggested updates.
ethnic_group = [
'Afro-Caribbean',
'African American',
'Esan',
'Mandinka',
'Luhya',
'Mende',
'Yoruba',
'Colombian',
'Mexican-American',
'Peruvian',
'Puerto Rican',
'Dai', # China
'Han-North',
'Han-South',
'Japanese',
'Kinh', # in Ho Chi Minh City, Vietnam
'Euro-American',
'Finnish',
'British / Scottish',
'Iberian',
'Tuscan',
'Bengali',
'Gujarati', # Indians in Houston, Texas, USA
'Telugu', # in the U.K.
'Punjabi', # in Lahore, Pakistan [PJL]
'Tamil'] #Sri Lankan Tamil in the UKThis next list contains the country the samples were collected in, their historical geographic origins, or where the group has the largest populations. In several cases data were collected “in diaspora” and the geographic location assigned to the origin as reported by the 1000 Genomes Project. Therefore, Indians living in the UK were assigned to India. For African-American and European-Americans, “Africa” and “Europe” were used.
Note that the Dai were sampled from China. Wikipedia states “The Dai people form one of the 56 ethnic groups officially recognized by the People’s Republic of China. By extension, the term can apply to groups in Laos, Vietnam, Thailand and Myanmar when Dai is used to mean specifically Tai Yai, Lue, Chinese Shan, Tai Dam, Tai Khao or even Tai in general.” (en.Wikipedia.org/wiki/Dai_people) I therefore coded them as “SE-Asia”
geographic_sample = [
'Barbados',
'Africa', # sampled in USA
'Nigeria', #Nigeria - 1
'Gambia',
'Kenya',
'Sierra Leone',
'Nigeria', #Nigeria - 2
'Colombia',
'Mexico',
'Peru',
'P. Rico',
'SE-Asia',# Chinese Dai in Xishuangbanna
'China-NE',# Han Chinese - 1; Beijing
'China-SE',# Han Chinese - 2; Southern China
'Japan',
'Vietnam',
'W. Europe', #European Union
'Finland',
'UK',
'Spain',
'Italy',
'Bangladesh',
'Gujarat',#India Gujarat, sampled in UK
'Andhra Pradesh',#India Telugu, sampled in UK
'Punjab',# Punjab, Pakistan
'Sri Lanka']Finally, this list contains the country where a sample was actually obtained from.
# country where sample was taken
country_sample = [
'Barbados',
'USA', # African American
'Nigeria', #Nigeria - 1
'Gambia',
'Kenya',
'Sierra Leone',
'Nigeria', #Nigeria - 2
'Colombia',
'Mexico',
'Peru',
'Puerto Rico',
'China',# Chinese Dai in Xishuangbanna
'China',# Han Chinese - 1; Beijing
'China',# Han Chinese - 2; Southern China
'Japan',
'Vietnam',
'USA', #European Union
'Finland',
'UK',
'Spain',
'Italy',
'Bangladesh',
'UK',#India Gujarat, sampled in UK
'UK',#India Telugu, sampled in UK
'Pakistan',# Punjab
'UK']Names of flag files for use as icons.
For African Americans, the Pan-Africa frag is used, though there is likely a better choice.
Unresolved:
- For European Americans the European Union flag could be used used.
- Chinese Dai?
- Gujarat - Use state flag?
- Telugu - use state flag?
The Chin
# for inserting flags
icon = ['barbados.png',
'pan_africa_flag.png', # African Union? Pan-Africa
'nigeria.png', #Nigeria
'gambia.png',
'kenya.png',
'sierra_leone.png',
'nigeria.png', #Nigeria
'colombia.png',
'mexico.png',
'peru.png',
'puerto_rico.png',
'china.png',# Chinese Dai in Xishuangbanna
'china.png',# Han Chinese in Beijing
'china.png',# Han Chinese - Southern, China
'japan.png',
'vietnam.png',
'USA.png', #European Union
'finland.png',
'UK.png',
'spain.png',
'italy.png',
'bangladesh.png',
'india.png',#india Gujarat
'india.png',#India Telugu
'pakistan.png',#India Punjab
'sri_lanka.png']Finalizing distance matrix
Stored and Remove the meta-data columns.
# make copy
sup_pop_and_pop = dist_mat_pd[["sup_pop","pop"]]
# remove metadata
dist_mat_only = dist_mat_pd.drop(["sup_pop","pop"], axis = 1)
# save new column names
cols_dist_mat_only = dist_mat_only.columnsSet the columns to None. In case numbers are shown on the final table, I want the diagonal to be blank, which can be done in great_tables. This could also be done with numpy.diagonal, but in this case I knew that the diagonal elements are all 0 (or very very close).
#dist_mat_only[dist_mat_only< 0.01] = NoneScaling:
In order to increase the contrast based on color all elements are scaled relative to the maximum value observed
# find min/max for scaling
global_max = dist_mat_only.max().max()
dist_mat_only = (dist_mat_only)/global_max*100Work with upper triangle
Set off diagonal to 0
dist_mat_tri0 = np.tril(dist_mat_only)Reformat to dataframe
dist_mat_only2 = pd.DataFrame.from_records(dist_mat_tri0)Set values to None
dist_mat_only2[dist_mat_only2< 0.001] = NoneAdd the columns back
dist_mat_only2.columns = cols_dist_mat_onlyLooks like numpy un-rounded us
dist_mat_only2 = dist_mat_only2.round()dist_mat_only2.head() ACB ASW ESN GWD LWK MSL YRI ... IBS TSI BEB GIH ITU PJL STU
0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
1 18.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
2 14.0 23.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
3 15.0 25.0 19.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
4 16.0 25.0 16.0 21.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
[5 rows x 26 columns]
location_plus_space =[i + j for i, j in zip(geographic_sample,
np.repeat("<br>",dist_mat_pd.shape[0]))]
location_plus_code = [i + j for i, j in zip(location_plus_space,
cols_dist_mat_only)]Add meta-data
dist_mat_only2["oneKG_pop"] = cols_dist_mat_only
dist_mat_only2["location"] = geographic_sample #
dist_mat_only2["location_combo"] = location_plus_code #
dist_mat_only2["pop_details"] = pop_details
dist_mat_only2["ethnic_cultural_group"] = ethnic_group
dist_mat_only2["sup_pop"] = sup_pop_orig
dist_mat_only2["sup_pop_alt"] = sup_pop_alt
dist_mat_only2["icon"] = iconInserting empty rows
I experimented with this but did not use in the final table
filler_row = pd.DataFrame(np.repeat(None,dist_mat_only2.shape[1])).T
filler_row.columns = dist_mat_pd.columns
dist_mat_only3 = pd.concat([dist_mat_only2.iloc[:7],
filler_row,
dist_mat_only2.iloc[7:11],
filler_row,
dist_mat_only2.iloc[11:15],
filler_row,
dist_mat_only2.iloc[16:20],
filler_row,
dist_mat_only2.iloc[21:]
]).reset_index(drop=True)great_tables groupings etc.
The series below define key groups of rows / columns.
# columns that will be numeric in table
num_cols = ['ACB','ASW', 'ESN', 'GWD', 'LWK', 'MSL', 'YRI',
'CLM', 'MXL', 'PEL', 'PUR',
'CDX', 'CHB', 'CHS', 'JPT', 'KHV',
'CEU','FIN','GBR','IBS','TSI',
'BEB', 'GIH', 'ITU', 'PJL', 'STU']
# super population groups
## for creating spanners in table
africa_AFR = ['ACB','ASW','ESN','GWD','LWK','MSL','YRI']
americas_AMR = ['CLM','MXL','PEL','PUR']
east_asia_EAS = ['CDX','CHB','CHS','JPT','KHV']
europe_EUR = ['CEU','FIN','GBR','IBS','TSI']
southasia_SAS = ['BEB','GIH','ITU','PJL','STU']Preview the full table with bare call to GT()
GT(dist_mat_only2)| ACB | ASW | ESN | GWD | LWK | MSL | YRI | CLM | MXL | PEL | PUR | CDX | CHB | CHS | JPT | KHV | CEU | FIN | GBR | IBS | TSI | BEB | GIH | ITU | PJL | STU | oneKG_pop | location | location_combo | pop_details | ethnic_cultural_group | sup_pop | sup_pop_alt | icon |
| ACB | Barbados | Barbados ACB |
African Caribbean in Barbados | Afro-Caribbean | AFR | barbados.png | |||||||||||||||||||||||||||
| 18.0 | ASW | Africa | Africa ASW |
African Ancestry in SW USA | African American | AFR | pan_africa_flag.png | ||||||||||||||||||||||||||
| 14.0 | 23.0 | ESN | Nigeria | Nigeria ESN |
Esan in Nigeria | Esan | AFR | nigeria.png | |||||||||||||||||||||||||
| 15.0 | 25.0 | 19.0 | GWD | Gambia | Gambia GWD |
Gambian in Western Division – Mandinka | Mandinka | AFR | AFR | gambia.png | |||||||||||||||||||||||
| 16.0 | 25.0 | 16.0 | 21.0 | LWK | Kenya | Kenya LWK |
Luhya in Webuye, Kenya | Luhya | AFR | kenya.png | |||||||||||||||||||||||
| 17.0 | 25.0 | 18.0 | 17.0 | 19.0 | MSL | Sierra Leone | Sierra Leone MSL |
Mende in Sierra Leone | Mende | AFR | sierra_leone.png | ||||||||||||||||||||||
| 16.0 | 24.0 | 15.0 | 19.0 | 19.0 | 16.0 | YRI | Nigeria | Nigeria YRI |
Yoruba in Ibadan, Nigeria | Yoruba | AFR | nigeria.png | |||||||||||||||||||||
| 63.0 | 53.0 | 70.0 | 67.0 | 69.0 | 72.0 | 71.0 | CLM | Colombia | Colombia CLM |
Colombian in Medellín, Colombia | Colombian | AMR | colombia.png | ||||||||||||||||||||
| 61.0 | 54.0 | 68.0 | 66.0 | 67.0 | 70.0 | 69.0 | 22.0 | MXL | Mexico | Mexico MXL |
Mexican Ancestry in L.A. CA, USA | Mexican-American | AMR | AMR | mexico.png | ||||||||||||||||||
| 64.0 | 58.0 | 71.0 | 68.0 | 69.0 | 72.0 | 71.0 | 28.0 | 15.0 | PEL | Peru | Peru PEL |
Peruvian in Lima Peru | Peruvian | AMR | peru.png | ||||||||||||||||||
| 60.0 | 49.0 | 67.0 | 64.0 | 67.0 | 69.0 | 68.0 | 14.0 | 27.0 | 34.0 | PUR | P. Rico | P. Rico PUR |
Puerto Rican in Puerto Rico | Puerto Rican | AMR | puerto_rico.png | |||||||||||||||||
| 77.0 | 77.0 | 81.0 | 81.0 | 79.0 | 83.0 | 84.0 | 68.0 | 52.0 | 56.0 | 71.0 | CDX | SE-Asia | SE-Asia CDX |
Chinese Dai in Xishuangbanna, China | Dai | EAS | china.png | ||||||||||||||||
| 77.0 | 77.0 | 82.0 | 81.0 | 80.0 | 83.0 | 84.0 | 69.0 | 53.0 | 57.0 | 72.0 | 16.0 | CHB | China-NE | China-NE CHB |
Han Chinese in Beijing, China | Han-North | EAS | EAS | china.png | ||||||||||||||
| 77.0 | 76.0 | 82.0 | 81.0 | 79.0 | 83.0 | 83.0 | 67.0 | 51.0 | 56.0 | 70.0 | 11.0 | 8.0 | CHS | China-SE | China-SE CHS |
Han Chinese - Southern, China | Han-South | EAS | EAS | china.png | |||||||||||||
| 82.0 | 83.0 | 86.0 | 85.0 | 84.0 | 88.0 | 88.0 | 75.0 | 59.0 | 63.0 | 79.0 | 21.0 | 16.0 | 16.0 | JPT | Japan | Japan JPT |
Japanese in Tokyo, Japan | Japanese | EAS | japan.png | |||||||||||||
| 75.0 | 74.0 | 79.0 | 78.0 | 77.0 | 81.0 | 81.0 | 66.0 | 51.0 | 56.0 | 69.0 | 8.0 | 14.0 | 10.0 | 21.0 | KHV | Vietnam | Vietnam KHV |
Kinh in Ho Chi Minh City, Vietnam | Kinh | EAS | vietnam.png | ||||||||||||
| 77.0 | 66.0 | 83.0 | 81.0 | 84.0 | 86.0 | 86.0 | 27.0 | 43.0 | 48.0 | 25.0 | 87.0 | 88.0 | 86.0 | 95.0 | 85.0 | CEU | W. Europe | W. Europe CEU |
Utah Residents (CEPH) with Northern and Western European ancestry | Euro-American | EUR | USA.png | |||||||||||
| 82.0 | 72.0 | 88.0 | 85.0 | 89.0 | 91.0 | 91.0 | 37.0 | 52.0 | 54.0 | 38.0 | 93.0 | 95.0 | 93.0 | 100.0 | 92.0 | 18.0 | FIN | Finland | Finland FIN |
Finnish in Finland | Finnish | EUR | finland.png | ||||||||||
| 76.0 | 65.0 | 82.0 | 79.0 | 83.0 | 85.0 | 84.0 | 27.0 | 43.0 | 46.0 | 26.0 | 88.0 | 89.0 | 87.0 | 95.0 | 86.0 | 9.0 | 16.0 | GBR | UK | UK GBR |
British from England and Scotland | British / Scottish | EUR | EUR | UK.png | ||||||||
| 69.0 | 58.0 | 76.0 | 73.0 | 76.0 | 78.0 | 77.0 | 17.0 | 31.0 | 39.0 | 14.0 | 75.0 | 76.0 | 74.0 | 83.0 | 73.0 | 18.0 | 33.0 | 21.0 | IBS | Spain | Spain IBS |
Iberian Populations in Spain | Iberian | EUR | spain.png | ||||||||
| 69.0 | 58.0 | 76.0 | 74.0 | 76.0 | 78.0 | 78.0 | 19.0 | 35.0 | 40.0 | 15.0 | 79.0 | 79.0 | 78.0 | 87.0 | 77.0 | 16.0 | 30.0 | 19.0 | 10.0 | TSI | Italy | Italy TSI |
Toscani in Italia | Tuscan | EUR | italy.png | |||||||
| 56.0 | 48.0 | 61.0 | 60.0 | 61.0 | 62.0 | 60.0 | 38.0 | 36.0 | 40.0 | 37.0 | 68.0 | 65.0 | 65.0 | 72.0 | 64.0 | 55.0 | 66.0 | 55.0 | 43.0 | 45.0 | BEB | Bangladesh | Bangladesh BEB |
Bengali in Bangladesh | Bengali | SAS | bangladesh.png | ||||||
| 57.0 | 49.0 | 63.0 | 62.0 | 62.0 | 64.0 | 62.0 | 39.0 | 36.0 | 42.0 | 36.0 | 65.0 | 61.0 | 62.0 | 71.0 | 61.0 | 55.0 | 68.0 | 56.0 | 42.0 | 43.0 | 15.0 | GIH | Gujarat | Gujarat GIH |
Gujarati Indians in Houston, Texas, USA | Gujarati | SAS | india.png | |||||
| 56.0 | 48.0 | 61.0 | 60.0 | 61.0 | 61.0 | 59.0 | 45.0 | 43.0 | 46.0 | 43.0 | 72.0 | 69.0 | 70.0 | 77.0 | 69.0 | 61.0 | 73.0 | 61.0 | 50.0 | 51.0 | 11.0 | 18.0 | ITU | Andhra Pradesh | Andhra Pradesh ITU |
Indian Telugu in the U.K. | Telugu | SAS | SAS | india.png | |||
| 58.0 | 48.0 | 64.0 | 62.0 | 63.0 | 65.0 | 63.0 | 26.0 | 28.0 | 33.0 | 25.0 | 67.0 | 66.0 | 65.0 | 74.0 | 65.0 | 44.0 | 55.0 | 43.0 | 32.0 | 33.0 | 19.0 | 20.0 | 24.0 | PJL | Punjab | Punjab PJL |
Punjabi in Lahore, Pakistan [PJL] | Punjabi | SAS | pakistan.png | |||
| 57.0 | 50.0 | 61.0 | 61.0 | 61.0 | 61.0 | 59.0 | 47.0 | 45.0 | 48.0 | 46.0 | 72.0 | 68.0 | 69.0 | 76.0 | 69.0 | 64.0 | 75.0 | 64.0 | 53.0 | 54.0 | 12.0 | 20.0 | 9.0 | 27.0 | STU | Sri Lanka | Sri Lanka STU |
Sri Lankan Tamil in the UK | Tamil | SAS | sri_lanka.png |
Hidden columns:
- ‘location’
- ‘pop_details’
- ‘sup_pop’ (from sup_pop_orig)
- ‘sup_pop_alt’
Used:
- ‘icon’
- ‘oneKG_pop’
- ‘ethnic_cultural_group’
Preview the table with all major elements (not rendered)
from great_tables import GT, md, html, style, loc
# make my table
(
GT(dist_mat_only2).tab_style(
style=style.text(size = "0"),
locations=loc.body(columns=num_cols)
)
.tab_header(title="Genetic distance between 1000 Genomes populations",
subtitle="Subtitle goes here"
)
.cols_hide(columns=[
"location", previously "country"
"pop_details",
"sup_pop",
"sup_pop_alt"])
.fmt_image("icon", path="icons/")
#.tab_stub(
# rowname_col="row",
# groupname_col="sup_pop"
#)
.tab_spanner(
label="Group",
columns=[
#"sup_pop",
"oneKG_pop",
"icon",
"location"
]
)
.tab_spanner(
label="Africa",
columns=africa_AFR
)
.tab_spanner(
label="Americas",
columns=americas_AMR
)
.tab_spanner(
label="East Asia",
columns=east_asia_EAS
)
.tab_spanner(
label="Western Europe",
columns=europe_EUR
)
.tab_spanner(
label="South Asia",
columns=southasia_SAS
)
.tab_spanner(
label="Genetic Distances",
columns=num_cols
)
.tab_spanner(
label="Notes",
columns=[
"ethnic_cultural_group"#,
#"country_sampling"
]
)
#.cols_move_to_start(columns=[#"sup_pop",
# "location",
# "oneKG_pop",
# "icon"])
#.cols_label(
# #sup_pop = "Super population",
# oneKG_pop = "Population Code",
# #country_sampling = "Sampling location",
# ethnic_cultural_group = "Ethnic / <br> Cultural Group",
# icon = "Geographic origin"
#)
.cols_align(align="center")
.fmt_number(columns=num_cols, decimals=0, use_seps=False)
#.sub_zero(zero_text='nil')
.data_color(
domain=[90, 0],
palette=["white", "orange"],
na_color="white",
)
.tab_source_note(source_note="Add citation")
)Save the final file
# Save final table
dist_mat_only2.to_csv("dist_mat_final.csv", index= None)