library(reticulate)
#py_install("pandas")
#py_install("polars")
#py_install("seaborn")
Final Data Preparation
Introduction
Data for this project was processed from the original download in .gz
format. This script makes final changes related to formatting the data for the table, including setting adding metadata columns.
Preliminaries
Load reticulated so I can work in Python while in RStudio:
Load the Python packages:
#import polars as pl
import pandas as pd
import numpy as np
from great_tables import GT, md, html, style, loc
Load the distance matrix. The _pd suffix is a reminder that I’m working in Pandas. Eventually I hope to port as much of these final tasks to Polars.
= pd.read_csv("distance_matrix.csv") dist_mat_pd
Finalizing dataframe
Additional meta-data
The dataframe produced from the previous script is a square matrix with two columns of meta-data appended on the far left. (sup_pop = super population, pop = population)
dist_mat_pd.head()
sup_pop pop ACB ASW ... GIH ITU PJL STU
0 AFR ACB 0.000006 4.319538 ... 13.771366 13.447868 13.840908 13.605249
1 AFR ASW 4.319538 0.000000 ... 11.767316 11.552621 11.581944 11.916708
2 AFR ESN 3.248675 5.605990 ... 15.177879 14.626126 15.380010 14.665675
3 AFR GWD 3.705924 5.986606 ... 14.915595 14.495067 14.798274 14.546724
4 AFR LWK 3.790766 5.945784 ... 14.968611 14.537910 15.213237 14.698901
[5 rows x 28 columns]
The super populations are already in the table, but here they are in a list for reference:
= ["AFR","AFR","AFR","AFR","AFR","AFR","AFR",
sup_pop_orig "AMR","AMR","AMR","AMR",
"EAS","EAS","EAS","EAS","EAS",
"EUR","EUR","EUR","EUR","EUR",
"SAS","SAS","SAS","SAS","SAS"]
I’ll create a column labeling super populations in the margin. This is a workaround b/c I’m having issues with row stubs.
= ["","","","AFR","","","",
sup_pop_alt "","AMR","","",
"","EAS","EAS","","",
"","","EUR","","",
"","","SAS","",""]
Expanded background information about each population is shown below. I’ll edit this by hand to make further columns.
# full population details - for notes columns
= ['African Caribbean in Barbados', #
pop_details 'African Ancestry in SW USA',
'Esan in Nigeria',
'Gambian in Western Division – Mandinka',
'Luhya in Webuye, Kenya',
'Mende in Sierra Leone',
'Yoruba in Ibadan, Nigeria',
'Colombian in Medellín, Colombia',
'Mexican Ancestry in L.A. CA, USA',
'Peruvian in Lima Peru',
'Puerto Rican in Puerto Rico',
'Chinese Dai in Xishuangbanna, China', #Chin
'Han Chinese in Beijing, China',
'Han Chinese - Southern, China',
'Japanese in Tokyo, Japan',
'Kinh in Ho Chi Minh City, Vietnam',
'Utah Residents (CEPH) with Northern and Western European ancestry',
'Finnish in Finland',
'British from England and Scotland',
'Iberian Populations in Spain',
'Toscani in Italia',
'Bengali in Bangladesh',
'Gujarati Indians in Houston, Texas, USA',
'Indian Telugu in the U.K.',
'Punjabi in Lahore, Pakistan [PJL]',
'Sri Lankan Tamil in the UK']
Based on the information from the full population details, I assigned approximate ethnic, cultural, or similar groups that share demographic or cultural connections. These are approximate (I’m not a social scientist) and based only on information provided by the 1000 Genomes project. Feel free to contact me at brouwern at gmail.com with any suggested updates.
= [
ethnic_group 'Afro-Caribbean',
'African American',
'Esan',
'Mandinka',
'Luhya',
'Mende',
'Yoruba',
'Colombian',
'Mexican-American',
'Peruvian',
'Puerto Rican',
'Dai', # China
'Han-North',
'Han-South',
'Japanese',
'Kinh', # in Ho Chi Minh City, Vietnam
'Euro-American',
'Finnish',
'British / Scottish',
'Iberian',
'Tuscan',
'Bengali',
'Gujarati', # Indians in Houston, Texas, USA
'Telugu', # in the U.K.
'Punjabi', # in Lahore, Pakistan [PJL]
'Tamil'] #Sri Lankan Tamil in the UK
This next list contains the country the samples were collected in, their historical geographic origins, or where the group has the largest populations. In several cases data were collected “in diaspora” and the geographic location assigned to the origin as reported by the 1000 Genomes Project. Therefore, Indians living in the UK were assigned to India. For African-American and European-Americans, “Africa” and “Europe” were used.
Note that the Dai were sampled from China. Wikipedia states “The Dai people form one of the 56 ethnic groups officially recognized by the People’s Republic of China. By extension, the term can apply to groups in Laos, Vietnam, Thailand and Myanmar when Dai is used to mean specifically Tai Yai, Lue, Chinese Shan, Tai Dam, Tai Khao or even Tai in general.” (en.Wikipedia.org/wiki/Dai_people) I therefore coded them as “SE-Asia”
= [
geographic_sample 'Barbados',
'Africa', # sampled in USA
'Nigeria', #Nigeria - 1
'Gambia',
'Kenya',
'Sierra Leone',
'Nigeria', #Nigeria - 2
'Colombia',
'Mexico',
'Peru',
'P. Rico',
'SE-Asia',# Chinese Dai in Xishuangbanna
'China-NE',# Han Chinese - 1; Beijing
'China-SE',# Han Chinese - 2; Southern China
'Japan',
'Vietnam',
'W. Europe', #European Union
'Finland',
'UK',
'Spain',
'Italy',
'Bangladesh',
'Gujarat',#India Gujarat, sampled in UK
'Andhra Pradesh',#India Telugu, sampled in UK
'Punjab',# Punjab, Pakistan
'Sri Lanka']
Finally, this list contains the country where a sample was actually obtained from.
# country where sample was taken
= [
country_sample 'Barbados',
'USA', # African American
'Nigeria', #Nigeria - 1
'Gambia',
'Kenya',
'Sierra Leone',
'Nigeria', #Nigeria - 2
'Colombia',
'Mexico',
'Peru',
'Puerto Rico',
'China',# Chinese Dai in Xishuangbanna
'China',# Han Chinese - 1; Beijing
'China',# Han Chinese - 2; Southern China
'Japan',
'Vietnam',
'USA', #European Union
'Finland',
'UK',
'Spain',
'Italy',
'Bangladesh',
'UK',#India Gujarat, sampled in UK
'UK',#India Telugu, sampled in UK
'Pakistan',# Punjab
'UK']
Names of flag files for use as icons.
For African Americans, the Pan-Africa frag is used, though there is likely a better choice.
Unresolved:
- For European Americans the European Union flag could be used used.
- Chinese Dai?
- Gujarat - Use state flag?
- Telugu - use state flag?
The Chin
# for inserting flags
= ['barbados.png',
icon 'pan_africa_flag.png', # African Union? Pan-Africa
'nigeria.png', #Nigeria
'gambia.png',
'kenya.png',
'sierra_leone.png',
'nigeria.png', #Nigeria
'colombia.png',
'mexico.png',
'peru.png',
'puerto_rico.png',
'china.png',# Chinese Dai in Xishuangbanna
'china.png',# Han Chinese in Beijing
'china.png',# Han Chinese - Southern, China
'japan.png',
'vietnam.png',
'USA.png', #European Union
'finland.png',
'UK.png',
'spain.png',
'italy.png',
'bangladesh.png',
'india.png',#india Gujarat
'india.png',#India Telugu
'pakistan.png',#India Punjab
'sri_lanka.png']
Finalizing distance matrix
Stored and Remove the meta-data columns.
# make copy
= dist_mat_pd[["sup_pop","pop"]]
sup_pop_and_pop
# remove metadata
= dist_mat_pd.drop(["sup_pop","pop"], axis = 1)
dist_mat_only
# save new column names
= dist_mat_only.columns cols_dist_mat_only
Set the columns to None. In case numbers are shown on the final table, I want the diagonal to be blank, which can be done in great_tables. This could also be done with numpy.diagonal, but in this case I knew that the diagonal elements are all 0 (or very very close).
#dist_mat_only[dist_mat_only< 0.01] = None
Scaling:
In order to increase the contrast based on color all elements are scaled relative to the maximum value observed
# find min/max for scaling
= dist_mat_only.max().max()
global_max = (dist_mat_only)/global_max*100 dist_mat_only
Work with upper triangle
Set off diagonal to 0
= np.tril(dist_mat_only) dist_mat_tri0
Reformat to dataframe
= pd.DataFrame.from_records(dist_mat_tri0) dist_mat_only2
Set values to None
< 0.001] = None dist_mat_only2[dist_mat_only2
Add the columns back
= cols_dist_mat_only dist_mat_only2.columns
Looks like numpy un-rounded us
= dist_mat_only2.round() dist_mat_only2
dist_mat_only2.head()
ACB ASW ESN GWD LWK MSL YRI ... IBS TSI BEB GIH ITU PJL STU
0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
1 18.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
2 14.0 23.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
3 15.0 25.0 19.0 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
4 16.0 25.0 16.0 21.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
[5 rows x 26 columns]
=[i + j for i, j in zip(geographic_sample,
location_plus_space "<br>",dist_mat_pd.shape[0]))]
np.repeat(
= [i + j for i, j in zip(location_plus_space,
location_plus_code cols_dist_mat_only)]
Add meta-data
"oneKG_pop"] = cols_dist_mat_only
dist_mat_only2["location"] = geographic_sample #
dist_mat_only2["location_combo"] = location_plus_code #
dist_mat_only2["pop_details"] = pop_details
dist_mat_only2["ethnic_cultural_group"] = ethnic_group
dist_mat_only2["sup_pop"] = sup_pop_orig
dist_mat_only2["sup_pop_alt"] = sup_pop_alt
dist_mat_only2["icon"] = icon dist_mat_only2[
Inserting empty rows
I experimented with this but did not use in the final table
= pd.DataFrame(np.repeat(None,dist_mat_only2.shape[1])).T
filler_row = dist_mat_pd.columns
filler_row.columns = pd.concat([dist_mat_only2.iloc[:7],
dist_mat_only3
filler_row, 7:11],
dist_mat_only2.iloc[
filler_row, 11:15],
dist_mat_only2.iloc[
filler_row, 16:20],
dist_mat_only2.iloc[
filler_row, 21:]
dist_mat_only2.iloc[=True) ]).reset_index(drop
great_tables groupings etc.
The series below define key groups of rows / columns.
# columns that will be numeric in table
= ['ACB','ASW', 'ESN', 'GWD', 'LWK', 'MSL', 'YRI',
num_cols 'CLM', 'MXL', 'PEL', 'PUR',
'CDX', 'CHB', 'CHS', 'JPT', 'KHV',
'CEU','FIN','GBR','IBS','TSI',
'BEB', 'GIH', 'ITU', 'PJL', 'STU']
# super population groups
## for creating spanners in table
= ['ACB','ASW','ESN','GWD','LWK','MSL','YRI']
africa_AFR = ['CLM','MXL','PEL','PUR']
americas_AMR = ['CDX','CHB','CHS','JPT','KHV']
east_asia_EAS = ['CEU','FIN','GBR','IBS','TSI']
europe_EUR = ['BEB','GIH','ITU','PJL','STU'] southasia_SAS
Preview the full table with bare call to GT()
GT(dist_mat_only2)
ACB | ASW | ESN | GWD | LWK | MSL | YRI | CLM | MXL | PEL | PUR | CDX | CHB | CHS | JPT | KHV | CEU | FIN | GBR | IBS | TSI | BEB | GIH | ITU | PJL | STU | oneKG_pop | location | location_combo | pop_details | ethnic_cultural_group | sup_pop | sup_pop_alt | icon |
ACB | Barbados | Barbados ACB |
African Caribbean in Barbados | Afro-Caribbean | AFR | barbados.png | |||||||||||||||||||||||||||
18.0 | ASW | Africa | Africa ASW |
African Ancestry in SW USA | African American | AFR | pan_africa_flag.png | ||||||||||||||||||||||||||
14.0 | 23.0 | ESN | Nigeria | Nigeria ESN |
Esan in Nigeria | Esan | AFR | nigeria.png | |||||||||||||||||||||||||
15.0 | 25.0 | 19.0 | GWD | Gambia | Gambia GWD |
Gambian in Western Division – Mandinka | Mandinka | AFR | AFR | gambia.png | |||||||||||||||||||||||
16.0 | 25.0 | 16.0 | 21.0 | LWK | Kenya | Kenya LWK |
Luhya in Webuye, Kenya | Luhya | AFR | kenya.png | |||||||||||||||||||||||
17.0 | 25.0 | 18.0 | 17.0 | 19.0 | MSL | Sierra Leone | Sierra Leone MSL |
Mende in Sierra Leone | Mende | AFR | sierra_leone.png | ||||||||||||||||||||||
16.0 | 24.0 | 15.0 | 19.0 | 19.0 | 16.0 | YRI | Nigeria | Nigeria YRI |
Yoruba in Ibadan, Nigeria | Yoruba | AFR | nigeria.png | |||||||||||||||||||||
63.0 | 53.0 | 70.0 | 67.0 | 69.0 | 72.0 | 71.0 | CLM | Colombia | Colombia CLM |
Colombian in Medellín, Colombia | Colombian | AMR | colombia.png | ||||||||||||||||||||
61.0 | 54.0 | 68.0 | 66.0 | 67.0 | 70.0 | 69.0 | 22.0 | MXL | Mexico | Mexico MXL |
Mexican Ancestry in L.A. CA, USA | Mexican-American | AMR | AMR | mexico.png | ||||||||||||||||||
64.0 | 58.0 | 71.0 | 68.0 | 69.0 | 72.0 | 71.0 | 28.0 | 15.0 | PEL | Peru | Peru PEL |
Peruvian in Lima Peru | Peruvian | AMR | peru.png | ||||||||||||||||||
60.0 | 49.0 | 67.0 | 64.0 | 67.0 | 69.0 | 68.0 | 14.0 | 27.0 | 34.0 | PUR | P. Rico | P. Rico PUR |
Puerto Rican in Puerto Rico | Puerto Rican | AMR | puerto_rico.png | |||||||||||||||||
77.0 | 77.0 | 81.0 | 81.0 | 79.0 | 83.0 | 84.0 | 68.0 | 52.0 | 56.0 | 71.0 | CDX | SE-Asia | SE-Asia CDX |
Chinese Dai in Xishuangbanna, China | Dai | EAS | china.png | ||||||||||||||||
77.0 | 77.0 | 82.0 | 81.0 | 80.0 | 83.0 | 84.0 | 69.0 | 53.0 | 57.0 | 72.0 | 16.0 | CHB | China-NE | China-NE CHB |
Han Chinese in Beijing, China | Han-North | EAS | EAS | china.png | ||||||||||||||
77.0 | 76.0 | 82.0 | 81.0 | 79.0 | 83.0 | 83.0 | 67.0 | 51.0 | 56.0 | 70.0 | 11.0 | 8.0 | CHS | China-SE | China-SE CHS |
Han Chinese - Southern, China | Han-South | EAS | EAS | china.png | |||||||||||||
82.0 | 83.0 | 86.0 | 85.0 | 84.0 | 88.0 | 88.0 | 75.0 | 59.0 | 63.0 | 79.0 | 21.0 | 16.0 | 16.0 | JPT | Japan | Japan JPT |
Japanese in Tokyo, Japan | Japanese | EAS | japan.png | |||||||||||||
75.0 | 74.0 | 79.0 | 78.0 | 77.0 | 81.0 | 81.0 | 66.0 | 51.0 | 56.0 | 69.0 | 8.0 | 14.0 | 10.0 | 21.0 | KHV | Vietnam | Vietnam KHV |
Kinh in Ho Chi Minh City, Vietnam | Kinh | EAS | vietnam.png | ||||||||||||
77.0 | 66.0 | 83.0 | 81.0 | 84.0 | 86.0 | 86.0 | 27.0 | 43.0 | 48.0 | 25.0 | 87.0 | 88.0 | 86.0 | 95.0 | 85.0 | CEU | W. Europe | W. Europe CEU |
Utah Residents (CEPH) with Northern and Western European ancestry | Euro-American | EUR | USA.png | |||||||||||
82.0 | 72.0 | 88.0 | 85.0 | 89.0 | 91.0 | 91.0 | 37.0 | 52.0 | 54.0 | 38.0 | 93.0 | 95.0 | 93.0 | 100.0 | 92.0 | 18.0 | FIN | Finland | Finland FIN |
Finnish in Finland | Finnish | EUR | finland.png | ||||||||||
76.0 | 65.0 | 82.0 | 79.0 | 83.0 | 85.0 | 84.0 | 27.0 | 43.0 | 46.0 | 26.0 | 88.0 | 89.0 | 87.0 | 95.0 | 86.0 | 9.0 | 16.0 | GBR | UK | UK GBR |
British from England and Scotland | British / Scottish | EUR | EUR | UK.png | ||||||||
69.0 | 58.0 | 76.0 | 73.0 | 76.0 | 78.0 | 77.0 | 17.0 | 31.0 | 39.0 | 14.0 | 75.0 | 76.0 | 74.0 | 83.0 | 73.0 | 18.0 | 33.0 | 21.0 | IBS | Spain | Spain IBS |
Iberian Populations in Spain | Iberian | EUR | spain.png | ||||||||
69.0 | 58.0 | 76.0 | 74.0 | 76.0 | 78.0 | 78.0 | 19.0 | 35.0 | 40.0 | 15.0 | 79.0 | 79.0 | 78.0 | 87.0 | 77.0 | 16.0 | 30.0 | 19.0 | 10.0 | TSI | Italy | Italy TSI |
Toscani in Italia | Tuscan | EUR | italy.png | |||||||
56.0 | 48.0 | 61.0 | 60.0 | 61.0 | 62.0 | 60.0 | 38.0 | 36.0 | 40.0 | 37.0 | 68.0 | 65.0 | 65.0 | 72.0 | 64.0 | 55.0 | 66.0 | 55.0 | 43.0 | 45.0 | BEB | Bangladesh | Bangladesh BEB |
Bengali in Bangladesh | Bengali | SAS | bangladesh.png | ||||||
57.0 | 49.0 | 63.0 | 62.0 | 62.0 | 64.0 | 62.0 | 39.0 | 36.0 | 42.0 | 36.0 | 65.0 | 61.0 | 62.0 | 71.0 | 61.0 | 55.0 | 68.0 | 56.0 | 42.0 | 43.0 | 15.0 | GIH | Gujarat | Gujarat GIH |
Gujarati Indians in Houston, Texas, USA | Gujarati | SAS | india.png | |||||
56.0 | 48.0 | 61.0 | 60.0 | 61.0 | 61.0 | 59.0 | 45.0 | 43.0 | 46.0 | 43.0 | 72.0 | 69.0 | 70.0 | 77.0 | 69.0 | 61.0 | 73.0 | 61.0 | 50.0 | 51.0 | 11.0 | 18.0 | ITU | Andhra Pradesh | Andhra Pradesh ITU |
Indian Telugu in the U.K. | Telugu | SAS | SAS | india.png | |||
58.0 | 48.0 | 64.0 | 62.0 | 63.0 | 65.0 | 63.0 | 26.0 | 28.0 | 33.0 | 25.0 | 67.0 | 66.0 | 65.0 | 74.0 | 65.0 | 44.0 | 55.0 | 43.0 | 32.0 | 33.0 | 19.0 | 20.0 | 24.0 | PJL | Punjab | Punjab PJL |
Punjabi in Lahore, Pakistan [PJL] | Punjabi | SAS | pakistan.png | |||
57.0 | 50.0 | 61.0 | 61.0 | 61.0 | 61.0 | 59.0 | 47.0 | 45.0 | 48.0 | 46.0 | 72.0 | 68.0 | 69.0 | 76.0 | 69.0 | 64.0 | 75.0 | 64.0 | 53.0 | 54.0 | 12.0 | 20.0 | 9.0 | 27.0 | STU | Sri Lanka | Sri Lanka STU |
Sri Lankan Tamil in the UK | Tamil | SAS | sri_lanka.png |
Hidden columns:
- ‘location’
- ‘pop_details’
- ‘sup_pop’ (from sup_pop_orig)
- ‘sup_pop_alt’
Used:
- ‘icon’
- ‘oneKG_pop’
- ‘ethnic_cultural_group’
Preview the table with all major elements (not rendered)
from great_tables import GT, md, html, style, loc
# make my table
(
GT(dist_mat_only2).tab_style(=style.text(size = "0"),
style=loc.body(columns=num_cols)
locations
)="Genetic distance between 1000 Genomes populations",
.tab_header(title="Subtitle goes here"
subtitle
)=[
.cols_hide(columns"location", previously "country"
"pop_details",
"sup_pop",
"sup_pop_alt"])
"icon", path="icons/")
.fmt_image(#.tab_stub(
# rowname_col="row",
# groupname_col="sup_pop"
#)
.tab_spanner(="Group",
label=[
columns#"sup_pop",
"oneKG_pop",
"icon",
"location"
]
)
.tab_spanner(="Africa",
label=africa_AFR
columns
)
.tab_spanner(="Americas",
label=americas_AMR
columns
)
.tab_spanner(="East Asia",
label=east_asia_EAS
columns
)
.tab_spanner(="Western Europe",
label=europe_EUR
columns
)
.tab_spanner(="South Asia",
label=southasia_SAS
columns
)
.tab_spanner(="Genetic Distances",
label=num_cols
columns
)
.tab_spanner(="Notes",
label=[
columns"ethnic_cultural_group"#,
#"country_sampling"
]
)#.cols_move_to_start(columns=[#"sup_pop",
# "location",
# "oneKG_pop",
# "icon"])
#.cols_label(
# #sup_pop = "Super population",
# oneKG_pop = "Population Code",
# #country_sampling = "Sampling location",
# ethnic_cultural_group = "Ethnic / <br> Cultural Group",
# icon = "Geographic origin"
#)
="center")
.cols_align(align=num_cols, decimals=0, use_seps=False)
.fmt_number(columns#.sub_zero(zero_text='nil')
.data_color(=[90, 0],
domain=["white", "orange"],
palette="white",
na_color
)="Add citation")
.tab_source_note(source_note
)
Save the final file
# Save final table
"dist_mat_final.csv", index= None) dist_mat_only2.to_csv(