Final Data Preparation

Author

Nathan Brouwer (brouwern@gmail.com)

Introduction

Data for this project was processed from the original download in .gz format. This script makes final changes related to formatting the data for the table, including setting adding metadata columns.

Preliminaries

Load reticulated so I can work in Python while in RStudio:

library(reticulate)
#py_install("pandas")
#py_install("polars")
#py_install("seaborn")

Load the Python packages:

#import polars as pl
import pandas as pd
import numpy as np
from great_tables import GT, md, html, style, loc

Load the distance matrix. The _pd suffix is a reminder that I’m working in Pandas. Eventually I hope to port as much of these final tasks to Polars.

dist_mat_pd = pd.read_csv("distance_matrix.csv")

Finalizing dataframe

Additional meta-data

The dataframe produced from the previous script is a square matrix with two columns of meta-data appended on the far left. (sup_pop = super population, pop = population)

dist_mat_pd.head()

  sup_pop  pop       ACB       ASW  ...        GIH        ITU        PJL        STU
0     AFR  ACB  0.000006  4.319538  ...  13.771366  13.447868  13.840908  13.605249
1     AFR  ASW  4.319538  0.000000  ...  11.767316  11.552621  11.581944  11.916708
2     AFR  ESN  3.248675  5.605990  ...  15.177879  14.626126  15.380010  14.665675
3     AFR  GWD  3.705924  5.986606  ...  14.915595  14.495067  14.798274  14.546724
4     AFR  LWK  3.790766  5.945784  ...  14.968611  14.537910  15.213237  14.698901

[5 rows x 28 columns]

The super populations are already in the table, but here they are in a list for reference:

sup_pop_orig = ["AFR","AFR","AFR","AFR","AFR","AFR","AFR",
                "AMR","AMR","AMR","AMR",
                "EAS","EAS","EAS","EAS","EAS",
                "EUR","EUR","EUR","EUR","EUR",
                "SAS","SAS","SAS","SAS","SAS"]

I’ll create a column labeling super populations in the margin. This is a workaround b/c I’m having issues with row stubs.

sup_pop_alt = ["","","","AFR","","","",
               "","AMR","","",
               "","EAS","EAS","","",
               "","","EUR","","",
               "","","SAS","",""]

Expanded background information about each population is shown below. I’ll edit this by hand to make further columns.

# full population details - for notes columns
pop_details = ['African Caribbean in Barbados', #
 'African Ancestry in SW USA',
 'Esan in Nigeria',
 'Gambian in Western Division – Mandinka',
 'Luhya in Webuye, Kenya',
 'Mende in Sierra Leone',
 'Yoruba in Ibadan, Nigeria',
 'Colombian in Medellín, Colombia',
 'Mexican Ancestry in L.A. CA, USA',
 'Peruvian in Lima Peru',
 'Puerto Rican in Puerto Rico',
 'Chinese Dai in Xishuangbanna, China', #Chin
 'Han Chinese in Beijing, China',
 'Han Chinese - Southern, China',
 'Japanese in Tokyo, Japan',
 'Kinh in Ho Chi Minh City, Vietnam',
 'Utah Residents (CEPH) with Northern and Western European ancestry',
 'Finnish in Finland',
 'British from England and Scotland',
 'Iberian Populations in Spain',
 'Toscani in Italia',
 'Bengali in Bangladesh',
 'Gujarati Indians in Houston, Texas, USA',
 'Indian Telugu in the U.K.',
 'Punjabi in Lahore, Pakistan [PJL]',
 'Sri Lankan Tamil in the UK']

Based on the information from the full population details, I assigned approximate ethnic, cultural, or similar groups that share demographic or cultural connections. These are approximate (I’m not a social scientist) and based only on information provided by the 1000 Genomes project. Feel free to contact me at brouwern at gmail.com with any suggested updates.

ethnic_group = [
 'Afro-Caribbean',
 'African American',
 'Esan',
 'Mandinka',
 'Luhya',
 'Mende',
 'Yoruba',
 'Colombian',
 'Mexican-American',
 'Peruvian',
 'Puerto Rican',
 'Dai', # China
 'Han-North',
 'Han-South',
 'Japanese',
 'Kinh', # in Ho Chi Minh City, Vietnam
 'Euro-American',
 'Finnish',
 'British / Scottish',
 'Iberian',
 'Tuscan',
 'Bengali',
 'Gujarati', # Indians in Houston, Texas, USA
 'Telugu', # in the U.K.
 'Punjabi', # in Lahore, Pakistan [PJL]
 'Tamil'] #Sri Lankan Tamil in the UK

This next list contains the country the samples were collected in, their historical geographic origins, or where the group has the largest populations. In several cases data were collected “in diaspora” and the geographic location assigned to the origin as reported by the 1000 Genomes Project. Therefore, Indians living in the UK were assigned to India. For African-American and European-Americans, “Africa” and “Europe” were used.

Note that the Dai were sampled from China. Wikipedia states “The Dai people form one of the 56 ethnic groups officially recognized by the People’s Republic of China. By extension, the term can apply to groups in Laos, Vietnam, Thailand and Myanmar when Dai is used to mean specifically Tai Yai, Lue, Chinese Shan, Tai Dam, Tai Khao or even Tai in general.” (en.Wikipedia.org/wiki/Dai_people) I therefore coded them as “SE-Asia”

geographic_sample = [
 'Barbados',
 'Africa', # sampled in USA
 'Nigeria', #Nigeria - 1
 'Gambia',
 'Kenya',
 'Sierra Leone',
 'Nigeria', #Nigeria - 2
 'Colombia',
 'Mexico',
 'Peru',
 'P. Rico',
 'SE-Asia',# Chinese Dai in Xishuangbanna
 'China-NE',# Han Chinese - 1; Beijing
 'China-SE',# Han Chinese - 2; Southern China
 'Japan',
 'Vietnam',
 'W. Europe', #European Union
 'Finland',
 'UK',
 'Spain',
 'Italy',
 'Bangladesh',
 'Gujarat',#India Gujarat, sampled in UK
 'Andhra Pradesh',#India Telugu, sampled in UK
 'Punjab',# Punjab, Pakistan
 'Sri Lanka']

Finally, this list contains the country where a sample was actually obtained from.

# country where sample was taken
country_sample = [
 'Barbados',
 'USA', # African American
 'Nigeria', #Nigeria - 1
 'Gambia',
 'Kenya',
 'Sierra Leone',
 'Nigeria', #Nigeria - 2
 'Colombia',
 'Mexico',
 'Peru',
 'Puerto Rico',
 'China',# Chinese Dai in Xishuangbanna
 'China',# Han Chinese - 1; Beijing
 'China',# Han Chinese - 2; Southern China
 'Japan',
 'Vietnam',
 'USA', #European Union
 'Finland',
 'UK',
 'Spain',
 'Italy',
 'Bangladesh',
 'UK',#India Gujarat, sampled in UK
 'UK',#India Telugu, sampled in UK
 'Pakistan',# Punjab
 'UK']

Names of flag files for use as icons.

For African Americans, the Pan-Africa frag is used, though there is likely a better choice.

Unresolved:

For European Americans the European Union flag could be used used.
Chinese Dai?
Gujarat - Use state flag?
Telugu - use state flag?

The Chin

# for inserting flags
icon = ['barbados.png',
 'pan_africa_flag.png', # African Union? Pan-Africa
 'nigeria.png', #Nigeria
 'gambia.png',
 'kenya.png',
 'sierra_leone.png',
 'nigeria.png', #Nigeria
 'colombia.png',
 'mexico.png',
 'peru.png',
 'puerto_rico.png',
 'china.png',# Chinese Dai in Xishuangbanna
 'china.png',# Han Chinese in Beijing
 'china.png',# Han Chinese - Southern, China
 'japan.png',
 'vietnam.png',
 'USA.png', #European Union
 'finland.png',
 'UK.png',
 'spain.png',
 'italy.png',
 'bangladesh.png',
 'india.png',#india Gujarat
 'india.png',#India Telugu
 'pakistan.png',#India  Punjab
 'sri_lanka.png']

Finalizing distance matrix

Stored and Remove the meta-data columns.

# make copy
sup_pop_and_pop = dist_mat_pd[["sup_pop","pop"]]


# remove metadata
dist_mat_only = dist_mat_pd.drop(["sup_pop","pop"], axis = 1)

# save new column names
cols_dist_mat_only = dist_mat_only.columns

Set the columns to None. In case numbers are shown on the final table, I want the diagonal to be blank, which can be done in great_tables. This could also be done with numpy.diagonal, but in this case I knew that the diagonal elements are all 0 (or very very close).

#dist_mat_only[dist_mat_only< 0.01] = None

Scaling:

In order to increase the contrast based on color all elements are scaled relative to the maximum value observed

# find min/max for scaling
global_max = dist_mat_only.max().max()
dist_mat_only = (dist_mat_only)/global_max*100

Work with upper triangle

Set off diagonal to 0

dist_mat_tri0 = np.tril(dist_mat_only)

Reformat to dataframe

dist_mat_only2 = pd.DataFrame.from_records(dist_mat_tri0)

Set values to None

dist_mat_only2[dist_mat_only2< 0.001] = None

Add the columns back

dist_mat_only2.columns = cols_dist_mat_only

Looks like numpy un-rounded us

dist_mat_only2 = dist_mat_only2.round()

dist_mat_only2.head()

    ACB   ASW   ESN   GWD  LWK  MSL  YRI  ...  IBS  TSI  BEB  GIH  ITU  PJL  STU
0   NaN   NaN   NaN   NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  18.0   NaN   NaN   NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2  14.0  23.0   NaN   NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3  15.0  25.0  19.0   NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4  16.0  25.0  16.0  21.0  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN

[5 rows x 26 columns]

location_plus_space =[i + j for i, j in zip(geographic_sample,
                                            np.repeat("<br>",dist_mat_pd.shape[0]))]


location_plus_code = [i + j for i, j in zip(location_plus_space,
cols_dist_mat_only)]

Add meta-data

dist_mat_only2["oneKG_pop"] = cols_dist_mat_only
dist_mat_only2["location"] = geographic_sample #
dist_mat_only2["location_combo"] = location_plus_code #
dist_mat_only2["pop_details"] = pop_details
dist_mat_only2["ethnic_cultural_group"] = ethnic_group
dist_mat_only2["sup_pop"] = sup_pop_orig
dist_mat_only2["sup_pop_alt"] = sup_pop_alt
dist_mat_only2["icon"] = icon

Inserting empty rows

I experimented with this but did not use in the final table

filler_row = pd.DataFrame(np.repeat(None,dist_mat_only2.shape[1])).T
filler_row.columns = dist_mat_pd.columns
dist_mat_only3 = pd.concat([dist_mat_only2.iloc[:7], 
              filler_row, 
              dist_mat_only2.iloc[7:11],
              filler_row, 
              dist_mat_only2.iloc[11:15],
              filler_row, 
              dist_mat_only2.iloc[16:20],
              filler_row, 
              dist_mat_only2.iloc[21:]
              ]).reset_index(drop=True)

great_tables groupings etc.

The series below define key groups of rows / columns.

# columns that will be numeric in table
num_cols = ['ACB','ASW', 'ESN', 'GWD', 'LWK', 'MSL', 'YRI',
            'CLM', 'MXL', 'PEL', 'PUR',
            'CDX', 'CHB', 'CHS', 'JPT', 'KHV',
            'CEU','FIN','GBR','IBS','TSI', 
            'BEB', 'GIH', 'ITU', 'PJL', 'STU']

# super population groups
## for creating spanners in table
africa_AFR    = ['ACB','ASW','ESN','GWD','LWK','MSL','YRI']
americas_AMR  = ['CLM','MXL','PEL','PUR']
east_asia_EAS = ['CDX','CHB','CHS','JPT','KHV']
europe_EUR    = ['CEU','FIN','GBR','IBS','TSI']
southasia_SAS = ['BEB','GIH','ITU','PJL','STU']

Preview the full table with bare call to GT()

GT(dist_mat_only2)

ACB	ASW	ESN	GWD	LWK	MSL	YRI	CLM	MXL	PEL	PUR	CDX	CHB	CHS	JPT	KHV	CEU	FIN	GBR	IBS	TSI	BEB	GIH	ITU	PJL	STU	oneKG_pop	location	location_combo	pop_details	ethnic_cultural_group	sup_pop	sup_pop_alt	icon
																										ACB	Barbados	Barbados ACB	African Caribbean in Barbados	Afro-Caribbean	AFR		barbados.png
18.0																										ASW	Africa	Africa ASW	African Ancestry in SW USA	African American	AFR		pan_africa_flag.png
14.0	23.0																									ESN	Nigeria	Nigeria ESN	Esan in Nigeria	Esan	AFR		nigeria.png
15.0	25.0	19.0																								GWD	Gambia	Gambia GWD	Gambian in Western Division – Mandinka	Mandinka	AFR	AFR	gambia.png
16.0	25.0	16.0	21.0																							LWK	Kenya	Kenya LWK	Luhya in Webuye, Kenya	Luhya	AFR		kenya.png
17.0	25.0	18.0	17.0	19.0																						MSL	Sierra Leone	Sierra Leone MSL	Mende in Sierra Leone	Mende	AFR		sierra_leone.png
16.0	24.0	15.0	19.0	19.0	16.0																					YRI	Nigeria	Nigeria YRI	Yoruba in Ibadan, Nigeria	Yoruba	AFR		nigeria.png
63.0	53.0	70.0	67.0	69.0	72.0	71.0																				CLM	Colombia	Colombia CLM	Colombian in Medellín, Colombia	Colombian	AMR		colombia.png
61.0	54.0	68.0	66.0	67.0	70.0	69.0	22.0																			MXL	Mexico	Mexico MXL	Mexican Ancestry in L.A. CA, USA	Mexican-American	AMR	AMR	mexico.png
64.0	58.0	71.0	68.0	69.0	72.0	71.0	28.0	15.0																		PEL	Peru	Peru PEL	Peruvian in Lima Peru	Peruvian	AMR		peru.png
60.0	49.0	67.0	64.0	67.0	69.0	68.0	14.0	27.0	34.0																	PUR	P. Rico	P. Rico PUR	Puerto Rican in Puerto Rico	Puerto Rican	AMR		puerto_rico.png
77.0	77.0	81.0	81.0	79.0	83.0	84.0	68.0	52.0	56.0	71.0																CDX	SE-Asia	SE-Asia CDX	Chinese Dai in Xishuangbanna, China	Dai	EAS		china.png
77.0	77.0	82.0	81.0	80.0	83.0	84.0	69.0	53.0	57.0	72.0	16.0															CHB	China-NE	China-NE CHB	Han Chinese in Beijing, China	Han-North	EAS	EAS	china.png
77.0	76.0	82.0	81.0	79.0	83.0	83.0	67.0	51.0	56.0	70.0	11.0	8.0														CHS	China-SE	China-SE CHS	Han Chinese - Southern, China	Han-South	EAS	EAS	china.png
82.0	83.0	86.0	85.0	84.0	88.0	88.0	75.0	59.0	63.0	79.0	21.0	16.0	16.0													JPT	Japan	Japan JPT	Japanese in Tokyo, Japan	Japanese	EAS		japan.png
75.0	74.0	79.0	78.0	77.0	81.0	81.0	66.0	51.0	56.0	69.0	8.0	14.0	10.0	21.0												KHV	Vietnam	Vietnam KHV	Kinh in Ho Chi Minh City, Vietnam	Kinh	EAS		vietnam.png
77.0	66.0	83.0	81.0	84.0	86.0	86.0	27.0	43.0	48.0	25.0	87.0	88.0	86.0	95.0	85.0											CEU	W. Europe	W. Europe CEU	Utah Residents (CEPH) with Northern and Western European ancestry	Euro-American	EUR		USA.png
82.0	72.0	88.0	85.0	89.0	91.0	91.0	37.0	52.0	54.0	38.0	93.0	95.0	93.0	100.0	92.0	18.0										FIN	Finland	Finland FIN	Finnish in Finland	Finnish	EUR		finland.png
76.0	65.0	82.0	79.0	83.0	85.0	84.0	27.0	43.0	46.0	26.0	88.0	89.0	87.0	95.0	86.0	9.0	16.0									GBR	UK	UK GBR	British from England and Scotland	British / Scottish	EUR	EUR	UK.png
69.0	58.0	76.0	73.0	76.0	78.0	77.0	17.0	31.0	39.0	14.0	75.0	76.0	74.0	83.0	73.0	18.0	33.0	21.0								IBS	Spain	Spain IBS	Iberian Populations in Spain	Iberian	EUR		spain.png
69.0	58.0	76.0	74.0	76.0	78.0	78.0	19.0	35.0	40.0	15.0	79.0	79.0	78.0	87.0	77.0	16.0	30.0	19.0	10.0							TSI	Italy	Italy TSI	Toscani in Italia	Tuscan	EUR		italy.png
56.0	48.0	61.0	60.0	61.0	62.0	60.0	38.0	36.0	40.0	37.0	68.0	65.0	65.0	72.0	64.0	55.0	66.0	55.0	43.0	45.0						BEB	Bangladesh	Bangladesh BEB	Bengali in Bangladesh	Bengali	SAS		bangladesh.png
57.0	49.0	63.0	62.0	62.0	64.0	62.0	39.0	36.0	42.0	36.0	65.0	61.0	62.0	71.0	61.0	55.0	68.0	56.0	42.0	43.0	15.0					GIH	Gujarat	Gujarat GIH	Gujarati Indians in Houston, Texas, USA	Gujarati	SAS		india.png
56.0	48.0	61.0	60.0	61.0	61.0	59.0	45.0	43.0	46.0	43.0	72.0	69.0	70.0	77.0	69.0	61.0	73.0	61.0	50.0	51.0	11.0	18.0				ITU	Andhra Pradesh	Andhra Pradesh ITU	Indian Telugu in the U.K.	Telugu	SAS	SAS	india.png
58.0	48.0	64.0	62.0	63.0	65.0	63.0	26.0	28.0	33.0	25.0	67.0	66.0	65.0	74.0	65.0	44.0	55.0	43.0	32.0	33.0	19.0	20.0	24.0			PJL	Punjab	Punjab PJL	Punjabi in Lahore, Pakistan [PJL]	Punjabi	SAS		pakistan.png
57.0	50.0	61.0	61.0	61.0	61.0	59.0	47.0	45.0	48.0	46.0	72.0	68.0	69.0	76.0	69.0	64.0	75.0	64.0	53.0	54.0	12.0	20.0	9.0	27.0		STU	Sri Lanka	Sri Lanka STU	Sri Lankan Tamil in the UK	Tamil	SAS		sri_lanka.png

Hidden columns:

‘location’
‘pop_details’
‘sup_pop’ (from sup_pop_orig)
‘sup_pop_alt’

Used:

‘icon’
‘oneKG_pop’
‘ethnic_cultural_group’

Preview the table with all major elements (not rendered)

from great_tables import GT, md, html, style, loc
# make my table
(
    GT(dist_mat_only2).tab_style(
        style=style.text(size = "0"),
        locations=loc.body(columns=num_cols)
    )
    .tab_header(title="Genetic distance between 1000 Genomes populations",
     subtitle="Subtitle goes here"
     )
    .cols_hide(columns=[
      "location", previously "country"
      "pop_details",
      "sup_pop",
      "sup_pop_alt"])
    .fmt_image("icon", path="icons/")
    #.tab_stub(
    #  rowname_col="row", 
    #  groupname_col="sup_pop"
    #)
    .tab_spanner(
        label="Group",
        columns=[
                #"sup_pop", 
                "oneKG_pop",
                "icon",
                "location"
                ]
    )
    .tab_spanner(
        label="Africa",
        columns=africa_AFR
    )
    .tab_spanner(
        label="Americas",
        columns=americas_AMR
    )
    .tab_spanner(
        label="East Asia",
        columns=east_asia_EAS
    )
    .tab_spanner(
        label="Western Europe",
        columns=europe_EUR
    )
    .tab_spanner(
        label="South Asia",
        columns=southasia_SAS
    )
    .tab_spanner(
        label="Genetic Distances",
        columns=num_cols
    )
    .tab_spanner(
        label="Notes",
        columns=[
                 "ethnic_cultural_group"#,
                 #"country_sampling"
                 ]
    )
    #.cols_move_to_start(columns=[#"sup_pop", 
    #                            "location",
    #                            "oneKG_pop",
    #                            "icon"])
    #.cols_label(
    #    #sup_pop = "Super population",
    #    oneKG_pop = "Population Code",
    #    #country_sampling = "Sampling location",
    #    ethnic_cultural_group = "Ethnic / <br> Cultural Group",
    #    icon = "Geographic origin"
    #)
    .cols_align(align="center")
    .fmt_number(columns=num_cols, decimals=0, use_seps=False)
    #.sub_zero(zero_text='nil')
    .data_color(
        domain=[90, 0],
        palette=["white", "orange"],
        na_color="white",
    )
    .tab_source_note(source_note="Add citation")


)

Save the final file

# Save final table
dist_mat_only2.to_csv("dist_mat_final.csv", index= None)