1000genomes
  • Home
  • About
  • Final Data Preparation

Final Data Preparation

Author

Nathan Brouwer (brouwern@gmail.com)

Introduction

Data for this project was processed from the original download in .gz format. This script makes final changes related to formatting the data for the table, including setting adding metadata columns.

Preliminaries

Load reticulated so I can work in Python while in RStudio:

library(reticulate)
#py_install("pandas")
#py_install("polars")
#py_install("seaborn")

Load the Python packages:

#import polars as pl
import pandas as pd
import numpy as np
from great_tables import GT, md, html, style, loc

Load the distance matrix. The _pd suffix is a reminder that I’m working in Pandas. Eventually I hope to port as much of these final tasks to Polars.

dist_mat_pd = pd.read_csv("distance_matrix.csv")

Finalizing dataframe

Additional meta-data

The dataframe produced from the previous script is a square matrix with two columns of meta-data appended on the far left. (sup_pop = super population, pop = population)

dist_mat_pd.head()
  sup_pop  pop       ACB       ASW  ...        GIH        ITU        PJL        STU
0     AFR  ACB  0.000006  4.319538  ...  13.771366  13.447868  13.840908  13.605249
1     AFR  ASW  4.319538  0.000000  ...  11.767316  11.552621  11.581944  11.916708
2     AFR  ESN  3.248675  5.605990  ...  15.177879  14.626126  15.380010  14.665675
3     AFR  GWD  3.705924  5.986606  ...  14.915595  14.495067  14.798274  14.546724
4     AFR  LWK  3.790766  5.945784  ...  14.968611  14.537910  15.213237  14.698901

[5 rows x 28 columns]

The super populations are already in the table, but here they are in a list for reference:

sup_pop_orig = ["AFR","AFR","AFR","AFR","AFR","AFR","AFR",
                "AMR","AMR","AMR","AMR",
                "EAS","EAS","EAS","EAS","EAS",
                "EUR","EUR","EUR","EUR","EUR",
                "SAS","SAS","SAS","SAS","SAS"]

I’ll create a column labeling super populations in the margin. This is a workaround b/c I’m having issues with row stubs.

sup_pop_alt = ["","","","AFR","","","",
               "","AMR","","",
               "","EAS","EAS","","",
               "","","EUR","","",
               "","","SAS","",""]

Expanded background information about each population is shown below. I’ll edit this by hand to make further columns.

# full population details - for notes columns
pop_details = ['African Caribbean in Barbados', #
 'African Ancestry in SW USA',
 'Esan in Nigeria',
 'Gambian in Western Division – Mandinka',
 'Luhya in Webuye, Kenya',
 'Mende in Sierra Leone',
 'Yoruba in Ibadan, Nigeria',
 'Colombian in Medellín, Colombia',
 'Mexican Ancestry in L.A. CA, USA',
 'Peruvian in Lima Peru',
 'Puerto Rican in Puerto Rico',
 'Chinese Dai in Xishuangbanna, China', #Chin
 'Han Chinese in Beijing, China',
 'Han Chinese - Southern, China',
 'Japanese in Tokyo, Japan',
 'Kinh in Ho Chi Minh City, Vietnam',
 'Utah Residents (CEPH) with Northern and Western European ancestry',
 'Finnish in Finland',
 'British from England and Scotland',
 'Iberian Populations in Spain',
 'Toscani in Italia',
 'Bengali in Bangladesh',
 'Gujarati Indians in Houston, Texas, USA',
 'Indian Telugu in the U.K.',
 'Punjabi in Lahore, Pakistan [PJL]',
 'Sri Lankan Tamil in the UK']

Based on the information from the full population details, I assigned approximate ethnic, cultural, or similar groups that share demographic or cultural connections. These are approximate (I’m not a social scientist) and based only on information provided by the 1000 Genomes project. Feel free to contact me at brouwern at gmail.com with any suggested updates.

ethnic_group = [
 'Afro-Caribbean',
 'African American',
 'Esan',
 'Mandinka',
 'Luhya',
 'Mende',
 'Yoruba',
 'Colombian',
 'Mexican-American',
 'Peruvian',
 'Puerto Rican',
 'Dai', # China
 'Han-North',
 'Han-South',
 'Japanese',
 'Kinh', # in Ho Chi Minh City, Vietnam
 'Euro-American',
 'Finnish',
 'British / Scottish',
 'Iberian',
 'Tuscan',
 'Bengali',
 'Gujarati', # Indians in Houston, Texas, USA
 'Telugu', # in the U.K.
 'Punjabi', # in Lahore, Pakistan [PJL]
 'Tamil'] #Sri Lankan Tamil in the UK

This next list contains the country the samples were collected in, their historical geographic origins, or where the group has the largest populations. In several cases data were collected “in diaspora” and the geographic location assigned to the origin as reported by the 1000 Genomes Project. Therefore, Indians living in the UK were assigned to India. For African-American and European-Americans, “Africa” and “Europe” were used.

Note that the Dai were sampled from China. Wikipedia states “The Dai people form one of the 56 ethnic groups officially recognized by the People’s Republic of China. By extension, the term can apply to groups in Laos, Vietnam, Thailand and Myanmar when Dai is used to mean specifically Tai Yai, Lue, Chinese Shan, Tai Dam, Tai Khao or even Tai in general.” (en.Wikipedia.org/wiki/Dai_people) I therefore coded them as “SE-Asia”

geographic_sample = [
 'Barbados',
 'Africa', # sampled in USA
 'Nigeria', #Nigeria - 1
 'Gambia',
 'Kenya',
 'Sierra Leone',
 'Nigeria', #Nigeria - 2
 'Colombia',
 'Mexico',
 'Peru',
 'P. Rico',
 'SE-Asia',# Chinese Dai in Xishuangbanna
 'China-NE',# Han Chinese - 1; Beijing
 'China-SE',# Han Chinese - 2; Southern China
 'Japan',
 'Vietnam',
 'W. Europe', #European Union
 'Finland',
 'UK',
 'Spain',
 'Italy',
 'Bangladesh',
 'Gujarat',#India Gujarat, sampled in UK
 'Andhra Pradesh',#India Telugu, sampled in UK
 'Punjab',# Punjab, Pakistan
 'Sri Lanka']

Finally, this list contains the country where a sample was actually obtained from.

# country where sample was taken
country_sample = [
 'Barbados',
 'USA', # African American
 'Nigeria', #Nigeria - 1
 'Gambia',
 'Kenya',
 'Sierra Leone',
 'Nigeria', #Nigeria - 2
 'Colombia',
 'Mexico',
 'Peru',
 'Puerto Rico',
 'China',# Chinese Dai in Xishuangbanna
 'China',# Han Chinese - 1; Beijing
 'China',# Han Chinese - 2; Southern China
 'Japan',
 'Vietnam',
 'USA', #European Union
 'Finland',
 'UK',
 'Spain',
 'Italy',
 'Bangladesh',
 'UK',#India Gujarat, sampled in UK
 'UK',#India Telugu, sampled in UK
 'Pakistan',# Punjab
 'UK']

Names of flag files for use as icons.

For African Americans, the Pan-Africa frag is used, though there is likely a better choice.

Unresolved:

  • For European Americans the European Union flag could be used used.
  • Chinese Dai?
  • Gujarat - Use state flag?
  • Telugu - use state flag?

The Chin

# for inserting flags
icon = ['barbados.png',
 'pan_africa_flag.png', # African Union? Pan-Africa
 'nigeria.png', #Nigeria
 'gambia.png',
 'kenya.png',
 'sierra_leone.png',
 'nigeria.png', #Nigeria
 'colombia.png',
 'mexico.png',
 'peru.png',
 'puerto_rico.png',
 'china.png',# Chinese Dai in Xishuangbanna
 'china.png',# Han Chinese in Beijing
 'china.png',# Han Chinese - Southern, China
 'japan.png',
 'vietnam.png',
 'USA.png', #European Union
 'finland.png',
 'UK.png',
 'spain.png',
 'italy.png',
 'bangladesh.png',
 'india.png',#india Gujarat
 'india.png',#India Telugu
 'pakistan.png',#India  Punjab
 'sri_lanka.png']

Finalizing distance matrix

Stored and Remove the meta-data columns.

# make copy
sup_pop_and_pop = dist_mat_pd[["sup_pop","pop"]]


# remove metadata
dist_mat_only = dist_mat_pd.drop(["sup_pop","pop"], axis = 1)

# save new column names
cols_dist_mat_only = dist_mat_only.columns

Set the columns to None. In case numbers are shown on the final table, I want the diagonal to be blank, which can be done in great_tables. This could also be done with numpy.diagonal, but in this case I knew that the diagonal elements are all 0 (or very very close).

#dist_mat_only[dist_mat_only< 0.01] = None

Scaling:

In order to increase the contrast based on color all elements are scaled relative to the maximum value observed

# find min/max for scaling
global_max = dist_mat_only.max().max()
dist_mat_only = (dist_mat_only)/global_max*100

Work with upper triangle

Set off diagonal to 0

dist_mat_tri0 = np.tril(dist_mat_only)

Reformat to dataframe

dist_mat_only2 = pd.DataFrame.from_records(dist_mat_tri0)

Set values to None

dist_mat_only2[dist_mat_only2< 0.001] = None

Add the columns back

dist_mat_only2.columns = cols_dist_mat_only

Looks like numpy un-rounded us

dist_mat_only2 = dist_mat_only2.round()
dist_mat_only2.head()
    ACB   ASW   ESN   GWD  LWK  MSL  YRI  ...  IBS  TSI  BEB  GIH  ITU  PJL  STU
0   NaN   NaN   NaN   NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  18.0   NaN   NaN   NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2  14.0  23.0   NaN   NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3  15.0  25.0  19.0   NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4  16.0  25.0  16.0  21.0  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN

[5 rows x 26 columns]
location_plus_space =[i + j for i, j in zip(geographic_sample,
                                            np.repeat("<br>",dist_mat_pd.shape[0]))]


location_plus_code = [i + j for i, j in zip(location_plus_space,
cols_dist_mat_only)]

Add meta-data

dist_mat_only2["oneKG_pop"] = cols_dist_mat_only
dist_mat_only2["location"] = geographic_sample #
dist_mat_only2["location_combo"] = location_plus_code #
dist_mat_only2["pop_details"] = pop_details
dist_mat_only2["ethnic_cultural_group"] = ethnic_group
dist_mat_only2["sup_pop"] = sup_pop_orig
dist_mat_only2["sup_pop_alt"] = sup_pop_alt
dist_mat_only2["icon"] = icon

Inserting empty rows

I experimented with this but did not use in the final table

filler_row = pd.DataFrame(np.repeat(None,dist_mat_only2.shape[1])).T
filler_row.columns = dist_mat_pd.columns
dist_mat_only3 = pd.concat([dist_mat_only2.iloc[:7], 
              filler_row, 
              dist_mat_only2.iloc[7:11],
              filler_row, 
              dist_mat_only2.iloc[11:15],
              filler_row, 
              dist_mat_only2.iloc[16:20],
              filler_row, 
              dist_mat_only2.iloc[21:]
              ]).reset_index(drop=True)

great_tables groupings etc.

The series below define key groups of rows / columns.

# columns that will be numeric in table
num_cols = ['ACB','ASW', 'ESN', 'GWD', 'LWK', 'MSL', 'YRI',
            'CLM', 'MXL', 'PEL', 'PUR',
            'CDX', 'CHB', 'CHS', 'JPT', 'KHV',
            'CEU','FIN','GBR','IBS','TSI', 
            'BEB', 'GIH', 'ITU', 'PJL', 'STU']

# super population groups
## for creating spanners in table
africa_AFR    = ['ACB','ASW','ESN','GWD','LWK','MSL','YRI']
americas_AMR  = ['CLM','MXL','PEL','PUR']
east_asia_EAS = ['CDX','CHB','CHS','JPT','KHV']
europe_EUR    = ['CEU','FIN','GBR','IBS','TSI']
southasia_SAS = ['BEB','GIH','ITU','PJL','STU']

Preview the full table with bare call to GT()

GT(dist_mat_only2)
ACB ASW ESN GWD LWK MSL YRI CLM MXL PEL PUR CDX CHB CHS JPT KHV CEU FIN GBR IBS TSI BEB GIH ITU PJL STU oneKG_pop location location_combo pop_details ethnic_cultural_group sup_pop sup_pop_alt icon
ACB Barbados Barbados
ACB
African Caribbean in Barbados Afro-Caribbean AFR barbados.png
18.0 ASW Africa Africa
ASW
African Ancestry in SW USA African American AFR pan_africa_flag.png
14.0 23.0 ESN Nigeria Nigeria
ESN
Esan in Nigeria Esan AFR nigeria.png
15.0 25.0 19.0 GWD Gambia Gambia
GWD
Gambian in Western Division – Mandinka Mandinka AFR AFR gambia.png
16.0 25.0 16.0 21.0 LWK Kenya Kenya
LWK
Luhya in Webuye, Kenya Luhya AFR kenya.png
17.0 25.0 18.0 17.0 19.0 MSL Sierra Leone Sierra Leone
MSL
Mende in Sierra Leone Mende AFR sierra_leone.png
16.0 24.0 15.0 19.0 19.0 16.0 YRI Nigeria Nigeria
YRI
Yoruba in Ibadan, Nigeria Yoruba AFR nigeria.png
63.0 53.0 70.0 67.0 69.0 72.0 71.0 CLM Colombia Colombia
CLM
Colombian in Medellín, Colombia Colombian AMR colombia.png
61.0 54.0 68.0 66.0 67.0 70.0 69.0 22.0 MXL Mexico Mexico
MXL
Mexican Ancestry in L.A. CA, USA Mexican-American AMR AMR mexico.png
64.0 58.0 71.0 68.0 69.0 72.0 71.0 28.0 15.0 PEL Peru Peru
PEL
Peruvian in Lima Peru Peruvian AMR peru.png
60.0 49.0 67.0 64.0 67.0 69.0 68.0 14.0 27.0 34.0 PUR P. Rico P. Rico
PUR
Puerto Rican in Puerto Rico Puerto Rican AMR puerto_rico.png
77.0 77.0 81.0 81.0 79.0 83.0 84.0 68.0 52.0 56.0 71.0 CDX SE-Asia SE-Asia
CDX
Chinese Dai in Xishuangbanna, China Dai EAS china.png
77.0 77.0 82.0 81.0 80.0 83.0 84.0 69.0 53.0 57.0 72.0 16.0 CHB China-NE China-NE
CHB
Han Chinese in Beijing, China Han-North EAS EAS china.png
77.0 76.0 82.0 81.0 79.0 83.0 83.0 67.0 51.0 56.0 70.0 11.0 8.0 CHS China-SE China-SE
CHS
Han Chinese - Southern, China Han-South EAS EAS china.png
82.0 83.0 86.0 85.0 84.0 88.0 88.0 75.0 59.0 63.0 79.0 21.0 16.0 16.0 JPT Japan Japan
JPT
Japanese in Tokyo, Japan Japanese EAS japan.png
75.0 74.0 79.0 78.0 77.0 81.0 81.0 66.0 51.0 56.0 69.0 8.0 14.0 10.0 21.0 KHV Vietnam Vietnam
KHV
Kinh in Ho Chi Minh City, Vietnam Kinh EAS vietnam.png
77.0 66.0 83.0 81.0 84.0 86.0 86.0 27.0 43.0 48.0 25.0 87.0 88.0 86.0 95.0 85.0 CEU W. Europe W. Europe
CEU
Utah Residents (CEPH) with Northern and Western European ancestry Euro-American EUR USA.png
82.0 72.0 88.0 85.0 89.0 91.0 91.0 37.0 52.0 54.0 38.0 93.0 95.0 93.0 100.0 92.0 18.0 FIN Finland Finland
FIN
Finnish in Finland Finnish EUR finland.png
76.0 65.0 82.0 79.0 83.0 85.0 84.0 27.0 43.0 46.0 26.0 88.0 89.0 87.0 95.0 86.0 9.0 16.0 GBR UK UK
GBR
British from England and Scotland British / Scottish EUR EUR UK.png
69.0 58.0 76.0 73.0 76.0 78.0 77.0 17.0 31.0 39.0 14.0 75.0 76.0 74.0 83.0 73.0 18.0 33.0 21.0 IBS Spain Spain
IBS
Iberian Populations in Spain Iberian EUR spain.png
69.0 58.0 76.0 74.0 76.0 78.0 78.0 19.0 35.0 40.0 15.0 79.0 79.0 78.0 87.0 77.0 16.0 30.0 19.0 10.0 TSI Italy Italy
TSI
Toscani in Italia Tuscan EUR italy.png
56.0 48.0 61.0 60.0 61.0 62.0 60.0 38.0 36.0 40.0 37.0 68.0 65.0 65.0 72.0 64.0 55.0 66.0 55.0 43.0 45.0 BEB Bangladesh Bangladesh
BEB
Bengali in Bangladesh Bengali SAS bangladesh.png
57.0 49.0 63.0 62.0 62.0 64.0 62.0 39.0 36.0 42.0 36.0 65.0 61.0 62.0 71.0 61.0 55.0 68.0 56.0 42.0 43.0 15.0 GIH Gujarat Gujarat
GIH
Gujarati Indians in Houston, Texas, USA Gujarati SAS india.png
56.0 48.0 61.0 60.0 61.0 61.0 59.0 45.0 43.0 46.0 43.0 72.0 69.0 70.0 77.0 69.0 61.0 73.0 61.0 50.0 51.0 11.0 18.0 ITU Andhra Pradesh Andhra Pradesh
ITU
Indian Telugu in the U.K. Telugu SAS SAS india.png
58.0 48.0 64.0 62.0 63.0 65.0 63.0 26.0 28.0 33.0 25.0 67.0 66.0 65.0 74.0 65.0 44.0 55.0 43.0 32.0 33.0 19.0 20.0 24.0 PJL Punjab Punjab
PJL
Punjabi in Lahore, Pakistan [PJL] Punjabi SAS pakistan.png
57.0 50.0 61.0 61.0 61.0 61.0 59.0 47.0 45.0 48.0 46.0 72.0 68.0 69.0 76.0 69.0 64.0 75.0 64.0 53.0 54.0 12.0 20.0 9.0 27.0 STU Sri Lanka Sri Lanka
STU
Sri Lankan Tamil in the UK Tamil SAS sri_lanka.png

Hidden columns:

  • ‘location’
  • ‘pop_details’
  • ‘sup_pop’ (from sup_pop_orig)
  • ‘sup_pop_alt’

Used:

  • ‘icon’
  • ‘oneKG_pop’
  • ‘ethnic_cultural_group’

Preview the table with all major elements (not rendered)

from great_tables import GT, md, html, style, loc
# make my table
(
    GT(dist_mat_only2).tab_style(
        style=style.text(size = "0"),
        locations=loc.body(columns=num_cols)
    )
    .tab_header(title="Genetic distance between 1000 Genomes populations",
     subtitle="Subtitle goes here"
     )
    .cols_hide(columns=[
      "location", previously "country"
      "pop_details",
      "sup_pop",
      "sup_pop_alt"])
    .fmt_image("icon", path="icons/")
    #.tab_stub(
    #  rowname_col="row", 
    #  groupname_col="sup_pop"
    #)
    .tab_spanner(
        label="Group",
        columns=[
                #"sup_pop", 
                "oneKG_pop",
                "icon",
                "location"
                ]
    )
    .tab_spanner(
        label="Africa",
        columns=africa_AFR
    )
    .tab_spanner(
        label="Americas",
        columns=americas_AMR
    )
    .tab_spanner(
        label="East Asia",
        columns=east_asia_EAS
    )
    .tab_spanner(
        label="Western Europe",
        columns=europe_EUR
    )
    .tab_spanner(
        label="South Asia",
        columns=southasia_SAS
    )
    .tab_spanner(
        label="Genetic Distances",
        columns=num_cols
    )
    .tab_spanner(
        label="Notes",
        columns=[
                 "ethnic_cultural_group"#,
                 #"country_sampling"
                 ]
    )
    #.cols_move_to_start(columns=[#"sup_pop", 
    #                            "location",
    #                            "oneKG_pop",
    #                            "icon"])
    #.cols_label(
    #    #sup_pop = "Super population",
    #    oneKG_pop = "Population Code",
    #    #country_sampling = "Sampling location",
    #    ethnic_cultural_group = "Ethnic / <br> Cultural Group",
    #    icon = "Geographic origin"
    #)
    .cols_align(align="center")
    .fmt_number(columns=num_cols, decimals=0, use_seps=False)
    #.sub_zero(zero_text='nil')
    .data_color(
        domain=[90, 0],
        palette=["white", "orange"],
        na_color="white",
    )
    .tab_source_note(source_note="Add citation")


)

Save the final file

# Save final table
dist_mat_only2.to_csv("dist_mat_final.csv", index= None)