
This README is a description of the files available to download as part of the Neale Lab GWAS of UK Biobank phenotypes.
For a description of the project and details of the analysis, please see http://www.nealelab.is/uk-biobank.
To download GWAS results, see the links in the manifest tab below. Each column header in the manifest has a triangle; click it to reveal search options for that column. Once you have found the phenotype code you are looking for, use the "wget command" column to get the command that downloads the corresponding results file.
The code used to generate the files described here is publicly available: https://github.com/Nealelab/UK_Biobank_GWAS. 
Questions or concerns not addressed by this README, the project website, or the GitHub repository can be directed to nealelab.ukb@gmail.com.


variants.tsv.bgz
================
This file contains annotations on each variant in the GWAS, calculated across the analysis subset of 361,194 samples.

NOTE: The order of variants in this file matches the order of variants in the results files described below. 
      To join these annotations with a results file, either match on the "variant" field or simply paste the 
      columns together (e.g. "paste variants.tsv K50.gwas.imputed_v3.both_sexes.tsv").
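
As a minimal sketch of the note above, the Python below pairs rows by position and checks that the "variant" fields agree, so it covers both the "match" and the "paste" option. It assumes the .bgz files can be read as ordinary gzip (true of block-gzipped files in general); the function names and the "annot_" prefix are illustrative and not part of the released code.

```python
import csv
import gzip


def open_bgz(path):
    """Open a block-gzipped TSV as text; bgzip output is valid gzip."""
    return gzip.open(path, "rt", encoding="utf-8", newline="")


def join_in_order(annotation_rows, result_rows):
    """Pair rows by position and verify that the "variant" fields agree.

    Columns present in both files (e.g. minor_AF, AC) keep the results
    value under the original name; the annotation value is stored under
    an "annot_" prefix (a naming choice made for this sketch).
    """
    for annot, result in zip(annotation_rows, result_rows):
        if annot["variant"] != result["variant"]:
            raise ValueError(
                "row order differs: %s vs %s" % (annot["variant"], result["variant"])
            )
        merged = dict(result)
        for key, value in annot.items():
            if key == "variant":
                continue
            merged["annot_" + key if key in result else key] = value
        yield merged


def join_files(variants_path, results_path):
    """Stream the joined rows of two .tsv.bgz files without loading them fully."""
    with open_bgz(variants_path) as v, open_bgz(results_path) as r:
        yield from join_in_order(
            csv.DictReader(v, delimiter="\t"), csv.DictReader(r, delimiter="\t")
        )
```

Note that zip() stops at the shorter input, so a truncated download would go unnoticed here; comparing row counts separately guards against that.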

variant                     string      Variant identifier in the form "chr:pos:ref:alt", where "ref" is aligned to the forward strand
                                        of GRCh37 and "alt" is the effect allele (use this to join with results files).
chr                         string      Chromosome of the variant.
pos                         int         Position of the variant in GRCh37 coordinates.
ref                         string      Reference allele on the forward strand.
alt                         string      Alternate allele (not necessarily minor allele).
rsid                        string      rsid (not guaranteed to be unique).
varid                       string      Unique variant identifier included in imputed BGEN files.
consequence                 string      Consequence annotated using VEP version 85.
consequence_category        string      Category of VEP-annotated consequence ("ptv", "missense", "synonymous", "non_coding").
info                        float       Imputation INFO score as provided by UK Biobank.
call_rate                   float       Call rate (calculated using hardcall genotypes).
AC                          int         Allele count (calculated using hardcall genotypes).
AF                          float       Allele frequency (calculated using hardcall genotypes).
minor_allele                string      Minor allele (equal to ref allele when AF > 0.5, otherwise equal to alt allele).
minor_AF                    float       Minor allele frequency (calculated using hardcall genotypes).
p_hwe                       float       Hardy-Weinberg p-value.
n_called                    int         Number of samples with defined genotype at this variant.
n_not_called                int         Number of samples without a defined genotype at this variant.
n_hom_ref                   int         Number of samples with homozygous reference genotype at this variant.
n_het                       int         Number of samples with heterozygous genotype at this variant.
n_hom_var                   int         Number of samples with homozygous alternate genotype at this variant.
n_non_ref                   int         Number of samples with a genotype other than homozygous reference at this variant (n_het + n_hom_var).
r_heterozygosity            float       Proportion of samples with heterozygous genotype at this variant.
r_het_hom_var               float       Ratio of samples with heterozygous genotype to samples with homozygous alternate genotype at this variant.
r_expected_het_frequency    float       Expected r_heterozygosity based on Hardy-Weinberg equilibrium.
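
The relationships among several of the fields above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used to produce the file: it assumes diploid hardcall genotypes and uses n_called as the denominator for AF and r_heterozygosity, which the descriptions above imply but do not state. The function names are hypothetical.

```python
def parse_variant(variant):
    """Split a "chr:pos:ref:alt" identifier into its four parts."""
    chrom, pos, ref, alt = variant.split(":")
    return chrom, int(pos), ref, alt


def summarize_counts(ref, alt, n_hom_ref, n_het, n_hom_var):
    """Derive summary fields from genotype counts (diploid calls assumed)."""
    n_called = n_hom_ref + n_het + n_hom_var
    ac = n_het + 2 * n_hom_var          # copies of the alt allele
    af = ac / (2 * n_called)            # alt allele frequency
    return {
        "n_called": n_called,
        "AC": ac,
        "AF": af,
        # ref is the minor allele when AF > 0.5, otherwise alt is
        "minor_allele": ref if af > 0.5 else alt,
        "minor_AF": min(af, 1 - af),
        "n_non_ref": n_het + n_hom_var,
        "r_heterozygosity": n_het / n_called,
        # undefined when there are no homozygous alternate samples
        "r_het_hom_var": n_het / n_hom_var if n_hom_var else None,
    }
```

For example, summarize_counts("C", "T", 50, 40, 10) gives AC = 60 and AF = 0.3, so "T" is the minor allele.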


samples.{both_sexes,female,male}.tsv.bgz
========================================
These files contain the plate name and well of each sample included in our GWAS. You can use these to subset the generic sample QC file
provided by UK Biobank, and match the samples to the IDs specific to your application.

plate_name    string    Plate on which the sample was processed.
well          string    Well in which the sample was processed.
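
A minimal sketch of the subsetting step described above, in Python. The column names of the UK Biobank sample QC file are not given in this README, so plate_field and well_field are hypothetical parameters to be set to whatever your copy of that file uses; rows are assumed to be dicts, as produced by csv.DictReader.

```python
def gwas_sample_keys(sample_rows):
    """Collect the (plate_name, well) pairs listed in a samples file."""
    return {(row["plate_name"], row["well"]) for row in sample_rows}


def subset_sample_qc(qc_rows, sample_rows,
                     plate_field="plate_name", well_field="well"):
    """Keep the sample QC rows whose plate/well pair is in the GWAS sample list.

    plate_field and well_field name the matching columns in the QC file;
    the defaults are placeholders, not documented UK Biobank column names.
    """
    keep = gwas_sample_keys(sample_rows)
    return [row for row in qc_rows
            if (row[plate_field], row[well_field]) in keep]
```

Because the returned rows retain every column of the QC file, any application-specific ID carried alongside them is preserved for the retained samples.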

european_samples.tsv.bgz
========================
This file contains the plate name and well of each sample remaining in the analysis after removing ancestry outlier samples based on
a custom principal components analysis.

plate_name    string    Plate on which the sample was processed.
well          string    Well in which the sample was processed.


phenotypes.{both_sexes,female,male}.tsv.bgz
===========================================
These files contain a description and summary of each phenotype included in the analysis.

phenotype               string      Unique phenotype identifier. Format differs depending on the source of the phenotype.
description             string      Free text description of the phenotype.
variable_type           string      {"categorical", "ordinal", "continuous_irnt", "continuous_raw"} Variable type. Each continuous variable
                                    has two versions: an untransformed version ("continuous_raw") and a version where values have been inverse
                                    rank normalized ("continuous_irnt").
source                  string      {"icd10", "finngen", "phesant"} Source of the phenotype. See notes below.
n_non_missing           int         Number of samples within the analysis subset defined for this phenotype.
n_missing               int         Number of samples within the analysis subset not defined for this phenotype.
n_controls              int         For case/control phenotypes, number of control samples within the analysis subset.
n_cases                 int         For case/control phenotypes, number of case samples within the analysis subset.
PHESANT_transformation  string      This field describes the transformations performed by PHESANT for the applicable phenotypes.
notes                   string      Any additional notes. 

Analysis subset sizes:
- both_sexes:   361,194 samples
- female:       194,174 samples
- male:         167,020 samples
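
Following the definitions of n_non_missing and n_missing above, the two counts for a phenotype should add up to the size of the analysis subset. The Python below is a small sanity check built on that reading; the function name is illustrative, and rows are assumed to be dicts with string values, as produced by csv.DictReader.

```python
# Analysis subset sizes as listed in this README.
SUBSET_SIZES = {"both_sexes": 361194, "female": 194174, "male": 167020}


def counts_are_consistent(row, sex):
    """Return True if n_non_missing + n_missing equals the subset size."""
    total = int(row["n_non_missing"]) + int(row["n_missing"])
    return total == SUBSET_SIZES[sex]
```

The female and male subset sizes themselves sum to the both_sexes size (194,174 + 167,020 = 361,194).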

Phenotype sources:
- icd10:    These phenotypes were generated from UK Biobank fields 41202-0.0 through 41202-0.379. For each sample, the set of ICD10
            codes (truncated to the first three characters, e.g. "K50") included in these fields was collected. Each ICD10 phenotype
            is a boolean indicating whether the corresponding code is included in that set for the sample.
- finngen:  These phenotypes were manually curated by collaborators in the FinnGen research project. Many are combinations of different ICD10 codes.
- phesant:  These phenotypes were automatically processed using a modified version of the software PHESANT (https://www.ncbi.nlm.nih.gov/pubmed/29040602).
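
The icd10 construction described above can be sketched in Python as follows. This is an illustration of the stated rule (truncate to three characters, then test set membership), not the project's actual code; the function names are hypothetical, and empty or missing field values are assumed to be skipped.

```python
def icd10_code_set(raw_codes):
    """Truncate each recorded ICD10 code to its first three characters."""
    return {code[:3] for code in raw_codes if code}


def icd10_phenotype(raw_codes, phenotype_code):
    """Boolean phenotype: is the three-character code in the sample's set?"""
    return phenotype_code in icd10_code_set(raw_codes)
```

For example, a sample with recorded codes "K509" and "I10" is a case for "K50" and "I10" and a control for every other code.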


<phenotype_code>.gwas.imputed_v3.{both_sexes,female,male}.tsv.bgz
=================================================================
These are the GWAS results files (e.g., "K50.gwas.imputed_v3.both_sexes.tsv.bgz").
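
As a small illustration of the naming pattern in the heading above, the Python below builds the results filename for a phenotype code and sex. The helper name is hypothetical; the download location itself comes from the "wget command" column of the manifest and is not constructed here.

```python
SEXES = ("both_sexes", "female", "male")


def results_filename(phenotype_code, sex):
    """Build "<phenotype_code>.gwas.imputed_v3.<sex>.tsv.bgz"."""
    if sex not in SEXES:
        raise ValueError("sex must be one of %s" % ", ".join(SEXES))
    return "%s.gwas.imputed_v3.%s.tsv.bgz" % (phenotype_code, sex)
```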

variant                           string      Variant identifier in the form "chr:pos:ref:alt", where "ref" is aligned to the forward strand
                                              of GRCh37 and "alt" is the effect allele (use this to join with variant annotation file).
minor_allele                      string      The minor allele (alt allele is not always minor).
minor_AF                          float       Frequency of the minor allele in the n_complete_samples defined for this phenotype.
expected_case_minor_AC            float       (Optional) For case/control phenotypes, calculated as (2 * minor_AF * n_cases).
expected_min_category_minor_AC    float       (Optional) For categorical phenotypes with fewer than 5 categories,
                                              calculated as (2 * minor_AF * number of samples in smallest category).
low_confidence_variant            boolean     Flag indicating low confidence results based on the following heuristics:
                                              Case/control phenotypes: expected_case_minor_AC < 25 or minor_AF < 0.001.
                                              Categorical phenotypes with fewer than 5 categories: expected_min_category_minor_AC < 25 or minor_AF < 0.001.
                                              Quantitative phenotypes: minor_AF < 0.001.
n_complete_samples                int         Number of samples defined for this phenotype.
AC                                float       Allele count of alt allele calculated on dosages within n_complete_samples.
ytx                               float       Dot product of phenotype vector y and genotype vector x (alt allele count in cases for case/control phenotypes).
beta                              float       Estimated effect size of alt allele.
se                                float       Estimated standard error of beta.
tstat                             float       t-statistic of beta estimate (= beta/se).
pval                              float       p-value of beta significance test.
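
The heuristics behind expected_case_minor_AC, expected_min_category_minor_AC and low_confidence_variant, together with the tstat definition, can be sketched in Python as below. The thresholds (25 and 0.001) and formulas are taken from the field descriptions above; the function names and argument layout are illustrative only.

```python
def expected_minor_ac(minor_af, n_samples):
    """Expected minor allele count among n_samples: 2 * minor_AF * n."""
    return 2 * minor_af * n_samples


def low_confidence(minor_af, n_cases=None, n_min_category=None):
    """Apply the low-confidence heuristics described above.

    n_cases:        number of cases, for case/control phenotypes.
    n_min_category: size of the smallest category, for categorical
                    phenotypes with fewer than 5 categories.
    Leave both as None for quantitative phenotypes.
    """
    if minor_af < 0.001:
        return True
    for n in (n_cases, n_min_category):
        if n is not None and expected_minor_ac(minor_af, n) < 25:
            return True
    return False


def tstat(beta, se):
    """t-statistic of the beta estimate, defined above as beta / se."""
    return beta / se
```

For example, a variant with minor_AF = 0.01 in a phenotype with 1,000 cases has an expected case minor allele count of 20, so it would be flagged; with 2,000 cases the expected count is 40 and it would not be.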
