Interactions between Polygenic Scores and Environments: Methodological and Conceptual Challenges

Interest in the study of gene–environment interaction has recently grown due to the sudden availability of molecular genetic data—in particular, polygenic scores—in many long-running longitudinal studies. Identifying and estimating statistical interactions comes with several analytic and inferential challenges; these challenges are heightened when used to integrate observational genomic and social science data. We articulate some of these key challenges, provide new perspectives on the study of gene–environment interactions, and end by offering some practical guidance for conducting research in this area. Given the sudden availability of well-powered polygenic scores, we anticipate a substantial increase in research testing for interaction between such scores and environments. The issues we discuss, if not properly addressed, may impact the enduring scientific value of gene–environment interaction studies.

of many variants across the genome and aim to index an individual's genetic liability for a given trait. PGSs constructed from large GWAS are robustly predictive of a sizable proportion of variance in consequential outcomes, such as educational attainment and lifespan (Cesarini and Visscher 2017; Lee et al. 2018; Sugrue and Desikan 2019). In fact, many PGSs are predictive of important biobehavioral and social science outcomes that were not the target of the original GWAS. Although PGSs are neither pure measures (they may capture, e.g., correlated nongenetic factors [Morris et al. 2020]) nor universal ones (i.e., they may not generalize to environmental contexts not captured in the original GWAS from which they were constructed [Mostafavi et al. 2020]), they have still sparked substantial interest. Many have argued that PGSs may advance our understanding of the behavioral and biomedical sciences (Belsky and Harden 2019; Conley and Fletcher 2017; Dudbridge 2016; Harden and Koellinger 2020). Sociologists, in particular, have begun to offer frameworks for thinking about how the discipline may benefit from such work (Freese 2018; Mills and Tropf 2020).
The increasing adoption of genetic approaches in social and behavioral science research has not diminished interest in the environment. Indeed, how social and environmental factors combine and interact with biological factors to produce individual differences is a question at the forefront of many research agendas in the social and behavioral sciences. Researchers have long posited that genetic effects likely vary as a function of environment (Feldman and Lewontin 1975). For example, in the twin study literature, there has been substantial interest in whether decompositions of observed variation in a phenotype into genetic and environmental components differ by socioeconomic context or age (Purcell 2002). Other research designs have tested interactions between measured genotypes (i.e., individual genetic variants) and environmental features. Although such an approach has intuitive appeal, it has proven technically challenging to implement (Duncan and Keller 2011). Some of the challenges of this research agenda may be attributable to unrealistically large expectations for effect sizes of individual variants and thus circumvented through use of PGSs. Yet, even when gene-environment interaction (GxE) results are robust and replicable, the interpretational and practical implications of such research can be unclear.
Polygenic scores are rapidly becoming widely available. Data sets such as the Health and Retirement Study (HRS;Ware et al. 2017), Add Health (Braudt and Harris 2018), and the Wisconsin Longitudinal Study (Okbay, Benjamin, and Visscher 2018) are posting preconstructed scores for use by researchers, and catalogs of polygenic scores are being made available (Lambert et al. 2020). This novel data resource may offer new and more robust avenues for exploration of GxE. However, challenges remain. Given the emergence of this new tool, we aim to provide timely guidance on how to conduct high-quality GxE research using PGSs. In this article, we have two main objectives. First, we outline several concerns associated with performing GxE research that future work may benefit from considering. Second, we offer some guidelines for designing, implementing, and interpreting high-quality GxE research using PGSs.

The Standard GxE Model
We consider some outcome, ϒ, to be a function of an individual's genotype, G, and some (potentially continuously varying) environmental exposure, E. We generically describe this data-generating model as
ϒ ∼ f(G, E). (1)
Equation (1) accommodates both complex interplay between genotype and environment as well as outcomes that are not normally distributed (e.g., ϒ may have a Bernoulli distribution). We supplement this simple model with a few crucial assumptions. We assume that we have reasonable proxies available for G and E and some identifiable approximation to f(). We comment on each of these assumptions below.
With respect to G, we assume that we can characterize genetic influence on the trait as a PGS, PGS_i = Σ_j β_j N_ij, that is, a sum wherein the number (N) of alleles (0, 1, or 2) that an individual i has for each single nucleotide polymorphism (SNP) j is weighted by the effect, β_j, identified via GWAS. We note a few assumptions implicit in the above. We are focusing on traits that have a genetic architecture appropriately characterized by effects that are additive with respect to one another (although they may be nonadditive in terms of their potential to interact with environmental contexts) and dispersed over many loci. We view the assumption of additivity as an acceptable simplification given both the success of additive GWAS and the relative lack of strong empirical support for dominance or epistasis (i.e., gene-gene interaction) models (Polderman et al. 2015).
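The weighted sum described above can be sketched in a few lines. This is only an illustration of the arithmetic: the function name, the three SNP weights, and the two genotype vectors are hypothetical values chosen for the example, not quantities from any GWAS.

```python
def polygenic_score(allele_counts, weights):
    """Return sum_j beta_j * N_ij for one individual.

    allele_counts: allele counts N_ij (each 0, 1, or 2), one per SNP.
    weights: GWAS effect estimates beta_j, in the same SNP order.
    """
    if len(allele_counts) != len(weights):
        raise ValueError("allele_counts and weights must cover the same SNPs")
    if any(n not in (0, 1, 2) for n in allele_counts):
        raise ValueError("allele counts must be 0, 1, or 2")
    return sum(b * n for b, n in zip(weights, allele_counts))


# Hypothetical weights for three SNPs (illustrative only).
gwas_weights = [0.5, -0.25, 0.125]

# Hypothetical genotypes for two individuals.
genotypes = {"person_a": [0, 1, 2], "person_b": [2, 2, 0]}

scores = {pid: polygenic_score(g, gwas_weights) for pid, g in genotypes.items()}
```

In applied work, scores are typically standardized within the analytic sample before entering a regression; that step is omitted here.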
The assumption that genetic effects spread over many loci is not especially restrictive. Empirical work has indicated that many traits of interest in population health (e.g., body mass index, cardiovascular disease) are highly polygenic. Not all traits necessarily have this characteristic; consider, for example, monogenic diseases such as cystic fibrosis. However, as the sample size of the GWAS used to generate a PGS increases, weights (i.e., β_j) for SNPs that are not relevant to the phenotype of interest will go toward zero; thus, a summative approach can still potentially be used in such cases. Moreover, much of our discussion still applies when using genetic predictors constructed from a smaller number of variants or even using a single variant allele count (e.g., Boardman et al. 2012; Rosenquist et al. 2015).
We also note that GWAS results (i.e., β_j) are themselves potentially a function of both trait-specific biology and contextual features of the data used to derive them: for example, the social and policy landscape governing behavior of participants in the GWAS, selection issues associated with being part of a GWAS sample, et cetera (Mostafavi et al. 2020; Pirastu et al. 2020). PGSs index the genetic propensity within the environmental context and demographic characteristics of participants in the original discovery GWAS on which the PGS is constructed. An interaction between PGS and environment may then indicate that the influence of genetic factors on the outcome is larger in some environments than others, that the sample in one environment is more similar to the sample from the discovery GWAS than in others, or some mixture of the two. This ambiguity is important to keep in mind when findings from polygenic score research are interpreted. However, we focus the current article on inferential and statistical issues pertaining to the samples in which the PGSs are constructed and analyzed (i.e., we do not focus on the potential mismatch between that sample and the GWAS discovery sample).
With respect to E, we assume that researchers use specific measures of the environment, which we denote ENV. At present, research typically focuses on variation in measured environments that have relatively large main effects on ϒ. We consider this topic in detail later. In general, we emphasize that there are numerous challenges associated with identification of the appropriate ENV in GxE research (Boardman, Daw, and Freese 2013). The identification of appropriate ENV measures merits additional scrutiny in future work. Following selection of a candidate ENV, more questions follow. Are we measuring the environmental characteristic at the appropriate level (e.g., household vs. neighborhood vs. community)? Are we measuring a salient exposure given the respondents' ages? Can we measure the environmental exposure of interest with high fidelity? Are the exposures and contexts of interest correlated with other, unmeasured, environmental or genetic variables that are themselves the driving forces in the identified GxE?
Finally, we assume that the unknown function f() is well approximated by a relatively simple model. In particular, many GxE studies aim to shed light on Equation (1) using regression models of the form
E(ϒ | …) = b0 + b1 ⋅ PGS + b2 ⋅ ENV + b3 ⋅ PGS ⋅ ENV + covariates. (2)
The aim is to have Equation (2) elucidate key properties of the (unknown) data-generating process, even if Equation (2) is only a rough approximation of Equation (1). There are several concerns that apply to such regression models. We first review two important issues that have been the subject of previous scrutiny; in the next section, we consider several novel issues of specific relevance when conducting GxE studies.
First, environmental exposures are typically partly endogenous (Jaffee and Price 2007), creating complex patterns of correlations between genes, focal environments, and other relevant exposures that lead to inferential challenges for the identification of GxE. We do not provide an in-depth treatment of this issue here as it has been discussed in depth elsewhere (Briley et al. 2019; Dudbridge and Fletcher 2014; Fletcher and Conley 2013). The question of endogeneity is closely related to the question of whether the measured environment that statistically moderates PGS effects has a causal effect. This is a crucial question; whether the effect is causal has direct implications for whether direct manipulation of that environment will produce changes in the genotype-phenotype association.
Second, misspecification bias is a generic problem that introduces additional complexities in the case of interaction research. For example, care must be taken to distinguish between models containing interactions between two variables versus those with no interactions but nonlinear (e.g., quadratic) terms in one or both of the two variables (Lubinski and Humphreys 1990; MacCallum and Mar 1995). In particular, GxE research must also attend to the issue raised by Keller (2014) focusing on the covariates included in Equation (2). When covariates are included in Equation (2), specification error may result if additional interaction terms between the covariates and both E and the PGS are not included. This is because the main effects of the covariates are insufficient controls when a covariate covaries with either the genotype or the environment. Fortunately, there is a straightforward solution. Researchers simply need to include the full suite of interaction terms between the covariates and both the PGS and ENV when estimating Equation (2).
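A small simulation can make the covariate issue concrete. The sketch below is a toy example under assumptions we chose for illustration: a single covariate that is correlated with ENV and that itself moderates the PGS effect, a true b3 of zero, and arbitrary coefficient values. It compares Equation (2) estimated with the covariate's main effect only against the same model with covariate-by-PGS and covariate-by-ENV terms added.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: the covariate is correlated with ENV (r = 0.7).
pgs = rng.standard_normal(n)
cov = rng.standard_normal(n)
env = 0.7 * cov + np.sqrt(1 - 0.7**2) * rng.standard_normal(n)

# True model has NO PGS x ENV interaction, but the covariate moderates the PGS.
y = 0.3 * pgs + 0.3 * env + 0.3 * cov + 0.3 * cov * pgs + rng.standard_normal(n)


def ols(columns, outcome):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(outcome))] + columns)
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef


# Covariate entered as a main effect only.
naive = ols([pgs, env, pgs * env, cov], y)

# Covariate also interacted with both the PGS and ENV.
full = ols([pgs, env, pgs * env, cov, cov * pgs, cov * env], y)

# Index 3 is the PGS x ENV coefficient (b3) in both fits.
b3_naive, b3_full = naive[3], full[3]
print(f"b3 with covariate main effect only: {b3_naive:.3f}")
print(f"b3 with covariate interactions:     {b3_full:.3f}")
```

In this setup the main-effect-only model attributes part of the covariate-by-PGS effect to the PGS-by-ENV term, whereas the fuller specification returns an estimate near the true value of zero.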

Study Design Issues in GxE Research
The Environmental Exposure
The problem.-A great deal of research in the social sciences focuses exclusively on the effects of environments. For example, there is substantial interest in the effects of poverty, reflected primarily in the home environment of a young child, on the developing brain and related cognitive functioning (Duncan and Magnuson 2012; Johnson, Riis, and Noble 2016). GxE research has tended to emphasize environmental variables, like poverty, for which large main effects have been well documented (Barr et al. 2018; Gould et al. 2018; Musci et al. 2019). However, the environmental features having large main effects need not also be the features that lead to nuanced GxE effects. GxE research may benefit from additional attention to the theorized nature of the candidate environmental variables deployed in GxE research.
To better frame our argument, we consider two stylized patterns of GxE interaction. We emphasize here that these two patterns are not an exhaustive taxonomy of GxE interaction. Rather, they serve as illustrations of the considerations that we encourage. First, consider GxE interactions in which the environment functions as a "dimmer" on genetic effects. Dimmers, as in switches responsible for dimming or brightening lights, may magnify or constrict genetic effects on an outcome without changing their sign. Investigating dimmer-type GxE may be of high substantive interest in many contexts. For instance, it is of strong practical and theoretical importance to determine whether an educational policy with a robust positive average effect for the population disproportionately benefits children at highest genetic risk or those at lowest genetic risk or has a uniform effect across the spectrum of genotypes.
However, as we discuss at greater length in the subsection on coarsened outcome variables, it is also important to be vigilant about the potential for GxE to arise as an artifact of more general effects on the distribution of the observed outcome itself. For instance, suppose the educational policy of interest is associated with an appreciable increase in both the mean and variation of math achievement in the student population. It is then possible that the intervention has increased the effect of the PGS on math achievement (i.e., a positive b3 estimate in Eq. [2]) simply as a byproduct of more general increases in variation in math achievement. Because conventional ordinary least squares methods are blind to this type of heteroscedasticity, the concomitant increase in non-PGS variance may go overlooked.
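The scaling artifact described here can be illustrated with a toy simulation. The setup is entirely hypothetical: a policy that shifts the mean of math achievement and multiplies its overall spread by 1.5, with the PGS contributing the same share of variance in both conditions. The function name and all numeric values are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000  # hypothetical number of students per policy condition


def simulate(shift, scale):
    """Simulate achievement and return (PGS slope, share of variance explained)."""
    pgs = rng.standard_normal(n)
    # The policy rescales the WHOLE outcome, PGS and non-PGS parts alike.
    achievement = shift + scale * (0.3 * pgs + rng.standard_normal(n))
    slope = np.cov(pgs, achievement)[0, 1] / np.var(pgs, ddof=1)
    r_squared = np.corrcoef(pgs, achievement)[0, 1] ** 2
    return slope, r_squared


slope_control, r2_control = simulate(shift=0.0, scale=1.0)
slope_policy, r2_policy = simulate(shift=0.5, scale=1.5)

print(f"PGS slope: control {slope_control:.3f}, policy {slope_policy:.3f}")
print(f"R-squared: control {r2_control:.3f}, policy {r2_policy:.3f}")
```

The raw PGS slope is larger under the policy, which would appear as a positive b3 in Equation (2), yet the proportion of variance associated with the PGS is essentially unchanged. Reporting both quantities is one way to flag this pattern.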
Second, consider GxE interactions in which the environment functions as an image-inverting "lens" on genetic effects. An environment acts as such a lens when the direction of the effect of the PGS differs across the range of that environment. We refer to these environments as lenses based on the optical notion of a lens; in particular, certain glass lenses invert the orientation of objects. When considering lenses, the relative effect of a given genotype may be positive for a "low" level of the relevant environmental exposure and negative for "high" levels of the exposure, or vice versa. This has led to the hypothesis that what qualifies as a high- or low-risk genotype may depend upon the environmental context (Belsky and Pluess 2009; Ellis et al. 2011; Obradović and Boyce 2009). Note that an environment may function as a lens even when it has a limited main effect.
Researchers frequently conceptualize environments as lenses when motivating GxE research theoretically, and yet, in practice, many of the environmental measures typically used in GxE studies may be conceptually closer to the dimmer category. Moreover, the selection of PGS effects for examining lens-type GxE may be particularly challenging given that we construct PGSs from GWASs that only include main effects of SNPs (although this is perhaps changing in ways we expand upon below). This may sharply limit the ability to identify lens-type GxE with polygenic scores. A related issue is that, if the environmental context of the participants in the GWAS sample used to construct the PGS is similar to that in the test sample used to estimate GxE, then the PGS is unlikely to give weight to SNPs that demonstrate lens-type patterns, as the main effects of these SNPs will be close to zero.
We can also understand the difference between a dimmer and a lens in terms of their effect on the rank ordering of outcomes. All else being equal, a dimmer is order-preserving; that is, it preserves the order of the genotypes at different levels of the environment. Variation in the dimmer serves to vary the distance, in the outcome metric, between different levels of the PGS but never changes the rank orderings of the levels. In contrast, a lens reverses the order of genotypes; a PGS that predicts an outcome near the top of the distribution at one level of environmental exposure will predict an outcome near the bottom at another level of environmental exposure. Our dimmer/lens typology is similar in many respects to the ordinal/disordinal typology previously suggested (Widaman et al. 2012) but may be a useful conceptual distinction as GxE research becomes more common in the social sciences.
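In terms of Equation (2), the slope of the outcome on the PGS at a given level of the environment is b1 + b3 ⋅ ENV, so the dimmer/lens distinction amounts to asking whether this quantity changes sign within the observed range of ENV. The helper below is a hypothetical sketch of that check using point estimates only; it ignores sampling uncertainty and assumes the linear form of Equation (2) holds across the range.

```python
def classify_moderation(b1, b3, env_min, env_max):
    """Label a fitted Equation (2) as 'dimmer' or 'lens' over [env_min, env_max].

    The PGS slope at environment level e is b1 + b3 * e. A sign change
    inside the observed range indicates a lens; otherwise the environment
    only rescales the PGS effect (a dimmer).
    """
    if b3 == 0:
        return "no interaction"
    slope_low = b1 + b3 * env_min
    slope_high = b1 + b3 * env_max
    return "lens" if slope_low * slope_high < 0 else "dimmer"


# Hypothetical estimates, with ENV standardized and observed from -2 to 2.
print(classify_moderation(b1=0.4, b3=0.1, env_min=-2, env_max=2))  # slopes 0.2 and 0.6
print(classify_moderation(b1=0.1, b3=0.4, env_min=-2, env_max=2))  # slopes -0.7 and 0.9
```

The crossover point, where the PGS slope equals zero, is ENV = -b1 / b3. When that point lies outside the range of environments actually observed, a nonzero b3 is consistent with a dimmer rather than a lens.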
Recommendations.-Conceptually, researchers will benefit from being attentive as to the form of GxE they expect; for example, is the candidate environment expected to operate as a lens, as a dimmer, or according to some more complex functional form? In our experience, GxE researchers will tend to observe that environments with large main effects on a phenotype act as dimmers. Such environments will moderate the magnitude of the effect of the polygenic score on the outcome without changing its sign. Although these observations may be of value, they need to be distinguished from the more dramatic patterns of sign reversal of PGS effects in different environments that have received a great deal of conceptual attention. In studies seeking to identify lens-type patterns (Troth et al. 2018), both the genetic and the environmental components are of crucial importance for testing hypotheses in which the environmental context determines whether a given genotype is risky or advantageous.
Analytically, we offer several suggestions that might be of interest in future work. We first emphasize the potential for analyses that take advantage of environmental variation without identifying a specific environmental feature of interest. In situations wherein individuals cluster into some unit, researchers may first want to consider the level of empirical support for GxE based on relatively omnibus measures of the environment. For example, one might test for variation observed in the relationship between phenotype and polygenic score across environmental units (e.g., schools or census tracts); see Trejo et al. (2018) for one such example. Such analyses are informative in that they offer preliminary guidance on whether specific features of the environments deserve additional scrutiny as possible GxE targets.
Yet another approach that researchers might want to consider involves analyses of heritability and genetic correlation as captured by genomic techniques (Grotzinger et al. 2019; Yang et al. 2011). Polygenic scores collapse information from across the genome into a composite designed to predict a specific outcome in a novel data set. Because PGSs are constructed using a large number of GWAS regression weights that themselves are estimated with error, PGS prediction is biased downward in novel samples (and biased upward in the original GWAS samples). In contrast, genomic heritability and genetic correlation estimates are constructed using methods related to mixed effects modeling and are unbiased by measurement error. Such analyses can be used to, for example, study changing patterns of heritability (Tropf et al. 2017) across environments. Although analyses of heritability and genetic correlation do not provide scores for individual participants (because they estimate random effects to represent population variation rather than individual estimates), they can still offer information about the way that genotypes are related to phenotypes.
We also note the increase in methodologies focused on identifying genetic variants that are associated with the amount of variation in the outcome rather than strictly the level (Wang et al. 2019; Yang et al. 2012; Young, Wauthier, and Donnelly 2018). Such approaches are generating data that may be useful in future GxE work. A natural question to ask of the genetic variants identified in such studies is whether environments interact with such variants to further modulate variation in the outcome. Although such approaches will presumably also involve novel methodological challenges, they are an exciting new resource that could be used to study gene-environment interplay.

Coarsened Outcome Variables
The problem.-Characteristics of the distribution of ϒ may have crucial implications for conducting GxE studies. When ϒ is a discrete outcome coarsened from an underlying continuous variable, researchers encounter an opportunity to misinterpret affirmative findings of GxE. For simplicity of exposition, we focus on the simplest case where ϒ is dichotomous (though the phenomenon extends to coarsened variables that take more than two values). Suppose a dichotomous outcome ϒ is a coarsened version of some continuously varying latent indicator ϒ* (so ϒ = 1 if ϒ* > λ for some scalar λ and 0 otherwise). For example, ϒ might be obesity or college completion (in which case ϒ* would be body mass index or years of schooling, respectively). Suppose we estimate Equation (2) with ordinary least squares using ϒ instead of ϒ* and obtain a nonzero and statistically significant b3. How should we interpret such a finding? One possibility is that a finding of GxE suggests differences in the slope of association between G and ϒ*. This, we argue, is what researchers generally have in mind when conducting studies testing for GxE. However, a second possibility is that a purely environmental shock may shift the intercept of the association line between G and ϒ*, thus resulting in a GxE finding (i.e., a nonzero and statistically significant b3) with a different interpretation.
We illustrate the basic problem in Figure 1. When we examine relationships between PGS and outcome in the context of the continuously measured version (ϒ* in Figure 1A), we observe a constant linear association with genotype across two environments. However, when we observe a dichotomized version of the outcome (ϒ in Figure 1B), we have a relationship that is more challenging to interpret. In particular, Figure 1B suggests GxE when a linear probability model is used (i.e., the dotted curves are not parallel). In contrast, when a logistic regression model is used, we obtain unbiased estimates of GxE (i.e., b3) but they may suffer from low power (and large confidence intervals) due to low variability in the dichotomized outcome at some regions of the environmental measure. This problem may be even more severe when gene-environment correlation results in a large shift in the distributions of the PGS along the range of the environmental measure.
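The pattern just described can be reproduced with a short simulation. This is a toy sketch rather than the procedure used to produce Figure 1: the binary environment, the coefficient values, the threshold, and the normally distributed error are all assumptions made for illustration. The latent outcome has identical PGS slopes in both environments, differing only in intercept.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40_000

pgs = rng.standard_normal(n)
env = rng.integers(0, 2, size=n)  # hypothetical binary environment (0 or 1)

# Latent outcome: parallel PGS slopes, intercept shifted by the environment.
y_star = 0.5 * pgs + 1.0 * env + rng.standard_normal(n)

# Coarsened outcome: 1 if the latent value exceeds the threshold (lambda).
threshold = 1.5
y = (y_star > threshold).astype(float)

# Equation (2) without covariates, fit by ordinary least squares.
X = np.column_stack([np.ones(n), pgs, env, pgs * env])
b_latent, *_ = np.linalg.lstsq(X, y_star, rcond=None)
b_binary, *_ = np.linalg.lstsq(X, y, rcond=None)  # linear probability model

b3_latent, b3_binary = b_latent[3], b_binary[3]
print(f"b3 using the latent outcome:       {b3_latent:.3f}")
print(f"b3 using the dichotomized outcome: {b3_binary:.3f}")
```

The interaction coefficient is near zero for the latent outcome but clearly positive in the linear probability model, even though the environment changed only the intercept of the latent association.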
Findings such as those in Figure 1B are worth noting, and they may be highly relevant in cases where the continuous ϒ* is of less interest than the dichotomized ϒ (e.g., college completion may well matter more than years of schooling) or when ϒ* is latent. However, we also need not confuse matters by misunderstanding the nature of the associations in question. If findings are driven by differences in intercepts and relatively consistent slopes, as in Figure 1A, this is important information to report. We expect that GxE research will benefit from distinguishing between these two possibilities; see also our discussion of this issue in an empirical context elsewhere (Trejo et al. 2018).
Recommendations.-When research uses coarsened outcome variables due to substantive interest in the coarsened outcomes themselves (e.g., credentials, obesity indicators), sensitivity analyses that probe the issue considered here based on the underlying (noncoarsened) variable are essential. Such analyses will help to better contextualize findings from coarsened variables. In analysis of binary outcomes for which no underlying continuous variable is available (e.g., case-control status), multiple methods, such as both logistic and linear probability models, may be used to probe the sensitivity of the results to the functional form of the model. This will be especially important when the environment is itself nontrivially correlated with the outcome under study.
Although we do not focus here on coarsened outcomes that are nonbinary (e.g., ordered categorical, nominal, or censored/truncated outcomes), we note that many of the concerns raised here would be of relevance in those cases as well. At a minimum, sensitivity analyses probing the persistence of findings across a range of model specifications may be valuable. For example, in an analysis of the highest math course taken by high school students, a variety of models (cumulative link, adjacent-category logit, and locally estimated scatterplot smoothing [LOESS] based on dichotomizations) were used in an attempt to interrogate potential differences in course as a function of genotype when stratified by school socioeconomic status. GxE analysis in the context of such coarsened outcomes is likely to be challenging; future work describing methodological best practices in this domain would be welcome.

Measurement Error
The problem.-Measurement error acts both to bias associated parameter estimates toward zero (Hutcheon, Chiolero, and Hanley 2010) and to distort power calculations. In the specific context of GxE studies, there are several concerns. Measurement error exists in both the operationalized PGS and ENV variables of Equation (2). Measurement error in G, which results from imprecise estimates of the GWAS betas used to construct the PGS, has received some attention (Conley et al. 2016; DiPrete, Burik, and Koellinger 2018; Tucker-Drob 2017). However, less attention has been paid to measurement error in E. Homoscedastic measurement error in E has implications for power (matters may be further complicated in the presence of nonhomoscedastic measurement error, but we focus on the simpler case here). Figure 2 is a simple illustration of this via a simulation study. We assume that we measure both the PGS and the target environmental variable with error. We focus on variation in the reliability of the environmental measure (the x axis) and choose two levels of reliability (which we index as alpha) of 0.25 (on left) and 0.5 (on right) for the PGS; we view these reliabilities as representative of relatively weak and relatively strong polygenic scores given existing GWAS. The main takeaway is that ignoring measurement error with respect to the environment leads to inflated power calculations.
Let us first focus our attention on a PGS with relatively high reliability by current PGS standards (alpha = 0.5) in the case where we have 1,000 respondents. We first assume that there is no error in our environmental measure (region emphasized in gray rectangle). In such a case, power reaches the conventional threshold of acceptability (power = 0.8). As the reliability of our environmental measure declines, however, power becomes increasingly poor. Even when the environment is measured with decent reliability (alpha = 0.7), power is greatly reduced (power = 0.4). In the case where the PGS is of lower reliability, power is even worse (power = 0.2 for an environmental measure of reliability alpha = 0.7). When the PGS is measured with substantial error (alpha = 0.25), even relatively large samples (when considering population-based studies) of N = 10,000 will suffer from power limitations when the environment is also measured with substantial error. These calculations are based upon a toy model that might not be relevant in all cases, but given that interaction studies are power-hungry even without considering measurement error (McClelland and Judd 1993), our model emphasizes the need to carefully consider whether one has reasonable power before conducting GxE studies.
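A simulation-based power calculation of this general kind can be sketched as follows. This is not the model behind Figure 2 and will not reproduce its values: the effect sizes, the definition of reliability as the squared correlation between the observed and true measure, the two-sided 0.05 test, and the function names are all assumptions made for illustration.

```python
import numpy as np


def noisy(true_values, reliability, rng):
    """Unit-variance proxy whose squared correlation with the truth is `reliability`."""
    noise = rng.standard_normal(true_values.shape)
    return np.sqrt(reliability) * true_values + np.sqrt(1 - reliability) * noise


def interaction_power(n, rel_pgs, rel_env, b3=0.15, reps=300, seed=0):
    """Share of simulated studies in which the PGS x ENV term has |t| > 1.96."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        g = rng.standard_normal(n)  # true genetic propensity
        e = rng.standard_normal(n)  # true environment
        y = 0.2 * g + 0.2 * e + b3 * g * e + rng.standard_normal(n)
        # The analyst only sees error-laden versions of both predictors.
        g_obs, e_obs = noisy(g, rel_pgs, rng), noisy(e, rel_env, rng)
        X = np.column_stack([np.ones(n), g_obs, e_obs, g_obs * e_obs])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        sigma2 = resid @ resid / (n - X.shape[1])
        se_b3 = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[3, 3])
        hits += abs(coef[3] / se_b3) > 1.96
    return hits / reps


power_clean_env = interaction_power(n=1000, rel_pgs=0.5, rel_env=1.0)
power_noisy_env = interaction_power(n=1000, rel_pgs=0.5, rel_env=0.5)
print(f"power, ENV measured without error: {power_clean_env:.2f}")
print(f"power, ENV reliability 0.5:        {power_noisy_env:.2f}")
```

Replacing the arguments with the sample size and plausible reliabilities for a given data set gives a rough sense of whether a planned GxE analysis is adequately powered under these simplifying assumptions.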
Recommendations.-We recommend that power analyses be the norm (and not the exception) in GxE research. Traditional power analyses are used to inform key design features, such as the sample size, prior to the implementation of a study. In contrast, power analyses of the type considered here offer information about the power of a study design given existing data (e.g., the sample sizes available from large longitudinal studies such as the HRS and Add Health) and key assumptions about the relevant parameters. In particular, power analyses specifically designed to offer insights into the level of power available given measurement error in both the polygenic score and the environment would be valuable additions when planning analyses of data that are already available. As Figure 2 illustrates, a failure to consider measurement error can lead to inflated estimates of power. Even for samples of several thousand, GxE analyses will be weakly powered absent highly penetrant genetic predictors or environments measured with little noise. Such power analyses are not cure-alls; rather, they hopefully help researchers to better understand the limitations that they face (specifically, the likelihood of observing false negatives) in a given context.

Sample Selection Processes and Internal and External Validity
The problem.-Selection processes complicate inference in observational settings in a number of ways, and studies of GxE are no exception. An often-underappreciated point is that sample selection issues threaten both external and internal validity. We discuss several (potentially overlapping) types of selection that are particularly relevant for GxE research. These sample selection processes limit the population to which GxE findings can be generalized and may lead to spurious results via collider bias (Elwert and Winship 2014). Notably, sample selection may pose a threat both in the discovery GWAS used to identify the betas needed to construct a PGS and in the prediction sample in which the PGS is actually constructed and used to estimate GxE.
We begin with mortality selection. Such selection occurs when a nonrandom subset is lost to mortality and therefore not observed. In studies of older respondents (e.g., the HRS), mortality selection tends to make the resulting sample "healthier, wealthier, and wiser" (Zajacova and Burgard 2013). Mortality selection is especially relevant to GxE research because genotyping is a relatively recent technology; participants in longstanding studies needed to survive long enough to make it into the genotyped subsample. Indeed, there is evidence to suggest that GxE findings may be sensitive to the presence of mortality selection (Domingue et al. 2017). When studying health-related traits, especially in older populations, we need to consider mortality selection's role in shaping findings (Oliynyk 2019). In scenarios wherein mortality can be readily modeled with existing data, one possible analytic solution is to use inverse probability weighting (van der Wal and Geskus 2011) to correct for the role of mortality selection. A related issue is that individuals with certain genetic profiles (for example, those with high genetic liabilities for schizophrenia) may be underrepresented in various data sources (Martin et al. 2016; Meisner, Kundu, and Chatterjee 2019; Pirastu et al. 2020; Taylor et al. 2018). Such selection can also lead to issues of both bias and generalizability in subsequent studies.
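The logic of inverse probability weighting can be sketched with simulated data. This is a generic illustration, not the software described in the cited reference: we assume a single baseline health measure observed for the full cohort, survival to genotyping that depends on it, and a hand-written logistic regression fit by Newton's method. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Baseline health is observed for everyone in the original cohort.
health = rng.standard_normal(n)

# Healthier respondents are more likely to survive until genotyping.
p_survive_true = 1 / (1 + np.exp(-(0.5 + 1.0 * health)))
survived = rng.random(n) < p_survive_true


def fit_logistic(X, y, iterations=25):
    """Logistic regression coefficients via Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        p = 1 / (1 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


# Model survival using data available on the full cohort.
X = np.column_stack([np.ones(n), health])
beta_hat = fit_logistic(X, survived.astype(float))
p_survive_hat = 1 / (1 + np.exp(-X @ beta_hat))

# Weight each survivor by the inverse of the estimated survival probability.
weights = 1 / p_survive_hat[survived]
naive_mean = health[survived].mean()
ipw_mean = np.sum(weights * health[survived]) / np.sum(weights)

print(f"full-cohort mean health:       {health.mean():.3f}")
print(f"unweighted mean in survivors:  {naive_mean:.3f}")
print(f"IPW-weighted mean in survivors: {ipw_mean:.3f}")
```

The same weights could be passed to a weighted regression for Equation (2). The correction is only as good as the survival model, and very small estimated probabilities can produce unstable weights.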
A second issue is that demographic factors play a role in who gets included in genetic studies. This, in turn, has implications with respect to the populations to which results using genetic subsamples may generalize. Of particular note is the massive overrepresentation of European-descent individuals in both GWAS (Mills and Rahal 2019) and PGS (Duncan et al. 2019) studies. This problem is due to several factors, including both the overrepresentation of European-descent individuals in genetic studies and the fact that differences in linkage disequilibrium across groups lead to the GWAS findings performing better in the (predominantly European) samples from which they are derived. Efforts (Mills and Rahal 2020) are underway to monitor (with the hope of then remedying) this problem. In the meantime, researchers have noted that adoption of polygenic scores in precision medicine may exacerbate preexisting health disparities (Martin et al. 2019). A focus on homogeneous samples may lead to issues in GxE if it either severely constricts the relevant environmental variance or even potentially undermines the theoretical motivation suggesting a particular research question (which may necessitate a more diverse sample). In any event, equity concerns need to be in the foreground of genetics research; GxE is no exception.
These selection problems pose threats to both the internal and external validity of GxE studies, and they are important to consider carefully. An additional concern is that nonrandom selection into the analytic samples used in empirical studies may lead to reduced environmental variation, further challenging attempts to make accurate inferences regarding GxE. As an illustration, we consider two key adolescent environments that may be of interest: the socioeconomic circumstances of home and the disadvantage of one's residential neighborhood (Belsky et al. 2019), both from Wave I of Add Health (Harris et al. 2019). As a function of the way the analytic sample becomes a selected portion of the full sample, we observe a decrease in environmental variance. These decreases will lead to even further reductions in our power to detect GxE effects; in particular, power analyses motivated by environmental variation observed in the full sample are likely to overstate true power given that empirical work will then take place with reduced environmental variation. Beyond power concerns, such selection can lead to a reduction in density in certain regions of the distribution of the measured environment, which will increase the challenge of identifying the relevant functional form in that region.
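The link between reduced environmental variance and reduced power can be illustrated with a small simulation. The sketch below is not drawn from Add Health or from the analyses reported here; the variable names, effect sizes, and the selection rule (inclusion probability rising with the environment) are all assumptions chosen for illustration. It compares the standard error of the G x E coefficient in a selected sample with that in a random subsample of the same size, so that any difference reflects restricted environmental variation rather than sample size.

```python
import numpy as np

def interaction_se(g, e, y):
    """OLS standard error of the G x E coefficient in y ~ 1 + g + e + g*e."""
    X = np.column_stack([np.ones_like(g), g, e, g * e])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(np.sqrt(cov[3, 3]))

rng = np.random.default_rng(0)
n = 20000

# Hypothetical standardized environment (E) and polygenic score (G),
# simulated as independent for simplicity.
E = rng.normal(size=n)
G = rng.normal(size=n)
# Assumed data-generating model with a small G x E effect.
Y = 0.2 * G + 0.2 * E + 0.05 * G * E + rng.normal(size=n)

# Assumed selection rule: inclusion in the analytic sample becomes more
# likely as E increases, so low-E respondents are underrepresented.
p_select = 1.0 / (1.0 + np.exp(-2.0 * E))
selected = rng.random(n) < p_select
n_sel = int(selected.sum())

# Random subsample of the same size, to hold sample size constant.
random_idx = rng.choice(n, size=n_sel, replace=False)

var_full = E.var(ddof=1)
var_sel = E[selected].var(ddof=1)
se_sel = interaction_se(G[selected], E[selected], Y[selected])
se_rand = interaction_se(G[random_idx], E[random_idx], Y[random_idx])

print(f"Var(E): full sample = {var_full:.3f}, selected sample = {var_sel:.3f}")
print(f"SE of G x E: random subsample = {se_rand:.4f}, selected = {se_sel:.4f}")
```

Because selection in this sketch depends only on the environment, the interaction estimate is not biased; only its precision suffers. A power analysis calibrated to the full-sample variance of E would therefore be too optimistic for the selected sample.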
Recommendations.-Issues concerning selection require careful attention. Figure 3 suggests that they may have implications that need to be accounted for in other aspects of study design (i.e., are power analyses based on the appropriate quantities?). We further suggest two ways that research may approach these issues. First, the selection issues discussed here have implications for generalizability. Some forms of this problem are obvious. Given, for example, the problems of analysis in ancestrally heterogeneous samples and the subsequent work on samples of relatively limited genetic diversity, it would be imprudent to interpret GxE findings from such a study as applying in the broader population containing a fuller spectrum of genetic diversity. But it may also be the case that selection introduces other factors that limit generalizability. For example, long-lived smokers may be quite different from the general population (Levine and Crimmins 2014); inference based on such samples may be misleading.
Second, on the analytic side, attempts to model the relevant selection processes may lead to direct insights into the degree of generalizability of patterns. For example, researchers may examine how results change when using formal techniques that correct for selection (e.g., inverse probability weighting [Cole and Hernán 2008]). Even less comprehensive analyses of selection processes may lead to insights about the nature of the analytic sample and offer guides to generalization that researchers can communicate alongside the relevant empirical results.
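As a rough illustration of this kind of sensitivity check, the following sketch simulates a cohort in which survival into the genotyped sample depends on a baseline covariate, fits a selection model, and compares unweighted and inverse-probability-weighted estimates of a G x E coefficient. Everything here is an assumption made for illustration: the covariate `health`, the effect sizes, the selection model, and the helper functions `fit_logistic` and `weighted_ols` are hypothetical and are not the procedures used in the studies cited above. The weights can only correct for selection on covariates that are observed for the full cohort and included in the selection model.

```python
import numpy as np

def fit_logistic(X, s, n_iter=25):
    """Logistic regression via Newton's method (X includes an intercept)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (s - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def weighted_ols(X, y, w):
    """Weighted least squares coefficients."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

rng = np.random.default_rng(1)
n = 40000

health = rng.normal(size=n)  # hypothetical baseline covariate, observed for all
E = rng.normal(size=n)       # hypothetical environment
G = rng.normal(size=n)       # polygenic score (in practice, genotyped only)

# Assumed model: the G x E effect varies with health; its cohort average is 0.10.
Y = 0.3 * G + 0.3 * E + (0.10 + 0.20 * health) * G * E + rng.normal(size=n)

# Assumed selection: healthier respondents are more likely to survive
# long enough to be genotyped.
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * health)))
selected = rng.random(n) < p_true

# Step 1: model selection using data available for the full cohort.
Z = np.column_stack([np.ones(n), health])
gamma = fit_logistic(Z, selected.astype(float))
p_hat = 1.0 / (1.0 + np.exp(-Z @ gamma))

# Step 2: weight each selected respondent by 1 / P(selected | health).
w = 1.0 / p_hat[selected]

# Step 3: compare unweighted and weighted G x E estimates.
D = np.column_stack([np.ones(n), G, E, G * E])[selected]
b_unw = weighted_ols(D, Y[selected], np.ones(int(selected.sum())))
b_ipw = weighted_ols(D, Y[selected], w)

mean_unw = health[selected].mean()
mean_ipw = np.average(health[selected], weights=w)

print(f"G x E estimate: unweighted = {b_unw[3]:.3f}, weighted = {b_ipw[3]:.3f}")
print(f"Mean health: unweighted = {mean_unw:.3f}, weighted = {mean_ipw:.3f}")
```

In this setup the unweighted estimate reflects the healthier survivors, whereas the weighted estimate should lie closer to the cohort-average value of 0.10. A large gap between the two would suggest that the pattern observed in the analytic sample may not generalize to the full cohort.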

Conclusion
GxE characterizes both the environmental contingency of genetically linked processes and the genetic contingency of environmentally linked processes. In our view, GxE studies involving human behavior and polygenic scores may offer valuable insights but are also at risk of repeating many of the mistakes made by previous eras of research (e.g., the candidate gene era). Our goal has been to emphasize the need for careful thinking about the rationale and methods underlying investigations of GxE.
In particular, we highlighted four issues that deserve additional scrutiny in future research: selection of the relevant environment, analysis of coarsened outcomes, the role of measurement error, and issues of sample selection. We also attempted to offer recommendations for beginning to address each problem. We readily acknowledge that ours is a relatively modest set of recommendations that will not fully resolve the vast range of analytic and inferential challenges associated with GxE research with PGSs.
An overarching goal of research examining the combined genetic and environmental contributions to human behavior is to help construct useful models of human behavior. In our view, useful models avoid unnecessary complexity when accounting for messy data.
At its best, GxE research can help inform the construction of such models by parsimoniously showcasing complexities from empirical reality that need to be accounted for. For instance, GxE research can help reveal important heterogeneity in developmental processes, treatment responses, and policy effects. To be informative, however, we must exercise care. Otherwise, GxE research threatens to introduce confusion into the already challenging landscape of social and behavioral science research.

1.
Exactly how predictive a PGS is of a given trait depends on both the trait's heritability and the sample size of the GWAS used to derive the effect size estimates; see Figure 2 of Harden and Koellinger (2020).

2.
We note that one could alternatively discuss environmental effects differing as a function of genetics; we utilize the original formulation in this article but note that the latter may occasionally be the more germane.

3.
In practice, polygenic scores may contain information on correlated nongenetic factors (e.g., population stratification and dynastic effects like genetic nurture) in addition to true genetic risk (Morris et al. 2020).

4.
Specifically, convex lenses have such image-inverting properties. Here we use "lens" as shorthand for convex lens but note that concave lenses do not have this property.