Twentieth Century Intercohort Trends in Verbal Ability in the United States

Vocabulary test score trends from the General Social Survey contradict the widespread conclusion that scores on standardized intelligence tests have systematically increased over the past century. We use a vocabulary test included in 20 nationally representative surveys administered since 1974 to test three hypotheses proposed to account for these trends, including changes in the formal measurement properties of the test, over-time changes in the meaning of education, and intercohort differences in exposure to words on the test. We find no support for the idea that test scores have declined because of changes in the structure of the test. Instead, our results show that education selectivity accounts for some cohort differences among prewar cohorts and that cohort-specific differences in exposure to words on the test account for nearly all variation in vocabulary scores of respondents born after 1945, suggesting different causal processes have influenced cohort verbal ability during distinct historical eras.

Results from standardized intelligence tests indicate that scores have systematically increased over the past century throughout the world, with substantial evidence pointing to generally higher test scores among more recent cohorts across a range of cognitive measures. This historical rise in scores on standardized tests from one generation to the next, often called the "Flynn Effect" (Neisser 1997, 1998), was named for political scientist James Flynn, who discovered and quantified its pervasiveness (Flynn 1984, 1987a, 1987b, 1998, 1999). Results supporting the Flynn Effect square with expectations that increasing levels of education bring about better intellectual performance in a population (Baker et al. 2015), what has elsewhere been referred to as the progress model of societal cognitive development (Alwin 2009; Alwin and Pacheco 2012). Flynn has argued that the phenomenon of rising IQ scores is not genetic in origin but, rather, has more to do with an environmental change toward greater emphasis on scientific, rational thinking (see Flynn 2007, 2012). Others have focused on improvements in health, nutrition, and education (e.g., Tuddenham 1948; Neisser 1997, 1998; Lynn 2009). The general consensus in recent years is that the Flynn Effect does not reflect genuine changes in intelligence (e.g., g factor; Jensen 1998) but instead reflects the kind of analytic ability at a premium in the developed world (Flynn 2012; te Nijenhuis and van der Flier 2013).
Researchers report generally higher test scores among more recent birth cohorts across a range of cognitive measures (Schaie 2005, 2008; Schaie, Willis, and Pennack 2005), providing further support for expectations that increasing levels of education bring about better population-level performance on cognitive tests. However, it is well known that test scores in the United States have not always risen over time, nor have they risen in all domains. Indeed, during the 1960s and 1970s, there were serious concerns in the United States about the declining performance by the young on measures of verbal ability. Average verbal scores on standardized tests like the Scholastic Aptitude Test (SAT) and the American College Testing Service (ACT) tests declined systematically from the mid-1960s through the mid-1980s (see Harnischfeger and Wiley 1976; Wirtz et al. 1977; Ferguson 1976; Cleary and McCandless 1976; Congressional Budget Office 1986, 1987; Gardner et al. 1983).
Although there are serious problems with interpreting changes in college admissions test scores as if they reflect true aggregate levels of verbal and quantitative skills of high school students, given the changing composition of the test-taking population (Wirtz et al. 1977), most observers of test score declines concede that net of these compositional shifts, the test score decline was and is real (Blake 1989; Alwin 1991; for a counterargument, see Wilson and Gove 1999). SAT scores on the verbal dimensions have recently continued to decline, dropping to their lowest point in four decades (Chandler 2011; Layton and Brown 2012). Birth cohort trends in verbal ability derived from the General Social Survey (GSS) vocabulary test score data reproduce the SAT score declines almost exactly with respect to their direction and timing (see Alwin 1991; Alwin and McCammon 1999, 2001; Glenn 1994), adding further support to the test score decline thesis but also contradicting the cohort patterns implied by the literature surrounding the Flynn Effect.
A number of plausible arguments have been advanced to account for the declining levels of vocabulary knowledge among post-World War II birth cohorts in the United States, which we only briefly mention here. Explanations have ranged from changes in societal values (Jencks 1979), declines in motivation and effort in school (Winter 1977; Menard 1988), increased time watching television and declines in reading (Wirtz et al. 1977; Schramm 1977; Morgan 1986; Glenn 1994, 1999), changes in family structure (Zajonc 1976), increased substance usage among students (Wirtz et al. 1977), simplification of curriculum (Chall 1983, 1996; Chall and Conard 1991; Hayes, Wolfer, and Wolfe 1996; Stedman 1996), and aging (Wilson and Gove 1999). We extend this body of work by examining three additional hypotheses: changes in the measurement properties of the test (Malhotra, Krosnick, and Haertel 2007), macrosocial changes in word usage (Alwin and McCammon 1999), and education selection (Huang 2003).
The first hypothesis that has been advanced to account for the intercohort decline in the GSS data is that the vocabulary test cohort trends (hereafter referred to by the scale name, WORDSUM) are due to changes in the measurement properties of the test (Malhotra, Krosnick, and Haertel 2007). This explanation leverages advances in psychometric methods to consider intercohort differences in item functioning and factorial invariance in WORDSUM. Evidence of either differential item functioning or a lack of measurement invariance between birth cohorts might explain the observed trends.
A second hypothesis advanced to account for the intercohort WORDSUM decline in the post-1940s birth cohorts posits that some or all of the words used in the GSS vocabulary test (and, presumably, many words used in other standardized tests) have simply gone out of fashion (Alwin 1991:628). We here refer to this explanation, which centers on the declining prevalence of WORDSUM test items, as the word exposure hypothesis.
A third argument that is relevant to intercohort variation in vocabulary knowledge, but which applies mainly to differences among the pre-1940 birth cohorts, holds that the expansion of formal schooling over the course of industrialization has been attended by changes in the degree to which students are "selected" into mass schooling. Variation in certain respondent characteristics known to be associated with verbal ability, including individual genetic factors, mortality risk, and socioeconomic status, might select high verbal ability students into formal schooling (Huang 2003). These same selection forces also influence educational attainments. In practical terms, this implies that the meaning of educational attainment has changed as formal schooling became compulsory and universal, such that schooling may not be as selective of high cognitive ability students as it once was. We refer to this phenomenon as the education selection hypothesis.
We systematically investigate these explanations of the intercohort test score decline using three interconnected analytic strategies. In the first step, we conduct a series of statistical tests to address the question of whether between-cohort differences in verbal ability are due to changes in the formal measurement properties of the test.1 We also establish the intercohort comparability of WORDSUM. We employ the Item Response Theory (IRT) model framework (Embretson and Reise 2000), which is a logit-based form of the common factor model (Glöckner-Rist and Hoijtink 2003), to test for changes in the difficulty parameters of the test questions over time and across cohorts. We then use the multi-group confirmatory factor analysis (MGCFA) framework to examine the psychometric properties of the GSS test score. This framework provides a formal test of the invariance of the measurement model and item loadings across cohorts. Not only is factorial invariance an alternate test of change in the measurement properties of the verbal ability index, but it is also a necessary step for establishing the comparability of the test across birth cohorts.
In the second step, we construct a measure of cohort-specific education selectivity and use it in a series of multivariate regression models in which WORDSUM scores are regressed on a measure of education selection and control variables in an effort to account for intercohort differences in verbal ability.
In the third step, we develop an innovative measure of word exposure drawn from the Google Books corpus (Michel et al. 2011) to assess the effect of intercohort differences in exposure to words in published textual material. We use this measure of cohort exposure in multivariate regression models to understand the extent to which the effect of education selection is diminished or amplified. Before presenting results, we first discuss the theoretical and substantive background to the problem and present a description of our data and measures.

Theoretical Background
For many decades, social and cognitive scientists have argued for potentially enduring differences in the childhood and adolescent socialization experiences of birth cohorts, owing to the impact of historical events and exposures during "critical periods" or "impressionable phases of development" (Mannheim 1927; Ryder 1965; Schaie 1984). Each new cohort, the argument goes, is exposed to distinctive experiences during childhood and adolescence that are thought to affect a wide range of behavioral and ideational outcomes and continue to have an influence throughout the life course because of processes that promote stability of individual differences (Alwin and McCammon 2003, 2007). Because of the distinctive and indelible nature of cohort experiences, each cohort has the potential to differ from those that precede and follow it, and such differences have a tendency to persist. As a consequence, cohort turnover is partly responsible for change in the composition of society (Mannheim 1952; Ryder 1965; Firebaugh 1992). From these perspectives, social change is by no means assumed to be due entirely to cohort replacement. Changes that occur within cohorts, including aging and period influences, operate in tandem with changes that occur between cohorts to effect the macrosocial trend in a response variable (e.g., IQ test scores). The twin forces of social change that proximately reside within and between cohorts contribute to an understanding of the dynamic forces that shape individual lives and the composition of society.
sociological science | www.sociologicalscience.com
One area in which cohort experiences appear to leave enduring effects on young people is in their verbal ability, a phenomenon that has received substantial attention, due in part to the inclusion of a vocabulary test in the GSS (Alwin 1991, 2009; Glenn 1994, 1999; Hauser and Huang 1997; Alwin and McCammon 1999, 2001; Bowles, Grimm, and McArdle 2005; Alwin and Pacheco 2012). Amid evidence confirming the test score decline, a longstanding puzzle in the literature on this subject is the existence of a counterintuitive trend toward declining scores on the GSS vocabulary test (WORDSUM), beginning with cohorts born after WWII. By contrast, vocabulary scores steadily increased for cohorts born between 1885 and 1945, thus following the pattern documented by Flynn (2007, 2012), namely monotonically rising test scores for each new birth cohort. The post-1940s trend in the GSS data is not unique, and as noted, it parallels test score declines during the 1960s and 1970s.
There is ample evidence suggesting general cognitive decline in older age (Hofer and Alwin 2008), but it is important to realize that although most cognitive abilities peak in early adulthood and then decline with age, vocabulary knowledge is unique in that it is one of the few skill sets that remain relatively intact over the adult lifespan (Verhaeghen 2003). Research suggests that vocabulary knowledge peaks when people are in their 50s, with only a very slow decline into older age (Denny 1982; Baltes 1987; Salthouse 1991; Park et al. 1996; Alwin and McCammon 2001; Huang 2003; Verhaeghen 2003; Bowles et al. 2005; Schaie 2005). The general conclusion is that while there are detectable aging effects in the GSS data (Alwin and McCammon 2001; Alwin and Pacheco 2012), their magnitude is insufficient to account for the effects of cohort experiences, especially those witnessed among cohorts born after WWII. Leveraging recent methodological developments involving the use of hierarchical age-period-cohort and ridge-regression models, Yang and Land (2006, 2008; see also Yang et al. 2008) found that cohort effects in the GSS data were robust to the presence of controls for both aging and period factors. They concluded that the intercohort decline in vocabulary test scores in the GSS was real, rather than a methodological artifact (Yang and Land 2006, 2008; see also Yang et al. 2008).
Research has shown that intercohort results reported for the GSS data reproduce the SAT score declines almost exactly with respect to their direction and timing and also stand in contradiction to the cohort patterns implied by the literature surrounding the Flynn Effect. In Figure 1, we report the raw, cohort-specific vocabulary test scores derived from 20 social surveys collected from 1974 to 2012 for cohorts born between 1885 and 1988. We report trends using five-year cohort categories, which we superimposed on two additional GSS test score patterns. The first superimposed pattern shows the cohort trends after adjusting for respondent age and education. These data indicate that once verbal ability scores are adjusted for differences in age and level of schooling at time of testing, the overall cohort trend is toward lower vocabulary scores for each new birth cohort. Another set of patterns in Figure 1 depicts the theoretical expectations derived from the Flynn Effect (linear progress model), a pattern that shows a systematic over-time increase in performance on cognitive tests. Taken together, these trends produce some rather surprising conclusions. First, if one assumes the linear progress model is the correct model for predicting average vocabulary scores over cohorts, the raw scores are consistent with the model up to about 1945 and dramatically incompatible thereafter. Second, once the cohort-specific test scores are adjusted to constant age and schooling levels of the population as a whole, the earlier-born cohorts have systematically higher scores, and vocabulary test scores decline across cohorts. This set of adjusted scores provides evidence that is completely reversed from expectations based on the Flynn Effect. Although there is likely more than a single explanation for the adjusted test score trends, and quite possibly a different explanation for cohort disparities at different points across the cohort axis (Alwin 2009), the results contradict test score trends based on other measures of verbal abilities (e.g., Wechsler Adult Intelligence Scale [WAIS]; Flynn 2010). In the next section, we discuss several hypotheses that have been developed to explain the adjusted WORDSUM test score decline.
Figure note: Estimates are sample weighted and adjusted for respondent age and education. First and last birth cohort categories were collapsed as follows: 1900 = 1885-1900; 1980 = 1980-1988.

Prior Explanations
Here, we briefly review a number of hypotheses that have been proposed to account for the intercohort decline in test scores. Jencks (1979:13), for example, suggested that schools and parents failed to instill proper academic values in children during this period, concluding that students "lost respect for the values of reason." Others have proposed different mechanisms generally related to the same theme: that children of the 1960s and 1970s were exposed to a different learning environment, with less motivation and fewer hours spent in homework (Winter 1977; Menard 1988), increased time spent watching television, and less reading (Wirtz et al. 1977; Schramm 1977; Morgan 1986; Glenn 1994, 1999). Zajonc (1976) argued that the changes were due to family configuration (family size, birth order, and birth spacing) brought about by the post-WWII demographic baby boom (see also Alwin 1991). Some also suggested that drug and alcohol use by high school students may have contributed to test score declines during the 1960s and 1970s, but few suggest this as a principal explanation (Wirtz et al. 1977). Education researchers have offered other explanations involving school composition and curriculum. One of these arguments has to do with the systematic "dumbing down" of learning materials in schools, the other with an overall tendency toward declining complexity in schooling content (Chall 1983, 1996; Chall and Conard 1991; Hayes, Wolfer, and Wolfe 1996; Stedman 1996). Despite these changes, public school apologists have maintained that these test score declines were not a reflection on the job schools were doing (Alexander 1997).

Word Exposure
The exposure hypothesis posits that the growing obsolescence of words used in standardized tests accounts for intercohort vocabulary test score declines (i.e., rising item-level difficulty parameters and falling pass rates), rather than objective declines in general verbal ability. Because the English language is dynamic and subject to changing usage and centrality in various forms of communication (Brysbaert, Keuleers, and New 2011), word usage varies by place, time, and across generations (Flynn 2012). The frequency of word usage is positively associated with word recall and test performance (Segui et al. 1982), suggesting that the difficulty of vocabulary test items is associated with word usage (Laufer 1997) and word age (Woodley et al. 2015). If later-born cohorts have not been exposed to the words used in the GSS vocabulary test to the same extent as earlier-born cohorts, it may appear that they have less vocabulary knowledge when, in fact, they simply have different vocabulary knowledge than earlier-born cohorts (Flynn 2012).
The exposure hypothesis has primarily been applied to post-WWII cohorts, in which the effectiveness of the American education system in developing students' basic verbal skills has been questioned (Alwin 1991; Alwin and McCammon 1999, 2001). For example, an analysis of 800 textbooks used in elementary, middle, and high schools between 1919 and 1991 (Hayes, Wolfer, and Wolfe 1996) reported that words in textbooks became progressively easier after WWII. What they referred to as the "systematic dumbing down" of U.S. reading textbooks was embodied in the daily use of simplified instructional materials that resulted in "a cumulating deficit in students' knowledge base and advanced verbal skill" (Hayes et al. 1996:483). In another test of the exposure hypothesis, Hauser and Huang (1997) conducted detailed analyses of English-language dictionaries over the early part of the twentieth century to measure word use frequencies. Based on their study of word counts and rankings of English-language words in dictionaries for 1921, 1931, 1944, and 1967, Hauser and Huang (1997:345) concluded that decreased exposure "occurs to some degree across the lists of 1921, 1931, and 1944, and it appears strongly when we include the 1967 list in the comparison" (see also Thorndike 1921, 1931; Thorndike and Lorge 1944; Kučera and Francis 1967). While these results are suggestive of an effect of differential exposure to word content on vocabulary knowledge, limitations in both studies leave it an open question as to whether declining word exposure is responsible for declining WORDSUM test scores.
One of the major mechanisms for the transmission of knowledge is formal schooling, and there is a strong relationship between time spent in school and vocabulary knowledge, as shown in all of the research on this topic. However, the education payoff for vocabulary knowledge in later-born cohorts may not be reflected in the words included in common tests of verbal ability, due in part, as we argue here, to declining exposure to the words in the test. The experiences of the post-WWII cohorts, both in school and in the extracurricular world, may have systematically lowered vocabulary knowledge, but this does not account for the relatively higher scores of persons in the earliest-born cohorts.
We leverage recent developments in "culturomics" (Michel et al. 2011) to develop an innovative measure of word exposure that allows us to provide the most rigorous test to date of the word exposure hypothesis. This measure is derived from a corpus of more than 500,000,000,000 words used in English-language books published between 1500 and 2000. This corpus makes it possible to directly measure the central mechanism theorized to cause intercohort differences in vocabulary knowledge, rather than relying on birth year as a proxy for these causal mechanisms. In doing so, we are able to establish a much stronger test of the word exposure hypothesis than has heretofore been possible (see Chall et al. 1977 for an early analysis of test scores using the content of books).
Figure note: Estimates are sample weighted. A three-year moving average was applied to the schooling trend line. Birth cohort reference category (1946-1950) identified by vertical line.

Selectivity of Education
The GSS data provide evidence of a strong relationship between cohort membership and levels of schooling, based on calculations from 38 years of GSS data spanning the years 1974-2012. As can be seen in Figure 2, systematic cohort differences in educational attainment reveal a strong association between year of birth and completed years of schooling among cohorts born in the first half of the twentieth century. This trend, however, leveled off considerably beginning with cohorts born after WWII, after which cohorts do not appreciably differ in schooling attainments. These trends imply that education was far more selective on factors associated with verbal ability prior to 1950 than afterward.
A high school degree was substantially less common in 1900 than in 2000, suggesting that the meaning of educational attainment has changed as formal schooling became compulsory and universal. Early in the century, only the top students went on to college, whereas now much of the public is exposed to at least some postsecondary schooling. In essence, the scale that we use to measure education, years of schooling, may have changed, such that a "year" of schooling measured in, say, 1920 may mean something very different from a "year" of schooling measured in 2000.
This also suggests that comparing individual-level educational attainment across cohorts is insufficient to account for the relationship between formal schooling and vocabulary test scores. Respondent-level variation in characteristics associated with verbal ability, such as latent cognitive abilities, health status, socioeconomic status, and other family background variables, might over-select children with high verbal ability into formal schooling and under-select children with low verbal ability into school (Huang 2003), though some question this assertion (Baker et al. 2015).
Still, despite the over-time rise in educational attainment, the evidence from the GSS cited earlier suggests that education has diminishing returns among more recent cohorts, at least with respect to vocabulary knowledge. If so, then it is necessary to adjust vocabulary test scores not only for respondent education but also for cohort-specific educational attainments. We expect that the potential for education selectivity to account for the test score decline is greatest in the cohorts born before WWII, as this was the period when intercohort differences in educational attainment were largest.

Changes in Test Functioning
Of central concern in this research is the apparent increase in the difficulty of the GSS vocabulary test among more recently born cohorts. Researchers have devised several strategies to measure over-time changes in the difficulty of WORDSUM, but the results are mixed. In their analysis of English-language dictionaries over the early part of the twentieth century, Hauser and Huang (1997) concluded that "WORDSUM has become somewhat more difficult across time, independent of any other change in verbal ability in the general population" (Hauser and Huang 1997:344-345). While the Hauser and Huang (1997) results are suggestive, they do not go beyond the measurement of lexical word frequencies, nor do they provide a direct test of the formal measurement properties of WORDSUM.
Another approach is to investigate the psychometric properties of the test items, including the systematic analysis of item difficulties (Malhotra, Krosnick, and Haertel 2007; Beaujean and Sheng 2010; Alwin and Pacheco 2012). This work has shown that WORDSUM item difficulties, measured by the proportion of the population that correctly answers the question, have not appreciably changed. Alwin and Pacheco (2012) concluded that "some words in the WORDSUM score are getting easier, some are getting harder, but none have changed substantially; the predominant pattern is toward decreasing rather than rising item difficulty" (p. 353). These findings led them to draw the opposite conclusion from that of Hauser and Huang (1997), indicating that changes in the difficulty of the GSS items were not in the direction of increasing difficulty.
Some recent literature suggests the content of the GSS WORDSUM score may be more complex than heretofore acknowledged and that the specific content of the test score needs to be taken into account in assessing the nature of these trends. Bowles et al. (2005) found that rather than being a univocal measure of verbal ability, the GSS test items were composed of two underlying factors: one reflecting general knowledge (basic words) and one reflecting specialized knowledge (advanced words). These findings are compatible with difficulty-related vocabulary factors in other sets of intelligence tests (Bailey and Federman 1979; Beck et al. 1989; Gustafsson and Holmberg 1992). Alwin and Pacheco (2012) compared the difficulties of WORDSUM items with those of the same items used in a 1941 Gallup survey. They found that words that were comparatively easy in 1941 became easier (as assessed in the 2004-2008 GSS samples) and that those that were more difficult in 1941 became more challenging in the more recent 2004-2008 GSS (p. 352-53). Using this broader compass, Alwin and Pacheco (2012) conclude that the items measuring basic knowledge have become somewhat easier while those items measuring advanced knowledge have become somewhat more difficult. But because they did not directly assess intercohort factorial invariance, it is unclear whether the patterns they reported reflect genuine changes in the population. We replicated the Alwin and Pacheco (2012) analyses using a more powerful set of research strategies to compare word difficulty over time and across birth cohorts, providing a direct quantitative assessment of whether the GSS vocabulary test is becoming more difficult. In short, rather than assessing item difficulty using pass rates (proportion correct), we gauge item difficulty using IRT difficulty parameter estimates, as this approach yields a more statistically justified measure. For those unfamiliar with IRT models, we note that the difficulty parameters in IRT models estimate the point on the underlying continuum of word knowledge where half the population is expected to correctly answer the question. To save space, we will only briefly summarize results of formal tests of the measurement properties of the GSS vocabulary test, but we present a detailed discussion, including point estimates and other model results, in the online supplement.
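To make the interpretation of an IRT difficulty parameter concrete, the following minimal sketch assumes a two-parameter logistic (2PL) item response function. The text does not state which IRT parameterization was estimated, so the functional form, the function name `prob_correct`, and the parameter values are illustrative assumptions rather than the authors' model.

```python
import math

def prob_correct(theta, difficulty, discrimination=1.0):
    """Assumed 2PL item response function (illustration only).

    theta          -- a respondent's position on the latent word-knowledge scale
    difficulty     -- the item's difficulty parameter (same scale as theta)
    discrimination -- how sharply the item separates low from high ability
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Hypothetical item with difficulty b = 0.5 (made-up value).
b = 0.5
at_difficulty = prob_correct(theta=b, difficulty=b)   # respondent located exactly at b
below = prob_correct(theta=b - 1.0, difficulty=b)     # respondent one unit below b
above = prob_correct(theta=b + 1.0, difficulty=b)     # respondent one unit above b
```

Under this assumed form, a respondent located exactly at the difficulty parameter has an even chance of answering correctly, so an item whose estimated difficulty rises across cohorts demands more latent word knowledge for the same chance of success.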

Samples
Our analysis relies on 20 nationally representative, cross-sectional samples of U.S. adults, collected at roughly biennial intervals from 1974 to 2012 as part of the GSS (Smith et al. 2012). In each of the early GSS surveys, approximately 1,500 respondents were interviewed. For several years, the vocabulary test was only administered to two-thirds of the GSS sample. In more recent years, the GSS employed a larger overall sample size (∼3,000), so the number of respondents who completed the test since the 1990s has averaged roughly 1,500 per wave.
The inclusion of a vocabulary test in the GSS data has a number of advantages. Respondents in each sample completed an identical 10-item vocabulary test, allowing researchers to conduct more rigorous tests regarding cohort differences in verbal ability. Another strength of nationally representative data such as the GSS is that they contain a large amount of variation in the age, birth year, and educational attainment of research subjects, each of which is strongly associated with cognitive test scores. In keeping with the precedent set in previous research, we restrict the composition of the GSS sample to the native-born, English-speaking population (see Alwin 1991; Alwin and McCammon 1999, 2001). The GSS sample is typically biased in the lower adult age ranges because the GSS does not interview eligible respondents on college campuses or military reservations (see Alwin 1991:628). For this reason, we restrict our sample to those aged 24 or older. We employ weights provided by the GSS study staff (variable name: WTSALL) that adjust for household size, oversampled subgroups, and experimental design effects of the GSS samples (see Appendix A 2932 of GSS documentation for details of sample and design weights).

Verbal Ability
Measures of vocabulary knowledge correlate highly with tests of general intelligence, usually in the range of 0.7 to 0.8 or higher (see Miner 1957:28-31), and are considered good indicators of the verbal component of general intelligence tests (see also Wolfe 1980; Hauser and Huang 1997). The vocabulary test in the GSS data is one such test that has been widely used as a measure of verbal ability, with more than 100 published articles and chapters using these data in recent decades (Cor et al. 2012). WORDSUM is composed of 60 words or short phrases contained in ten questions, each consisting of the target word and five answer choices from which respondents must identify the correct synonym (one correct answer and four distractors).2 We identify the individual items from the GSS WORDSUM score using the mnemonics WORD A, WORD B, WORD C, et cetera, in keeping with the GSS study's desire not to publicize these words. Representative content from a related vocabulary test (Thorndike 1942) is given by Miner (1957:53). We recoded "Don't Know" responses to incorrect answers so that a 1 indicates a correct answer and 0 identifies answers that are not correct. A small amount of additional incomplete data was removed from the analysis using listwise deletion (less than 0.5 percent of cases, n = 104) for a final sample of 22,684.
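The recoding and listwise deletion described above can be sketched as follows. This is a minimal illustration that assumes the scale score is the count of correct items; the response codes, the answer key, and the function names are hypothetical and do not reproduce the actual GSS items or coding scheme.

```python
DONT_KNOW = "DK"  # hypothetical code for a "Don't Know" response

def score_item(response, correct_choice):
    """Return 1 for a correct answer and 0 otherwise; "Don't Know" scores 0."""
    return 1 if response == correct_choice else 0

def wordsum(responses, answer_key):
    """Sum the ten item scores; return None if any item is missing,
    which mimics listwise deletion of incomplete cases."""
    if any(r is None for r in responses):
        return None
    return sum(score_item(r, k) for r, k in zip(responses, answer_key))

# Made-up answer key and respondents (not the real test content).
answer_key = [3, 1, 4, 2, 5, 1, 2, 3, 4, 5]
complete = [3, 1, 4, DONT_KNOW, 5, 1, 2, 2, 4, 5]   # one "Don't Know", one wrong choice
incomplete = [3, None, 4, 2, 5, 1, 2, 3, 4, 5]      # one missing item

score_complete = wordsum(complete, answer_key)
score_incomplete = wordsum(incomplete, answer_key)
```

Treating "Don't Know" as incorrect keeps those respondents in the sample, whereas a genuinely missing item removes the whole case.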
An important distinction between the GSS vocabulary test and the WAIS verbal ability test is that WAIS uses a "free recall paradigm," whereas the GSS test uses a "recognition paradigm" (Flynn 2012). The difference between the two approaches is that WAIS is designed to measure active vocabulary, which people typically acquire through conversation, while the GSS test measures passive vocabulary of the kind typically acquired by reading books.

Education Selectivity
We measure education selection using a cohort-level estimate of average years of educational attainment in our samples. Self-reported education was used to generate estimates of mean cohort education levels (see Table 1 and Figure 2 for quantitative and qualitative presentations of the measure, respectively). This measure reflects cohort differences in the selectivity of the education system over historical time. Net of this measure, the person-level years of completed schooling capture respondents' within-cohort relative ranking in the education distribution.

Measures of Word Exposure
The data source from which we developed a cohort-specific measure of word exposure was the Google Books corpus (Michel et al. 2010). The now-public data, most commonly known through the Google Books Ngram Viewer (https://books.google.com/ngrams), currently contain roughly 500,000,000,000 English words dating back to the 1500s. In total, the more than 15,000,000 scanned and digitized books represent approximately 12 percent of books published between 1500 and 2009 (Michel et al. 2010).³ Validity tests of these data demonstrate that, compared to American English-language subtitle lexicon data, the Google Books corpus has similar predictive power in explaining word recall (Brysbaert et al. 2011). Because of post-2000 changes in the composition of the corpus and in the method of digitizing books,⁴ researchers should use caution in interpreting changes in word frequencies during this period, as they may reflect compositional changes in the database rather than functional changes in word usage.
A case-insensitive data query was used to extract raw frequencies: counts of each word comprising the WORDSUM test, the year in which the word appeared in a published book, the number of times it was used, and the total number of words in the corpus for every year from 1885 to 2003. In keeping with precedent, we converted the annual frequencies into percentages to account for the substantial increase in the overall size of the corpus from one year to the next.⁵ Cohort theories posit that socialization occurring during a person's formative years, typically understood to span the period from birth through adolescence, tends to leave an indelible mark that persists over the life course (Mannheim 1952; Ryder 1965). Likewise, crystallized intelligence, including verbal ability, undergoes substantial development during a person's early years, due in part to the learning environments of the home, community, and school (Cattell 1963, 1971a, 1971b; Horn 1982a, 1982b; Horn and Cattell 1967; Horn and Donaldson 1976, 1980). For the purposes of this research, we define "early life" as the first 15 years of life. Using this definition, we averaged the word exposure percentages from year of birth to year in which respondent reached age 15 using a 15-year, forward-moving average. We then standardized and averaged these 60 variables into three summary measures of cohort word exposure, including a 36-item measure of basic vocabulary knowledge (6 questions * 6 words), a 24-item measure of advanced vocabulary knowledge (4 questions * 6 words), and a 60-item measure of general vocabulary knowledge (all 10 questions * 6 words).⁶ The summary measures of cohort word exposure contain 104 unique values, one for each birth cohort in the study, with ranges and means approximating a z-score distribution. Descriptive information on education selection, word exposure, WORDSUM, and other key variables is reported in Table 1.
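The construction of the exposure measure can be illustrated with a short sketch. It follows the steps described above (annual percentage shares, a forward-moving average beginning at the birth year, standardization of each word, and an equally weighted average across words), but it is not the authors' code. The function names and the toy counts are hypothetical, and the window is assumed to cover the birth year plus the following 14 years; whether the original window also includes the calendar year in which the respondent turned 15 is not stated.

```python
from statistics import mean, pstdev


def yearly_share(word_counts, total_counts):
    """Percent of all corpus words in each year accounted for by one word."""
    return {y: 100.0 * word_counts[y] / total_counts[y] for y in word_counts}


def forward_average(share, birth_year, window=15):
    """Mean share over `window` consecutive years starting at the birth year
    (the window length is an assumption, see the note above)."""
    return mean(share[y] for y in range(birth_year, birth_year + window))


def standardize(values):
    """Convert values to z-scores using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def cohort_exposure(word_shares, cohorts, window=15):
    """Equally weighted average of standardized early-life exposure across
    words, returning one value per birth cohort."""
    z_by_word = [
        standardize([forward_average(share, c, window) for c in cohorts])
        for share in word_shares
    ]
    return [mean(z[i] for z in z_by_word) for i in range(len(cohorts))]


# Made-up counts for two hypothetical words, 1900-1919 (not real corpus data).
years = range(1900, 1920)
totals = {y: 1_000_000 for y in years}
falling = {y: 500 - 10 * (y - 1900) for y in years}
rising = {y: 100 + (y - 1900) ** 2 for y in years}
shares = [yearly_share(falling, totals), yearly_share(rising, totals)]
exposure = cohort_exposure(shares, cohorts=[1900, 1901, 1902, 1903, 1904])
```

Standardizing each word before averaging is what keeps a handful of very common words from dominating the composite, which mirrors the equal weighting of items in WORDSUM itself.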

Factorial Structure and Item Functioning
We begin by briefly summarizing our analysis of the measurement structure of WORDSUM, including exploratory factor analysis, IRT analysis, and MGCFA. Full results, including point estimates and visual analysis, can be found in the online supplement. These analyses allowed us to assess the viability of making intercohort comparisons in mean scores on the GSS vocabulary test.
Model results (see online supplement Table A2) provided unequivocal evidence that WORDSUM is composed of two factors, one based on easy words and one based on advanced words (Bowles et al. 2008). A two-factor solution vastly improved the fit of the model to the data compared to a single-factor solution. Variation in item difficulties (an individual ability level ranging from -3 to +3 that identifies the point at which the respondent has a 50 percent chance of answering the test question correctly)⁷ provides further evidence of two factors in this measure of verbal ability. With item difficulties below -0.5, words A, B, D, E, F, and I measure basic verbal ability. Words C, G, H, and J, with item difficulties above 0.5, measure advanced verbal ability.
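The interpretation of item difficulty as the ability level at which a respondent has a 50 percent chance of answering correctly can be made concrete with a small sketch. The text does not state which IRT model was estimated, so a two-parameter logistic form is assumed here purely for illustration; the function name and the discrimination value are hypothetical, and the difficulties of -0.5 and 0.5 are the cut points named above, not estimated parameters.

```python
import math


def prob_correct(theta, difficulty, discrimination=1.0):
    """Two-parameter logistic item response function (an assumed model):
    probability that a respondent with ability `theta` answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# A respondent of average ability (theta = 0) facing one item at each cut point.
p_basic = prob_correct(theta=0.0, difficulty=-0.5)
p_advanced = prob_correct(theta=0.0, difficulty=0.5)

# When ability equals the item difficulty, the probability is one half.
p_at_difficulty = prob_correct(theta=0.5, difficulty=0.5)
```

Under this form, an average respondent is more likely than not to answer a basic item correctly and less likely than not to answer an advanced item correctly, which is the sense in which the two sets of words separate.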
Period and cohort trend analysis of item difficulty parameters identified a slight decrease in item difficulty on the basic words among pre-1945 cohorts and a slight increase in item difficulty on the advanced words beginning with the post-1945 cohorts, but these trends are not linear. After 1945, IRT difficulty trends on the basic words were relatively flat, though several items did increase slightly in difficulty among the most recent birth cohorts.
MGCFA model results, which were estimated in Mplus (Muthén and Muthén 1998-2011; Muthén 2002), indicate that the formal measurement properties of the two-factor measure of vocabulary knowledge do not appreciably vary by birth cohort. Applying equality constraints to the factor loadings and item thresholds (scalar equivalence) resulted in a 0.009 (0.986-0.977) change in the Comparative Fit Index (CFI) from the unconstrained model (configural equivalence). Other model fit statistics, including the Root Mean Square Error of Approximation (RMSEA) and Tucker-Lewis index (TLI), show very modest differences between configural and scalar invariant models (0.005 and 0.003, respectively). The period trend in the difficulty parameters is consistent with prior IRT analyses of the GSS verbal ability scores (Beaujean and Sheng 2010; Alwin and Pacheco 2012). These results indicate that changes in the formal measurement properties of the GSS measure of verbal ability are not the cause of the intercohort decline in verbal ability.
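The fit comparison reported above amounts to a simple difference between the configural and scalar models. The sketch below reproduces that arithmetic for the CFI. The 0.01 tolerance is a commonly used rule of thumb in the measurement invariance literature and is an assumption here; the text does not state which criterion, if any, was applied, and the names are hypothetical.

```python
# CFI values reported in the text for the two nested models.
CFI_CONFIGURAL = 0.986
CFI_SCALAR = 0.977

# Assumed rule-of-thumb tolerance; not taken from the article.
ASSUMED_TOLERANCE = 0.01


def fit_change(configural, scalar):
    """Absolute change in a fit index after adding equality constraints."""
    return round(abs(configural - scalar), 3)


delta_cfi = fit_change(CFI_CONFIGURAL, CFI_SCALAR)
within_tolerance = delta_cfi <= ASSUMED_TOLERANCE
```

A change this small is the basis for treating the loadings and thresholds as comparable across cohorts.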
By documenting the measurement invariance of the two-factor model for vocabulary measures, the results reported here provide the first evidence for intercohort comparability of WORDSUM. This is important because previous research has implicitly assumed factorial invariance without formally testing this assumption. These results also confirm the existence of two factors, basic and advanced vocabulary knowledge, in WORDSUM and establish the intercohort comparability of the GSS measure that is assumed in the regression analyses presented below (Horn, McArdle, and Mason 1983).
We next report results from trend analysis of word exposure and regression analysis of the effect of education selection and word exposure on cohort differences in verbal ability.

Word Exposure
Figure 3 reports exposure to the words comprising the GSS vocabulary test separately for each item and cumulatively in the form of two indexes: one each for the basic and advanced vocabulary factors identified in previous research (Malhotra, Krosnick, and Haertel 2007). As can be seen from the word usage trends, some items have become more common while others have become less common. For example, the 12 words and phrases comprising items A and B have increased in frequency over the twentieth century, while the words comprising items E and I have decreased in frequency. Others, such as Word C and Word F, exhibit nonlinear patterns of exposure.
Exposure to the words comprising the basic vocabulary test items on WORDSUM has become less common over the course of the twentieth century. Exposure to advanced words, on the other hand, declined only slightly prior to 1950, after which these words also became less common (bottom right two panels of Figure 3).

Multivariate Models
A number of modelling strategies have been utilized in prior research studying age, period, and cohort (APC) effects on WORDSUM scores in the GSS data, including linear, hierarchical, and ridge regression models (Yang and Land 2006, 2008; Yang et al. 2008). This body of work failed to detect period influences in the GSS data, instead finding that cohort effects in the GSS data are robust to the presence of measures of age and period factors. The growing consensus from this research is that changes in WORDSUM test scores are not related to factors associated with historical time. These findings, coupled with the absence of a clear theoretical justification for assuming the presence of period influences that affect the general cognitive abilities of the entire test-taking population from one occasion of measurement to the next, suggest that linear regression models are appropriate for testing our hypotheses. In the analysis that follows, we deploy a mechanisms-based approach (Winship and Harding 2008) to measure the effect of word exposure and education selection on intercohort differences in verbal ability. Rather than use birth year as a proxy for cohort effects, we deploy measures of the mechanisms hypothesized to differentially transmit verbal ability across birth cohorts, including exposure to formal education and early-life exposure to words on the GSS vocabulary test.
Note (Figure 3): Word exposure measured using the standardized share of WORDSUM test words relative to all words in the Google corpus. We use a 15-year, forward-moving average to smooth the exposure data. See methods section for detailed explanation.
Table 2 contains regression coefficients and model statistics in which we measured the effect of word exposure and education selection on intercohort differences in verbal ability. The relevant comparisons in Table 2 are changes in the magnitude, direction, and statistical significance of the cohort coefficients from the baseline models that control only for respondent schooling and age (models 1, 4, and 7) to the models that include additional controls for education selection (models 2, 5, and 8) and word exposure (models 3, 6, and 9). The key finding, which can be seen by comparing point estimates in models 1 and 2 and models 4 and 5, is that cohort differences in educational attainment accounted for virtually all of the prewar cohort decline in basic and advanced vocabulary knowledge. The inclusion of a measure of education selection effectively eliminated intercohort test score differences, as measured by cohort-specific slopes. In line with our assertion, education selection had a negligible effect on postwar cohort mean test scores. The twin effects of respondent education and cohort education levels on the intercohort trend in verbal ability offer support for arguments regarding the influence of mass education on population trends on standardized cognitive tests (Baker et al. 2015).
We also find that controlling for word exposure substantially influenced the adjusted cohort test score trends in these data compared to the baseline cohort test scores. Exposure to advanced vocabulary words effectively accounted for all of the postwar cohort test score deficit (compare models 4 and 5 to 6). Controlling for exposure to advanced vocabulary words reduced the number of statistically significant cohort coefficients from nine to one (compare models 5 and 6). In nearly all instances, changes in the statistical significance of the cohort coefficients were due to substantial changes in the estimated slopes in the predicted direction, attended by relatively smaller increases in the standard errors.⁸ Exposure to the WORDSUM test content found in English-language printed books exerted substantively meaningful effects on cohort test scores, net of respondent age, education, and average cohort schooling. Controlling for word usage in books contributed to a 50 percent or greater decline in the size of postwar cohort slopes on the full-scale measure of verbal ability (compare models 8 and 9), net of the effects of other variables in the model.
Thus, the substantial cohort increase in average completed schooling during the first half of the twentieth century accounts for cohort differences in advanced vocabulary knowledge prior to 1950, and declining exposure to advanced words during the second half of the century effectively accounts for cohort variation in advanced vocabulary knowledge after 1950. These findings agree with the Hauser and Huang (1997) assertion of time-varying effects of word exposure on vocabulary knowledge, whereby education selection operates among earlier-born cohorts and word exposure operates among both earlier- and later-born cohorts, though with greater effect on recent birth cohorts. These results provide direct empirical support for both the exposure and selection hypotheses and offer additional evidence in support of environmental accounts of intercohort test score patterns in the GSS data (Flynn 2012).
These results also demonstrate that virtually all of the explanatory power for the cohort slopes on the general vocabulary factor resides in exposure to advanced vocabulary words, leading us to conclude that declining exposure to advanced vocabulary words is an important determinant of macrosocial test score trends in the GSS data. One reason that prewar cohorts perform better on the GSS vocabulary test than postwar birth cohorts is that many of the difficult words on standardized tests such as WORDSUM are, by definition, "old" words (Roivainen 2013). As words age and fall out of fashion, as a great many words invariably do, they become more "difficult" in an empirical sense by virtue of their decreasing usage in various forms of communication.⁹ The declining prevalence of words gives earlier-born cohorts an advantage on vocabulary tests containing words that have fallen out of favor, as is the case in the GSS test (see Figure 3). Our models show that once we account for the prewar advantage in exposure to advanced words, differences in age, education, and cohort schooling, recent cohorts perform better on the test than earlier-born cohorts.
We performed various robustness checks by including additional control variables (such as respondent sex, race, number of siblings, and occupational prestige) and controls for rural upbringing, early life spent in the South, a measure of family intactness, and the educational attainment of the respondent's father and mother. The inclusion of these control variables resulted in a modest improvement in model fit, and all effects were statistically significant, with the exception of family intactness. The presence of additional covariates did not, however, have a meaningful effect on our inferences. If anything, their inclusion in the models increased the effect of exposure to advanced words on both general and advanced vocabulary knowledge, providing further support for the conclusions we draw from these analyses.
To summarize the key findings from Table 2 and further establish the main findings of this article, we report the cohort coefficients visually in Figure 4. The substantial over-time decrease in exposure to advanced words on the WORDSUM test accounts for the intercohort decline in verbal ability among post-1950 cohorts. These conclusions are most clearly seen in panels B and C of Figure 4, where we graph cohort test score trends for advanced and general vocabulary. Once we accounted for the substantial increase in schooling during the first half of the century, the cohort trend in advanced verbal ability was effectively flat for the entire period from 1900 to 1988. The effect of differential exposure to vocabulary words was even more pronounced on general verbal ability, which we report in panel C to facilitate comparisons with prior research based on a single factor. In contradiction to prior research (see Figure 1), a model that accounts for education selection and cohort differences in exposure to words provides support for the progress model (Flynn Effect).
By way of illustration, consider the cohort-specific test scores reported in Figure 4, panel C. The pre-1900 cohorts scored an average of 71 percent on the WORDSUM test (µ = 7.1), compared to an average score of 58 percent for the 1976-1980 cohorts (µ = 5.8) in the baseline models that control for age and time in school. After adjusting for education selection and word exposure, the test score gap between the earliest and latest born cohorts not only decreased by 13 percentage points (7.1-5.8) but reversed direction to one favoring the later-born cohorts by three percentage points (5.9-6.2), for a gross change of 16 percentage points. These data suggest that if recently born cohorts had been exposed to the test content at levels equivalent to those of prewar cohorts, their test scores would be higher than those of early-born cohorts. In other words, the counterfactual results we present here suggest that the generational rise in crystallized intelligence has not yet abated in America. When properly adjusted for cohort differences in word exposure, the GSS data indicate that the verbal ability of today's young people is not in decline. Rather, they have simply had less exposure to "old" words. Flynn (2012) suggested that one possibility for generational differences on verbal ability tests is a growing gap in the active vocabulary of old and young people. Our results provide tacit empirical support for these claims.
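The percentage-point arithmetic in this illustration can be checked directly. The sketch below assumes that the parenthetical "5.9-6.2" refers to the adjusted means of the earliest and latest cohorts, respectively; that reading, and the variable names, are ours rather than something stated explicitly in the text.

```python
# Mean WORDSUM scores (0-10 scale) as given in the text.
baseline_early, baseline_late = 7.1, 5.8
# Assumed reading of "(5.9-6.2)": adjusted means for earliest and latest cohorts.
adjusted_early, adjusted_late = 5.9, 6.2


def gap_in_points(early, late):
    """Early-minus-late gap expressed in percentage points on a 10-item test."""
    return round((early - late) * 10, 1)


baseline_gap = gap_in_points(baseline_early, baseline_late)
adjusted_gap = gap_in_points(adjusted_early, adjusted_late)

# Total swing: closing the baseline gap plus the reversal beyond zero.
gross_change = round(baseline_gap - adjusted_gap, 1)
```

A positive gap favors the earliest cohorts and a negative gap favors the latest cohorts, so the gross change is the sum of the 13 points eliminated and the 3 points of reversal.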

Discussion and Conclusion
We began this article by drawing attention to a longstanding puzzle in the literature regarding national trends in verbal ability, which indicated a counterintuitive pattern of declining scores on the GSS vocabulary test beginning with cohorts born after WWII. The trends in the GSS vocabulary test score data are not unique, in that they parallel test score declines observed on other standardized tests during the 1960s and 1970s. However, they do not follow the pattern documented in other tests of cognitive ability (Flynn 2007, 2012), namely monotonically rising test scores for each new birth cohort. The Flynn Effect is observed in the raw WORDSUM scores for cohorts born between 1885 and 1950 but not for postwar birth cohorts.
The GSS data provide evidence that runs counter to expectations based on a general consensus in the literature that not only is the Flynn Effect real but that it can be interpreted as reflecting an increasing aptitude for cognitive skills being taught in schools and utilized in a growing number of cognitively demanding occupations (Flynn 2007, 2012).
A thorough review of all of the interpretations that have been advanced for these cohort differences was beyond the scope of this article. Instead, we have focused on the possibility that the intercohort WORDSUM trends derive from changing exposure to words and to cohort-specific education selection. Our analysis shows that one of the major mechanisms for the transmission of knowledge is formal schooling, and there is a strong relationship between time spent in school and vocabulary knowledge (Baker et al. 2015). However, the education payoff for vocabulary knowledge in later-born cohorts is not reflected in the words included in common tests of verbal ability, due in part to their declining usage in contemporary books.
Our analysis considered the role of education selection. That education experiences are so intimately tied to what we mean by intellectual ability, certainly in the sense of crystallized abilities, makes it virtually impossible to talk about cognitive abilities without taking education into account. Observed test score differences among age groups must therefore also account for age differences in education. Decades ago, Lorge (1956:133) suggested that because "schooling makes for a difference in intelligence test scores ... it must follow that the older [earlier born] members of the population are at some disadvantage not only because of their remoteness from formal schooling but also because they had less of it." More recently, Bandura (1989:734) argued that a major part of age (cohort) differences in test scores "seems to be due to differences in educational experiences across generations rather than to biological aging. It is not so much that the old have declined in intelligence but that the young have enjoyed the benefit of richer intellectual experiences enabling them to function at a higher level." Our examination of the selection phenomenon argued that the schooling experiences of members of different cohorts measured in the GSS samples do not reflect intertemporal homogeneity. What we found was somewhat counterintuitive because cohorts experiencing the highest levels of schooling had the lowest vocabulary test scores (i.e., mean cohort education has a negative effect in our models). So, while the meaning of "years of schooling" may have changed over the twentieth century, the evidence from the GSS data cited earlier suggests education has diminishing returns among more recent cohorts, at least with respect to vocabulary knowledge. The education selection process can be used to account for intercohort differences among prewar birth cohorts but not those cohorts born later.
The research presented here points to a cohort advantage on tests using words that were once relatively common but then fell out of fashion (Alwin 1991). We found strong support for the exposure hypothesis, whereby declining frequency of WORDSUM items in published texts accounts for virtually all of the postwar cohort decline in vocabulary ability. The measure of word exposure introduced in this research is needed to account for the declines in GSS test scores among those post-WWII birth cohorts. For this reason, we recommend that researchers using WORDSUM adjust for differential cohort exposure to the test items.
Our model results provide support for environmental causes of the intercohort differences in verbal ability, including the word exposure hypothesis originally formulated by Alwin (1991) and the education selection argument made by Huang (2003). These results provide a more compelling and theoretically grounded explanation for intercohort variation in vocabulary knowledge than previous explanations that have looked to age and period factors (e.g., Wilson and Gove 1999; Yang and Land 2006, 2008).
Birth cohorts of the twentieth century differ from one another in several key ways, but just two factors, exposure to words and exposure to formal schooling, account for virtually all of the observed between-cohort differences in verbal ability on the GSS test. Our results also suggest that there may be different causal processes at different periods in the macrosocial trend (i.e., different explanations at different historical periods). A promising line of future research would be to replicate our modelling strategy among the populations of advanced economies to see if recent declines in test scores are due to changes in exposure and selection that we observed in American samples (for a summary of recent findings, see Dutton and Lynn 2013).

Notes
1 We remind readers that the relationship between vocabulary test scores and intelligence only holds within cohorts.
2 The GSS protocol for administering the vocabulary test is as follows: We would like to know something about how people go about guessing words they do not know. On this card are listed some words; you may know some of them, and you may not know quite a few of them. The card handed to the respondent included instructions as follows: Please look first at the word in capital letters on each line. Then look at the other words in smaller type on the same line and tell me which one of these words comes closest in meaning to the one in capital letters. EXAMPLE BEAST 1. Afraid 2. Words 3. Large 4. Animal 5. Bird The correct answer in this example is No. 4 because the word "animal" comes closer to "beast" than any of the other words. On each line, the first word is in capital letters, like BEAST. Then, there are five other words.
3 The Ngram database contains all words appearing at least 40 times in the corpus. To date, Google has supplied five searchable American English databases (one-gram, two-gram, three-gram, four-gram, and five-gram databases). The one-gram database contains single words (e.g., fish), the two-gram database contains combinations of two words that were adjacent to each other in the scanned books (e.g., catching fish), the three-gram database contains combinations of three adjacent words (likes catching fish), and so on.
4 Many books were provided to the Google Books project directly in digital format, thus eliminating the scanning and optical character recognition stage of data collection. Post-2000 books skew heavily toward fictional texts, which have substantially different readership among the general public (Michel et al. 2011).
5 An appropriate test of the word exposure hypothesis calls for a measure of exposure that accounts for change in WORD usage relative to all other words in the database rather than a simple change in the absolute occurrence of each WORD. Since WORDSUM weights each word equally, our summary measure of exposure needed to do the same. We accomplished this through a standardization of each of the 60 WORD scores. This is important because overall usage of each WORD varied considerably, with some words being very common and others obscure (in some cases differing by two or more orders of magnitude). In a composite scale, using unstandardized frequencies would allow a small number of high-frequency words to overwhelm the effects of less frequently used words.
6 Multiple choice tests such as the one employed in the GSS contain an inherent dependency among the target word and answer choices. That is, a respondent who knows the meaning of the incorrect choices might ascertain the correct meaning of the target word by process of elimination, even if they have not been exposed to the target word itself.
Table 2: Regression coefficients expressing the effect of word exposure on basic, advanced, and general vocabulary
sociological science | www.sociologicalscience.com | June 2016 | Volume 3

Figure 4: Adjusted cohort test scores in basic, advanced, and general vocabulary knowledge. Note: Baseline model includes controls for respondent age and education. Selection includes an additional control for mean cohort education levels. Exposure also controls for cohort-specific exposure to test items. Trend line point estimates are derived from Table 2.
might explain the intercohort decline in the GSS data as well as more general test score results.
Because much of what we know about over-time, aggregate changes in performance on cognitive tests is based on studies of student populations, measurements of cognitive abilities in nationally representative samples offer researchers unique opportunities to test social and demographic propositions regarding human abilities.

Table 1: Descriptive Statistics (N = 22,684)
Note: Basic vocabulary knowledge measured using six WORDSUM items. Advanced vocabulary knowledge measured using four WORDSUM items. General vocabulary knowledge measured using all 10 WORDSUM items. Results are sample weighted. Models estimated in Stata 13.

2 (continued) Tell me the number of the word that comes closest to the meaning of the word in capital letters. For example, if the word in capital letters is BEAST, you would say "4" since "animal" comes closer to BEAST than any of the other words. If you wish, I will read the words to you. These words are difficult for almost everyone; just give me your best guess if you are not sure of the answer.