Structure of online dating markets in US cities

We study the structure of heterosexual dating markets in the United States through an analysis of the interactions of several million users of a large online dating web site, applying recently developed network analysis methods to the pattern of messages exchanged among users. Our analysis shows that the strongest driver of romantic interaction at the national level is simple geographic proximity, but at the local level other demographic factors come into play. We find that dating markets in each city are partitioned into submarkets along lines of age and ethnicity. Sex ratio varies widely between submarkets, with younger submarkets having more men and fewer women than older ones. There is also a noticeable tendency for minorities, especially women, to be younger than the average in older submarkets, and our analysis reveals how this kind of racial stratification arises through the messaging decisions of both men and women. Our study illustrates how network techniques applied to online interactions can reveal the aggregate effects of individual behavior on social structure.


I. INTRODUCTION
Patterns of romantic and sexual partnerships (who pairs with whom) have broad implications for health and society. For example, the level of assortative mating (the extent to which like pairs with like) has long been considered an indicator of societal openness [1,2]. Mating patterns also determine how wealth and resources are passed from one generation to another, and hence persistence or change in inequality over time [3,4], have implications for mental and physical health [5,6], and shape the sexual networks that drive the spread of sexually transmitted infections [7,8].
There exists an extensive empirical and theoretical literature exploring the mechanisms behind patterns of romantic pairing [9,10]. In societies where people choose their own mates, it is widely accepted that romantic pairing is driven by the interplay between individuals' preferences for partners and the composition of the pool of potential mates [11][12][13]. The process can be modeled game theoretically as a market in which individuals aim to find the best match they can subject to the preferences of others [14,15]. There is also a large body of empirical work that documents the relationship between observed partnering patterns and the supply of partners, as reflected in the population composition of cities, regions, or countries [16][17][18][19][20][21][22][23][24][25].
These studies are limited, however, in what they can reveal about the structure of dating or marriage markets. One issue is that we typically do not have access to the actual population of available dating partners and must instead make do with proxies such as census data, obliging us to treat entire towns or cities as a single undifferentiated market. A more fundamental problem is that previous studies have only looked at extant partnerships, and not the larger set of all courtship interactions among mate-seeking individuals. In order to properly study dating markets, one needs data on all courtship overtures that occur within the focal population, not only those that are successful and result in a partnership but also those that are rejected. As we show in this paper, the complete set of such overtures forms a connected network whose structure can be analyzed to reveal key features of romantic markets.
Unfortunately, complete data on courtship interactions have been historically hard to come by because unrequited overtures are rarely documented. The few empirical studies that have directly observed courtship patterns have tended to focus narrowly on specific institutions, subpopulations, or geographic locations [26,27], and relatively little is known about the empirical structure of romantic and sexual markets across the general population or how this structure varies from one locale to another.
The advent of online dating and its spectacular rise in popularity over the last two decades has, however, created a new opportunity to study courtship behaviors in unprecedented detail [28]. Here we report on a quantitative study of the structure of adult romantic relationship markets in the United States using nationwide data on online dating users and their behaviors. We combine activity data for millions of participants with recently developed network analysis methods to shed light on the features of relationship markets at the largest scales. There have been recent studies using early-stage patterns of online mate choice (who browses, contacts, or responds to whom) to shed light on individuals' preferences for mates [29][30][31][32], but the work presented here goes beyond these studies to examine how individuals' choices aggregate collectively to create structured relationship markets that strongly influence individuals' dating experiences.

II. RESULTS
arXiv:1904.01050v2 [cs.SI] 3 Apr 2019

The data we analyze come from a popular US online dating web site with over 4 million active users at the time of our study. The data are described in detail in Section IV A and Appendix A. Our analysis focuses on all (self-identified) heterosexual, single men and women who sent or received at least one message on the site during the period of observation, January 1 to 31, 2014, and who indicated that they were pursuing some form of romantic relationship (long-term dating, short-term dating, and/or sex). For each user we have a range of self-reported personal characteristics along with timestamped records of all messages exchanged on the site. It is the latter that are the primary focus of our analysis, since it is the messaging patterns that reveal the aggregate demand for individuals within the market.
We quantify messaging patterns using methods of network analysis [33]. We examine the set of all reciprocal interactions between opposite-sex users, meaning pairs of individuals such that at least one message was sent in each direction between the pair. Reciprocal interactions we take to be a signal of a baseline level of mutual interest between potential dating partners. Our primary focus is on understanding the division of the online dating population into distinct submarkets: how does the market divide into subpopulations of daters and how are those subpopulations characterized? We define submarkets as roughly self-contained groups of individuals within the network such that most reciprocal exchange of messages occurs within groups. This corresponds closely to the established concept of "community structure" in network theory, a community in this context being a tightly knit subgroup of individuals within a larger network. A number of sensitive techniques for the detection of network communities have been developed in recent years [34], and we employ a selection of those techniques here. Technical details of the algorithmic methods used in our calculations are given in Section IV B and Appendix B.
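The definition of a reciprocal interaction given above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the message log is taken to be a list of (sender, recipient) pairs, and the function name and toy user identifiers are hypothetical, not part of the original analysis.

```python
from typing import Iterable, Set, Tuple


def reciprocal_pairs(messages: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Return unordered user pairs with at least one message in each direction.

    `messages` is an assumed format: one (sender, recipient) tuple per message.
    """
    directed = set(messages)  # repeated messages between the same ordered pair collapse
    pairs = set()
    for sender, recipient in directed:
        if sender != recipient and (recipient, sender) in directed:
            # Store each pair once, in a canonical (sorted) order.
            pairs.add(tuple(sorted((sender, recipient))))
    return pairs


# Toy message log (hypothetical identifiers): m1<->w1 and m2<->w2 are reciprocal,
# while m1->w2 is an unanswered overture and is therefore excluded.
messages = [("m1", "w1"), ("w1", "m1"), ("m1", "w2"),
            ("m2", "w2"), ("w2", "m2"), ("m2", "w2")]
pairs = reciprocal_pairs(messages)
```

Each resulting pair corresponds to one undirected edge in the network of two-way exchanges.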
A. Dating markets are divided into distinct geographic regions
For our first analysis, we examine community structure within the entire data set of all users of the web site during the month of observation. A total of 15 302 512 reciprocal interactions took place during this period. We aggregate these interactions at the level of 3-digit ZIP codes (geographic regions used by the US Post Office) and count the number of interactions that take place between every pair of 3-digit ZIPs. For instance, there were 75 686 reciprocal interactions between individuals in Manhattan and individuals in neighboring Brooklyn, but only 2170 interactions between individuals in Manhattan and individuals in far-away San Francisco.
The result of this aggregation is a weighted network in which the nodes represent 3-digit ZIP code regions and the weighted edges represent the number of interactions. We take this network and perform a standard community detection analysis on it using the modularity maximization method (see Section IV B and Appendix B for details). The results for the lower 48 states are shown in map form in Fig. 1.
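As a rough illustration of this aggregation step, the following Python sketch builds the weighted network from a list of reciprocal pairs. The input format, the `zip_of` lookup, and the toy users are assumptions made for the example. Keeping same-region pairs as self-loops is also an assumption, since the text does not say how such pairs were handled.

```python
from collections import Counter


def zip3_network(pairs, zip_of):
    """Count reciprocal interactions between 3-digit ZIP regions.

    pairs  : iterable of (user_a, user_b) reciprocal pairs (assumed format)
    zip_of : dict mapping user -> 5-digit ZIP code string (assumed format)
    Returns a Counter keyed by sorted (zip3_a, zip3_b) tuples.
    """
    weights = Counter()
    for a, b in pairs:
        za, zb = zip_of[a][:3], zip_of[b][:3]  # first three digits define the region
        weights[tuple(sorted((za, zb)))] += 1  # undirected edge, canonical order
    return weights


# Toy data: two Manhattan-Brooklyn pairs and one Manhattan-San Francisco pair.
zip_of = {"u1": "10001", "u2": "11201", "u3": "10002", "u4": "11215", "u5": "94103"}
pairs = [("u1", "u2"), ("u3", "u4"), ("u1", "u5")]
weights = zip3_network(pairs, zip_of)
```

The resulting edge weights play the role of the interaction counts quoted above for Manhattan, Brooklyn, and San Francisco.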
As the figure shows, the communities found in this nationwide network of messaging are tightly geographically circumscribed. Many of them appear to correspond to commonly accepted geographic divisions of the country: New England, the East Coast, the South, Texas, the Mountain West, North and South California, and so forth. In essence the analysis says that most people are interested in others who are in their own region of the country, which is reasonable. Few people living in New York will exchange messages with people across the country in California if the primary goal is to arrange a face-to-face meeting with a possible romantic partner [35]. This finding is consistent with recent work looking at friendship communities using Facebook data, which finds that incidence of friendship is strongly decreasing with geographic distance [36,37].
Community structure in the broad, nationwide network of messaging thus appears to be dominated by geographic effects. Since our primary goal here is to observe and analyze more subtle demographic effects within dating markets, we need to factor out the gross influence of geography. Our approach for doing this is a simple one: we focus on subnetworks within individual cities. We choose cities as our basic unit of analysis because they are large enough to provide a population of significant size, yet small enough that travel distance between individuals will not be a deterrent to interaction. In the remainder of this paper we perform a series of analyses on subsets of the data corresponding to four large cities: New York, Boston, Chicago, and Seattle. We define cities using the standard Core-based Statistical Areas (CBSAs) for the corresponding metropolitan regions, except for New York, where the CBSA is large enough that there are clearly separate dating markets within it. For New York we therefore define our area of study more narrowly to be the five boroughs of Manhattan, the Bronx, Queens, Brooklyn, and Staten Island.
B. Dating markets are demographically stratified within cities
Community structure at the city level is more complex than the simple geographic effects we saw in Fig. 1. Specifically, it displays a mix of so-called assortative and disassortative mixing [38]. For the heterosexual dating communities studied here it is disassortative by gender, meaning most messages are between individuals of opposite sex, but assortative by various other characteristics, as we will see. It is the latter behavior on which we primarily focus, but our community detection calculations need to be sensitive to both in order to fully reveal the structure of the market. Here we make use of a powerful and flexible community detection method based on maximum-likelihood techniques, the expectation-maximization (EM) algorithm, and belief propagation [39,40], which can sensitively and rapidly detect complex forms of structure in large networks. For details of the method see Appendix B.
Focusing again on networks of two-way message exchanges, we present in the following analyses the results of community divisions of each city network into four separate communities or submarkets (or eight if you count men and women separately). We find that about 75% of all reciprocal interactions in our four cities are between individuals within the same submarket, indicating that the communities align well with the conventional definition: tightly-knit groups with most interaction going on within groups. The choice to divide into four submarkets is to some extent arbitrary. We have repeated the analysis for other numbers of submarkets and find essentially similar patterns to those reported here (see Appendix B). The choice of four submarkets offers a good compromise between resolution of finer details and adequate statistical power within submarkets.
Figure 2 shows a variety of demographic features of the submarkets in the four cities. The most obvious defining feature of the submarkets is the age of their members, shown in Fig. 2A. The youngest submarket, numbered 1 in each city, corresponds primarily to individuals in their lower 20s, while submarkets 2 to 4 correspond respectively to upper 20s, 30s, and 40s and above. This pattern is consistent, with only minor variation, across the four cities. As the figure shows, there is a small but systematic difference in age between men and women across all submarkets: in every case the men are older than the women, with a median age difference of 1 year and 7 months.
However, submarkets are not characterized by age alone. As Fig. 2B shows, they also differ in male-to-female ratio, and here we see another consistent pattern: the younger submarkets tend to be male-heavy but the mix becomes progressively more female-heavy in the older submarkets. There are a number of factors that may drive this pattern. Women's first marriages are at a younger age on average than men's [41, 42], which takes more women than men out of younger dating markets. Furthermore, since partnering of younger women with older men is more common than the reverse [29,43], some older men may seek out younger partners, swelling the ranks of men in the younger submarkets. Conversely, some younger women may leave the youngest submarkets in search of older partners, depleting the supply of women. (This would also help explain the higher average age of men in each submarket.) The same behaviors also reduce the number of men in the older submarkets and increase the number of women. Depending on the overall population balance of the city, the end result can be a severe distortion of the sex ratio at the oldest or youngest ages. The youngest submarkets in Chicago and Seattle, for example, have almost two men for every woman.
A further facet of the submarket structure, one that affects predominantly women, comes to light when we look at the balance of ethnicities. Figure 2C shows the mean age of minority women in each submarket broken down by ethnicity and measured, in this case, relative to the mean age of white women in the same submarket. The plot demonstrates a systematic tendency for minority women to be younger than their white counterparts within the same submarket. The effect is small in the younger submarkets but becomes more pronounced in the older ones. This is partly due to the fact that there are fewer black women than white women among the oldest users of the site (see Appendix C, Fig. 7), but these compositional effects are not large enough to account for the pronounced age difference seen in Fig. 2C. Studies of mate preferences of online daters have shown that black women are on average viewed by heterosexual men as less desirable partners than nonblack women [29,30,44,45], and the behavior seen in Fig. 2C may reflect the aggregate outcome of such preferences at the submarket level. In Chicago's oldest submarket, for instance, black women are more than eight years younger on average than white women, suggesting that men in that submarket are exchanging messages with black women who are substantially younger than the white women they exchange messages with [46].
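The within-submarket share of reciprocal interactions quoted earlier in this subsection (about 75%) is a simple statistic once each user has a submarket label. The Python sketch below shows one way it could be computed. The data layout, function name, and toy labels are assumptions for illustration and do not reproduce the original pipeline.

```python
def within_submarket_fraction(pairs, submarket):
    """Fraction of reciprocal pairs whose two members share a submarket label.

    pairs     : list of (user_a, user_b) reciprocal interactions (assumed format)
    submarket : dict mapping user -> submarket number (assumed format)
    """
    if not pairs:
        raise ValueError("no reciprocal interactions supplied")
    same = sum(1 for a, b in pairs if submarket[a] == submarket[b])
    return same / len(pairs)


# Toy example: three of the four pairs stay inside one submarket.
submarket = {"m1": 1, "w1": 1, "m2": 2, "w2": 2, "m3": 3, "w3": 3}
pairs = [("m1", "w1"), ("m2", "w2"), ("m3", "w3"), ("m3", "w2")]
fraction = within_submarket_fraction(pairs, submarket)
```

A value close to 1 indicates nearly self-contained submarkets, while a value near the share expected under random mixing would indicate no meaningful division.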

C. Dating markets reflect the aggregated choices of individuals
Next we examine how the choices of men and women about whom to message differ across submarkets, and by gender. Since men send more than 80% of first messages on the site, we focus on men's first messages and women's replies. Figure 3 shows the difference between the age of men and the women they message, by submarket and race, in Chicago and New York, in the form of "heat maps." (Similar results for Boston and Seattle are shown in Appendix C, Fig. 8.) The rows labeled "1st messages" show age difference in first messages, and the rows labeled "Replies" show the age difference in replies, with brighter colors corresponding to larger age differences. We see that in both Chicago and New York the age differences between men and the women they message are approximately two to three times larger in the oldest submarket than in the youngest. This is consistent with previous work showing that men's preferences for partners become more pronounced as they age [47]. Figure 3 also sheds light on the behavioral mechanisms driving the racial stratification patterns we observed in Fig. 2C. The top two rows of the figure for Chicago reveal that white men in older submarkets pursue minority women who are on average two or more years younger than the white women they message. This is especially pronounced in submarket 4, where the average age gap between white men and the minority women they write to is around five to six years, compared to two years for white women. However, minority women tend not to reciprocate overtures from older white men, which is why the age gap in replies among minority and white women is not as pronounced. The one exception is for black women in Chicago: the average age gap in messages between these women and the white men they respond to is around 5.8 years. Thus it is both how men pick the women they message and also how women reply that drives the racial stratification we saw in Fig. 2C.
In New York the messaging patterns look somewhat different from Chicago because New York men, despite being of similar age to their Chicago counterparts, pursue younger women on average. Black men in the oldest New York submarket write to women who are on average 4.5 years younger than they are, while for white men the corresponding figure is 6.2 years. And while older white men in New York message younger black and Asian women than white women, the differences are slight: women of all races in New York's submarket 4 are being pursued at younger ages, so the racial difference is more attenuated. In other words, it's not that black women in New York's oldest submarket receive messages from younger men than black women in Chicago's oldest submarket (i.e., men closer to their own age), but that white women in New York's oldest submarket receive messages from older men than white women in Chicago's oldest submarket. Overall, we see that men and women's choices about who to message and respond to shape submarket structure differently in the two cities.
FIG. 3: Mean difference in years between the age of men of varying races in Chicago and New York (vertical axis) and the women they message, by race of women and by submarket (horizontal axis). Race is coded as A = Asian, B = black, H = Hispanic, and W = white. The first two rows show the average age difference for, respectively, all initial messages sent by men in Chicago and those that received a reply. The bottom two rows show the same patterns for New York. In both cities the age gap between men and their potential mates increases (lighter colors) as we move from younger to older submarkets. In addition, we see that black and white men in the oldest New York submarket pursue younger women, on average, than black and white men in the oldest Chicago submarket. However, unlike in Chicago, only Asian women are pursued by older black men in New York at substantially younger ages than their non-Asian counterparts. White men in the oldest submarket pursue both Asian and black women at younger ages, compared to Hispanic and white women.

Additional features of interest in the submarket structure are revealed by an examination of messaging patterns within and between submarkets. For this analysis we focus on initial contacts between individuals and on whether those contacts receive a reply. Across all submarkets and cities, we find that 57% of first contacts are between users in the same submarket. The remaining 43% are between users in different submarkets and the pattern of within- and between-group messages, depicted in Fig. 4, shows a number of interesting regularities. The first and third rows of the figure show data for initial contacts made by men and women respectively. The bright squares down the diagonal of each matrix represent the large fraction of within-group contacts. The darker squares off the diagonal show that users are sending a modest number of messages to the submarkets immediately older and younger than their own, but very few messages to submarkets two or more steps away. One deviation from this pattern is visible in the messages sent by men in submarket 3 (the 30-somethings). Across all four of our cities, this group is the only one whose members send a majority of their messages to women in different submarkets from their own, the largest number going to women in the next youngest submarket, submarket 2 (mid-to-late 20s). The second and fourth rows of Fig. 4 give the fraction of first messages that receive a reply, establishing a possible reciprocal interest between the individuals in question. Women's replies to messages sent by men (second row) occur at a substantially lower rate than men's replies to women (fourth row), which is likely a volume effect: since women receive four times as many first messages as men, they can afford to be more selective in their replies. Again, across all cities and among both men and women, reply rates are highest within submarkets.
Women receive replies more often when initiating contact with men in older submarkets compared to younger ones (which is consistent with prior studies), although there are some exceptions. Notice for instance that in all cities women in the oldest submarket (submarket 4) are, surprisingly, more likely to receive a reply from men in the youngest submarket (submarket 1) than in the second youngest (submarket 2).
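The within- and between-submarket messaging matrices discussed in this subsection can be tabulated directly from first-contact records. The Python sketch below is a hedged illustration: the record format (sender, recipient, replied), the function name, and the choice to normalize each row by the sender's submarket are all assumptions, since the exact normalization used for Fig. 4 is not stated in the text.

```python
def contact_matrices(first_messages, submarket, k):
    """Tabulate first contacts and reply rates between k submarkets.

    first_messages : iterable of (sender, recipient, got_reply) records (assumed format)
    submarket      : dict mapping user -> submarket number in 1..k (assumed format)
    Returns (share, reply_rate): share[i][j] is the fraction of first messages sent
    from submarket i+1 that go to submarket j+1; reply_rate[i][j] is the fraction
    of those messages that received a reply. Empty cells are None.
    """
    counts = [[0] * k for _ in range(k)]
    replied = [[0] * k for _ in range(k)]
    for sender, recipient, got_reply in first_messages:
        i, j = submarket[sender] - 1, submarket[recipient] - 1
        counts[i][j] += 1
        replied[i][j] += int(got_reply)
    share, reply_rate = [], []
    for i in range(k):
        row_total = sum(counts[i])
        share.append([c / row_total if row_total else None for c in counts[i]])
        reply_rate.append([replied[i][j] / counts[i][j] if counts[i][j] else None
                           for j in range(k)])
    return share, reply_rate


# Toy records for men's first messages (hypothetical users, three submarkets).
submarket = {"m1": 1, "m3": 3, "w1": 1, "w2": 2, "w3": 3}
first_messages = [("m1", "w1", True), ("m1", "w2", False),
                  ("m3", "w2", True), ("m3", "w2", False), ("m3", "w3", True)]
share, reply_rate = contact_matrices(first_messages, submarket, k=3)
```

In this toy example the men in submarket 3 send most of their first messages to submarket 2, mimicking the deviation from the diagonal pattern described above.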

III. DISCUSSION
The experience of mate selection is frequently described, both in popular discourse and in the scientific literature, in the language of markets: an individual's goal is to secure the best possible mate for themselves in the face of competition from others. However, we know little about the structure of these romantic markets in part for lack of appropriately detailed data. The advent and vigorous growth of the online dating industry in the last two decades provides a new source of data about courtship interactions on an unprecedented scale.
In this study, we have provided a first look at how network analysis techniques can reveal the structure of US dating markets as evidenced by interactions on a popular online dating website. Across the US as a whole, we find that geography is the defining feature of national dating markets. Within cities, submarkets are defined by age as well as other demographic factors, most notably race. We find that submarket structure is shaped by both first messaging patterns and replies. Three-quarters of all reciprocated messages fall within submarkets and only a quarter between individuals in different submarkets. A larger fraction, about 43%, of all first messages are between different submarkets, which indicates that people do attempt to contact partners outside of their submarkets, but that those attempts are often unsuccessful. Overall, our results reveal the aggregate implications of individuals' mate choices, and suggest that metropolitan areas are best characterized as a collection of geographically integrated but demographically distinct submarkets.
More generally, our study illustrates how state-of-the-art network science techniques can be applied to rich data from online interactions or administrative records to reveal subtle features of social structure. In recent years the growing availability of search data from online sources has led to interest in how individuals' choices reveal submarkets in other social domains [48, 49]. As we have shown in the dating context, market outcomes reflect the choices made by actors on both sides (e.g., men and women in heterosexual dating markets, workers and firms in job markets). Our approach could straightforwardly be extended to look at structural features of housing or job markets, and we view this as a fruitful direction for future work.

A. Data
The data used as the starting point for our study come from one of the largest free dating sites in the United States and were collected in July 2014. The site does not market itself to any particular demographic group and attracts a diverse population of users whose makeup, in most locales, corresponds loosely to that of the general population. The site is known for its user-driven matching algorithm, which reduces the effect of site interference on users' mate choice behavior. The population of users is concentrated in coastal areas, although there are significant numbers of users in major Midwestern cities such as Chicago. We restrict our analysis to active users, which we define to mean that they sent or received at least one message on the site during the observation period, which was January 1 to 31, 2014. This eliminates a significant number of users who sign up and use the site but then become inactive, or who sign up and never use it. We also remove from the data all users who identify as gay or bisexual (about 14% of the overall user base of the site) and those who indicate that they are not looking for romantic relationships. (People can indicate, for example, that they are only looking for friendship or activity partners.) Further description of the data is given in Appendix A.

B. Community detection
The primary technical tool employed in our analysis is community detection [34], which takes a network of nodes and the connections, or edges, between them (users and messages in the present context) and divides it into tightly knit groups such that most edges fall within groups and few fall between. The most widely used method for community detection is modularity maximization [34, 50], which makes use of the standard quality function known as modularity [51]. This function, defined as the fraction of edges within groups minus the expected fraction of such edges if edges are placed at random, is large and positive for divisions of a network into good communities and small for poor divisions. Modularity maximization finds good communities by looking for the division with the largest modularity score. In our analysis of the complete, nationwide network of messages between active users, Fig. 1, we make use of modularity maximization on the weighted network of conversations between users in different 3-digit ZIP codes. There are a range of practical methods for performing the maximization itself. In our calculations, we use the Louvain algorithm of Blondel et al. [52], which is an iterative greedy algorithm that has been shown to give high-quality results with short run times [53]. We use the implementation from the Gephi network analysis package, with resolution parameter equal to 0.65, which results in the 19-community division shown in Fig. 1.
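To make the definition of modularity given above concrete, the following Python sketch evaluates the quality function for a given division of a weighted, undirected network. It is not the Louvain algorithm and not the Gephi implementation used for Fig. 1 (in particular it has no resolution parameter). The function name, the edge-dictionary format, and the toy network are assumptions for illustration. The random-placement expectation is taken to be the usual one in which edges are placed at random while node strengths are preserved.

```python
def modularity(weights, community):
    """Modularity of a division of a weighted undirected network.

    weights   : dict {(u, v): w}, each undirected edge listed once (assumed format)
    community : dict mapping node -> community label (assumed format)
    Computes, summed over communities, the fraction of edge weight inside the
    community minus the fraction expected if edges were placed at random with
    node strengths (weighted degrees) held fixed.
    """
    total = sum(weights.values())
    within = {}    # edge weight falling inside each community
    strength = {}  # sum of node strengths in each community
    for (u, v), w in weights.items():
        cu, cv = community[u], community[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            within[cu] = within.get(cu, 0.0) + w
    return sum(within.get(c, 0.0) / total - (s / (2 * total)) ** 2
               for c, s in strength.items())


# Toy network: two triangles joined by a single edge, all weights equal to 1.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
weights = {edge: 1.0 for edge in edges}
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}
q_two = modularity(weights, two_groups)
q_one = modularity(weights, one_group)
```

Splitting the toy network into its two triangles scores higher than placing all nodes in a single group, which is the behavior a maximization procedure exploits.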
Though it is fast and gives good results, modularity maximization is not ideal for the community analysis of our individual city networks. This is because, as discussed in Section II B, these networks contain both assortative and disassortative structure. Modularity maximization is normally capable only of detecting assortative structure. For this part of our analysis, therefore, we use an alternative community detection method based on maximum-likelihood fitting of a generative, community-structured network model, the degree-corrected stochastic block model [39]. In this approach one defines a model that generates networks with community structure, then fits that model to the observed network. The parameters of the best fit tell us which nodes of the network belong to which communities. More specifically they give us the posterior probability that each node belongs to each community; in the final stage of the calculation we assign every node to the community for which it has highest probability of membership. The fitting itself is performed using an EM algorithm, with the E-step carried out using belief propagation [40]. Technical details are given in Appendix B. Code is available upon request from the authors.
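The final assignment step described above, in which each node is placed in the community for which it has the highest posterior probability, amounts to a simple arg-max. The Python sketch below illustrates only that last step. The posterior values are invented toy numbers, and the function name and data layout are hypothetical. The EM and belief propagation fitting that would produce such posteriors is not shown.

```python
def assign_communities(posterior):
    """Assign each node to its most probable community.

    posterior : dict mapping node -> list of membership probabilities,
                one entry per community (assumed format)
    Returns a dict mapping node -> community number, counted from 1.
    Ties are broken in favor of the lower-numbered community.
    """
    assignment = {}
    for node, probs in posterior.items():
        best = max(range(len(probs)), key=probs.__getitem__)
        assignment[node] = best + 1
    return assignment


# Toy posteriors over four submarkets (illustrative values only).
posterior = {
    "u1": [0.70, 0.20, 0.05, 0.05],
    "u2": [0.10, 0.15, 0.60, 0.15],
    "u3": [0.25, 0.25, 0.25, 0.25],
}
assignment = assign_communities(posterior)
```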

Appendix A: Data
Our data come from a popular, free online dating site. New users of the site begin by creating a profile, which includes various socio-demographic information, and they can also answer a set of open-ended essay questions that ask them to describe who they are and what they are looking for. The only information a user is required to give is their login handle, age, sexual orientation, relationship status, and a 5-digit ZIP code identifying their location. After creating a profile, users can then view the profiles of others, as well as send and receive messages. Unlike other dating sites that are largely driven by a matching algorithm, this site allows users to pursue mates relatively freely according to their own preferences.

Metropolitan areas
Our city-level results are based on data from four metropolitan areas: New York, Boston, Chicago, and Seattle. In the case of Boston, Chicago, and Seattle, we find a good choice of boundaries to be the standard Core Based Statistical Areas (CBSAs) established by the Office of Management and Budget [54]. For New York, however, the data clearly indicate multiple geographic dating markets within the larger metro area. Instead, therefore, we choose a narrower set of geographic boundaries for New York, the five boroughs of Manhattan, Brooklyn, Queens, the Bronx, and Staten Island.

TABLE I (excerpt, ethnicity rows only; column headers not recovered):
Asian       8  11   4   6   3   4   7   9
Black       9   9   6   6   7   9   4   3
Hispanic   10   8   3   3   8   7   3   3
White      73  73  87  85  81  80  87

Summary statistics
Table I provides summary statistics of users in each of the four cities, broken out by gender. As discussed in Section II B, the cities vary in the ratio of men to women on the web site, New York having the largest fraction of women, followed by Boston, Chicago, and Seattle, in that order. Recall that in Fig. 2B we found the older submarkets to be more female-heavy, while the younger submarkets tended to be male-heavy. Examination of the age distribution of men and women in each city [55] suggests that this is not merely a result of age-specific sex ratios in the overall user population. New York, for instance, has a surplus of women, which is most pronounced among younger users in their mid twenties, yet the submarkets for younger users still have significantly more men than women. (The remaining cities all have an overall surplus of men, which is most pronounced in the later 20s and early 30s.) These observations suggest that the submarket sex ratios observed in Fig. 2B are driven by users' mate seeking behavior, and not broader population demographics.
In addition to the sex ratios, Table I also shows that cities differ in their overall market size and composition. New York is the largest market, followed by Chicago, Seattle, and Boston. We also observe some variation in the average number of initial contacts made by men and women in each city, as well as their reply rates. Consistent with other work [29][30][31], we see that men send more messages than women. However, men have a lower chance than women of receiving replies to their messages.

Appendix B: Network analysis
As described in Section II, the starting point for our results is community structure analysis of networks of reciprocated messaging between pairs of individuals. Our city-level analyses are restricted to the largest connected component of the network for each city, although in practice this has little effect since nearly everyone belongs to the largest component. In the network for New York, for example, the largest connected component contains 99.8% of all users.
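Restricting a network to its largest connected component, as described above, is a standard operation. The Python sketch below shows one way to do it with a breadth-first search over an edge list of reciprocal pairs. The input format, function name, and toy network are assumptions for illustration.

```python
from collections import defaultdict, deque


def largest_component(pairs):
    """Return the node set of the largest connected component.

    pairs : iterable of (user_a, user_b) undirected edges (assumed format)
    """
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy network: a five-node component and an isolated pair.
pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("x", "y")]
giant = largest_component(pairs)
coverage = len(giant) / 7  # fraction of the seven toy users in the largest component
```

The `coverage` value is the analogue of the 99.8% figure quoted for New York.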
Our analysis of the full, nationwide messaging network in Fig. 1 is based on standard modularity maximization, as described in Section IV B. The structure within our individual city networks, however, is more complicated, being partly assortative (with respect to submarket) but also partly disassortative (with respect to gender, since most messages are between a man and a woman). To correctly detect and classify this kind of mixed structure we need a more flexible detection method. The leading such method is the statistical inference method based on fitting the network to a stochastic block model [39,40,56,57], which is the approach we employ in this work. Specifically, we use the degree-corrected stochastic block model [39], which is a generative model of a random community-structured network as follows.
Let $n$ be the number of nodes in the observed network (a number typically in the thousands or tens of thousands for the networks studied here). The degree-corrected block model allows us to create a model network of the same size by first generating $n$ nodes, numbered from 1 to $n$, each of which is assigned to one of $k$ communities or submarkets. The communities are numbered from 1 to $k$, and nodes are assigned to communities independently at random, with probability $\gamma_r$ of being assigned to community $r$, where the $\gamma_r$ are parameters we choose, subject to the normalization constraint

$$\sum_{r=1}^{k} \gamma_r = 1. \tag{B1}$$

When all nodes have been assigned to communities, edges are placed at random between pairs of nodes, independently but with probabilities that depend on the communities to which the nodes belong, such that when all edges have been placed the number falling between any pair of nodes $i, j$ is Poisson distributed with mean $d_i d_j \omega_{rs}$, where $r$ and $s$ are, respectively, the communities to which nodes $i$ and $j$ belong, $\omega_{rs}$ are parameters that we choose, and $d_i$ is the degree of node $i$ in the observed network that we are fitting (i.e., it is the number of connections node $i$ has to other nodes). The inclusion of $d_i$ is what distinguishes this "degree-corrected" model from other forms of the stochastic block model. As we will see, the degree correction fixes the expected degree of every node within the model to be equal to the observed degree of the same node in the data, allowing the model to give significantly better fits to empirical data.

This defines the "forward" process of generating a random network given the parameters $\gamma, \omega$ of the model. Using the model for community detection involves the inverse process of fitting the model to observed data so as to determine the values of the parameters that give the best fit. This we do by the method of maximum likelihood. Our undirected network of two-way communication between web site users is represented by an adjacency matrix $A$ with elements $a_{ij} = 1$ if there is an edge between nodes $i$ and $j$ and zero otherwise.
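The forward process just described can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `sample_dcsbm`, the uniform degrees, and the particular values of the parameters are assumptions chosen for the example, not values fitted to the messaging data.

```python
import numpy as np

def sample_dcsbm(d, gamma, omega, rng):
    """Draw one network from a degree-corrected block model.

    d     : array of n degree parameters d_i
    gamma : array of k community probabilities (sums to 1)
    omega : symmetric k x k array of parameters omega_rs
    """
    n, k = len(d), len(gamma)
    # Assign each node to a community independently with probabilities gamma.
    c = rng.choice(k, size=n, p=gamma)
    # Poisson mean d_i d_j omega_{c_i c_j} for every pair of nodes.
    mean = np.outer(d, d) * omega[np.ix_(c, c)]
    # Draw each unordered pair i < j once, then symmetrize.
    counts = rng.poisson(np.triu(mean, k=1))
    A = counts + counts.T
    return c, A

rng = np.random.default_rng(1)
n = 200
d = np.full(n, 5.0)                      # illustrative: every node has d_i = 5
gamma = np.array([0.5, 0.5])
# Illustrative assortative choice, scaled so expected degrees are close to d_i.
omega = np.array([[0.9, 0.1], [0.1, 0.9]]) / (0.5 * d.sum())
c, A = sample_dcsbm(d, gamma, omega, rng)
```

Because edge counts are Poisson distributed, a pair of nodes can occasionally receive more than one edge in a sample; for sparse networks such multi-edges are rare.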
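It is straightforward to show that the probability, or likelihood, of generating the observed network from the model, for given values of the parameters $\gamma, \omega$, is

$$P(A|\gamma,\omega) = \sum_c P(A,c|\gamma,\omega), \tag{B2}$$

where $c$ denotes the complete set of community assignments $\{c_i\}$ and the log-likelihood $\mathcal{L}(c) = \log P(A,c|\gamma,\omega)$ of generating a particular set of community assignments and edges is given by

$$\mathcal{L}(c) = \sum_{i}\sum_{r} \delta_{c_i,r}\log\gamma_r + \tfrac{1}{2}\sum_{ij}\sum_{rs} \delta_{c_i,r}\,\delta_{c_j,s}\,\bigl(a_{ij}\log\omega_{rs} - d_i d_j\,\omega_{rs}\bigr), \tag{B3}$$

where $\delta_{rs}$ is the Kronecker delta and we have neglected additive and multiplicative constants independent of the parameters, since they have no effect on the position of the likelihood maximum.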

Expectation-maximization (EM) algorithm
To find the values of the parameters $\gamma$ and $\omega$ most likely to have generated the observed network we wish to maximize Eq. (B2) with respect to the parameters. Direct maximization is cumbersome so we employ a standard trick from the machine learning toolkit. First, we maximize not the likelihood itself but its logarithm, $\log P(A|\gamma,\omega)$, which gives the same result since the logarithm is a monotone increasing function of its argument and hence the maximum of the logarithm falls in the same place as the maximum of the argument. Then we apply Jensen's inequality, which says that for any set of non-negative quantities $x_i$, we have

$$\log \sum_i x_i \ge \sum_i q_i \log\frac{x_i}{q_i}, \tag{B4}$$

where $q_i$ is any properly normalized probability distribution satisfying $\sum_i q_i = 1$. The exact equality is recovered for the special choice

$$q_i = \frac{x_i}{\sum_j x_j}. \tag{B5}$$

Applying Jensen's inequality to the log of Eq. (B2), we find that

$$\log P(A|\gamma,\omega) \ge \sum_c q(c)\,\mathcal{L}(c) - \sum_c q(c)\log q(c), \qquad \sum_c q(c)\,\mathcal{L}(c) = \sum_c q(c)\sum_i \log\gamma_{c_i} + \tfrac{1}{2}\sum_{ij}\sum_{rs} q^{ij}_{rs}\,\bigl(a_{ij}\log\omega_{rs} - d_i d_j\,\omega_{rs}\bigr), \tag{B6}$$

where we have made use of Eq. (B3) for the log-likelihood $\mathcal{L}(c)$. Here $q(c)$ is any properly normalized probability distribution we choose over community assignments $c$, and $q^{ij}_{rs}$ is the probability within that distribution that nodes $i$ and $j$ belong to communities $r$ and $s$ respectively, thus:

$$q^{ij}_{rs} = \sum_c q(c)\,\delta_{c_i,r}\,\delta_{c_j,s}. \tag{B7}$$

Following Eq. (B5), the exact equality in (B6) is established, and hence the right-hand side maximized, when we make the choice

$$q(c) = \frac{P(A,c|\gamma,\omega)}{\sum_{c'} P(A,c'|\gamma,\omega)}. \tag{B8}$$

Thus if we maximize the right-hand side of (B6) over possible choices of $q(c)$ it becomes equal to the left-hand side, and if we further maximize the left-hand side with respect to the parameters $\gamma, \omega$ we get the answer we are looking for: the values of $\gamma, \omega$ that maximize the overall likelihood. Put another way, a double maximization of the right-hand side with respect to both $q(c)$ and $\omega, \gamma$ will achieve our goal. At first sight, this appears to make the problem harder: we have turned what was previously a single maximization into a double one. But in fact the double maximization usefully splits the problem into two parts that separately are both straightforward, whereas the original combined problem was difficult.
Maximization with respect to $q(c)$ is achieved by making the choice (B8), as we have said. Maximization with respect to $\gamma$ and $\omega$ can be achieved by simple differentiation. Note that the final sum on the right-hand side of Eq. (B6) does not depend on $\gamma$ or $\omega$, so it vanishes upon differentiating. Taking the derivative of the first sum with respect to $\gamma_r$ and $\omega_{rs}$ while imposing the constraint (B1) then gives us

$$\gamma_r = \frac{1}{n}\sum_i q^i_r \tag{B9}$$

and

$$\omega_{rs} = \frac{\sum_{ij} a_{ij}\, q^{ij}_{rs}}{\sum_{ij} d_i d_j\, q^{ij}_{rs}}, \tag{B10}$$

where $q^i_r$ is the probability within the distribution $q(c)$ that node $i$ belongs to group $r$:

$$q^i_r = \sum_c q(c)\,\delta_{c_i,r} = \sum_s q^{ij}_{rs}, \tag{B11}$$

the second equality being true for any value of $j$.

The result is an expectation-maximization or EM algorithm for fitting the model to the observed network, requiring the simultaneous solution of Eqs. (B8), (B9), and (B10), which is accomplished by simple iteration. We first choose initial values of the parameters $\gamma$ and $\omega$, for instance at random, and use them to calculate the probability distribution $q(c)$ from Eq. (B8). Then we use that distribution to calculate $q^{ij}_{rs}$ and $q^i_r$ from Eqs. (B7) and (B11), and thence to calculate improved estimates of the parameters from Eqs. (B9) and (B10). Then we recalculate $q(c)$, and repeat until convergence is reached.
The end product is a set of best-fit values of the parameters to the observed network data. In addition to this, however, and crucially for our purposes, we also calculate a converged value of the distribution $q(c)$, which, from Eq. (B8), is equal to

$$q(c) = \frac{P(A,c|\gamma,\omega)}{P(A|\gamma,\omega)} = P(c|A,\gamma,\omega). \tag{B12}$$

In other words, $q(c)$ is the posterior distribution over community assignments: the probability, given the observed data $A$ and the best-fit parameter values, of any particular division $c$ of the network into communities. The final step of the calculation is then to assign each node to the community for which it has the highest probability of membership, which is equivalent to choosing the community $r$ for which $q^i_r$ is maximized. This gives us our best division of the network into communities or submarkets.
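To make the iteration concrete, the following sketch runs the EM loop on a toy network small enough that the distribution q(c) can be computed exactly by enumerating every possible assignment. It is an illustration under stated assumptions rather than the procedure used for the city networks: the function name `exact_em`, the random starting values, the numerical floor on the fitted parameters, and the toy network are all choices made for this example, and the log-likelihood is written out directly from the Poisson model described above with parameter-independent constants dropped.

```python
import itertools
import numpy as np

def exact_em(A, k, n_iter=100, seed=0):
    """EM for the degree-corrected block model with q(c) computed by
    brute-force enumeration of all k**n assignments (tiny n only)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    d = A.sum(axis=1).astype(float)
    dd = np.outer(d, d)                          # d_i d_j for every pair
    gamma = np.full(k, 1.0 / k)
    omega = rng.uniform(0.5, 1.5, size=(k, k))
    omega = (omega + omega.T) / (2.0 * d.sum())  # random symmetric start
    assignments = [np.array(c) for c in itertools.product(range(k), repeat=n)]
    onehots = [np.eye(k)[c] for c in assignments]    # Z[i, r] = 1 if c_i = r
    for _ in range(n_iter):
        # E-step: q(c) proportional to exp(log-likelihood of c).
        logp = np.empty(len(assignments))
        for t, (c, Z) in enumerate(zip(assignments, onehots)):
            w = Z @ omega @ Z.T                  # w[i, j] = omega_{c_i c_j}
            logp[t] = np.log(gamma[c]).sum() + 0.5 * (A * np.log(w) - dd * w).sum()
        q = np.exp(logp - logp.max())
        q /= q.sum()
        # One-node marginals and the sums needed for the parameter updates.
        q1 = sum(w * Z for w, Z in zip(q, onehots))
        num = sum(w * (Z.T @ A @ Z) for w, Z in zip(q, onehots))
        den = sum(w * (Z.T @ dd @ Z) for w, Z in zip(q, onehots))
        # M-step: updated gamma_r and omega_rs.
        gamma = q1.mean(axis=0)
        omega = np.maximum(num / den, 1e-12)     # floor avoids log(0)
    return gamma, omega, q1

# Toy network: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

gamma, omega, q1 = exact_em(A, k=2)
labels = q1.argmax(axis=1)    # most probable community for each node
```

The enumeration over all k**n assignments is feasible here only because the network has six nodes. Like any EM procedure, the iteration may converge to a local rather than global maximum, so in practice one would typically repeat it from several starting points.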

Expected degree
A key feature of the degree-corrected block model is its ability to provide a good fit to networks with broad distributions of node degree (the degree of a node in a network being the number of connections it has to other nodes). Most empirical networks, including our messaging networks, have widely varying values of node degree and any model we fit to such networks must, at a minimum, be capable of capturing this variation.
The actual degree of a node in our model network can fluctuate from one realization of the model to another, since the model contains random elements. But the expected value of the degree of node $i$, for the best-fit values of the parameters $\gamma, \omega$ given in Eqs. (B9) and (B10), is always equal to the degree $d_i$ of the same node in the observed network. Thus the fitted model reproduces the degree of every node exactly, apart from fluctuations. To see this, observe that the expected degree of node $i$ in the model is equal to the sum of the expected number of edges $d_i d_j \omega_{c_i,c_j}$ between node $i$ and every other node $j$, averaged over the distribution $q(c)$ of community assignments, thus:

$$\sum_c q(c) \sum_j d_i d_j\,\omega_{c_i,c_j} = d_i \sum_j d_j \sum_{rs} q^{ij}_{rs}\,\omega_{rs}, \tag{B13}$$

where we have made use of Eq. (B7). Most nodes $j$, however, will be far from node $i$ in a large network, so that the community assignments of $i$ and $j$ are essentially uncorrelated. This means that $q^{ij}_{rs} = q^i_r q^j_s$ and the expected degree becomes

$$d_i \sum_{rs} q^i_r\,\omega_{rs} \sum_j d_j\, q^j_s = d_i \sum_{rs} q^i_r\, \frac{\sum_{jk} a_{jk}\, q^{jk}_{rs}}{\sum_j d_j\, q^j_r} = d_i \sum_r q^i_r\, \frac{\sum_{jk} a_{jk}\, q^j_r}{\sum_j d_j\, q^j_r} = d_i, \tag{B14}$$

where we have made use of Eq. (B10) in the first equality, Eq. (B11) in the second, and the trivial observation $\sum_j a_{ij} = d_i$ (together with the normalization $\sum_r q^i_r = 1$) in the third.
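The expected-degree property is easy to check numerically in the special case where every node is assigned to a single community with probability one, so that the pair probabilities factorize exactly and the best-fit parameters reduce to edge counts divided by products of group degree totals. The sketch below is illustrative only; the helper name `fitted_omega` and the toy network are assumptions made for this example.

```python
import numpy as np

def fitted_omega(A, c, k):
    """Best-fit omega_rs for a hard assignment c: edges between groups r and s
    divided by the product of the total degrees of the two groups."""
    Z = np.eye(k)[c]                 # one-hot membership matrix, n x k
    d = A.sum(axis=1)
    m = Z.T @ A @ Z                  # m[r, s] = sum_ij a_ij [c_i = r][c_j = s]
    D = Z.T @ d                      # D[r] = total degree of group r
    return m / np.outer(D, D)

# Toy network: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
c = np.array([0, 0, 0, 1, 1, 1])

d = A.sum(axis=1)
omega = fitted_omega(A, c, k=2)
# Expected degree of node i: sum over j of d_i d_j omega_{c_i c_j}.
expected = d * (omega[np.ix_(c, c)] @ d)
```

For any hard assignment in which every group has at least one edge, `expected` coincides with the observed degrees `d`, which is the property derived above.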

Belief propagation and the calculation of the posterior distribution
Elegant though the EM algorithm is for the community detection problem, it is not (yet) a workable method, because for all but the very smallest of networks it is not feasible to evaluate the posterior distribution $q(c)$ directly from Eq. (B8): the number of possible values of $c$ is simply too large. The number of possible divisions of $n$ nodes into $k$ communities is $k^n$, so a division of 10 000 nodes into, say, four communities would have $4^{10000} \simeq 10^{6000}$ possible divisions, which is far more than can be enumerated by even the most powerful computer. Within the statistical literature, the standard way of circumventing this problem is to approximate the distribution $q(c)$ using Markov chain Monte Carlo importance sampling, and that could be done here too. In our work, however, we use a recently proposed alternative approach based on belief propagation [40,58,59], which is significantly more efficient for the particular problem at hand.
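The size of this search space can be checked with a short calculation. The sketch below works with logarithms rather than forming the integer itself; the variable names are chosen for this example.

```python
import math

n, k = 10_000, 4                      # nodes and communities from the example above
log10_count = n * math.log10(k)       # log10 of k**n
digits = math.floor(log10_count) + 1  # number of decimal digits in k**n
```

Here `log10_count` is approximately 6020.6, so the number of divisions has 6021 decimal digits, in line with the order-of-magnitude figure quoted above.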
The belief propagation method focuses on a quantity $\mu^{i\to j}_r$, called the belief, which is equal to the (posterior) probability that node $i$ belongs to community $r$ if we are not told whether there is an edge between nodes $i$ and $j$, i.e., if we are given the entire adjacency matrix $A$ except for the element $a_{ij}$. The omission of this one matrix element is crucial to the method: it allows us to write a self-consistent set of equations for the beliefs that can be solved by numerical iteration. For the degree-corrected block model used here, the appropriate equations have been given by Yan et al. [59]: [...] $\omega_{rs}\,\mu^{k\to i}_s$, (B16) and $q^i_r$ is the one-node marginal posterior probability of node $i$ belonging to group $r$ defined previously in Eq. (B11). This probability can itself be calculated directly from the beliefs according to [...]

This gives a set of beliefs for the current values of the parameters $\gamma, \omega$. Returning to the EM algorithm, we then use those values to compute improved estimates of the parameters from Eqs. (B9) and (B10). To do this, we first need to calculate the two-node marginal probabilities $q^{ij}_{rs}$ from the beliefs, which we do as follows. Note that $q^{ij}_{rs}$ appears only in the sum in the numerator of Eq. (B10) and that the sum involves only the values of $q^{ij}_{rs}$ for node pairs $i, j$ that are connected by an edge. (Those not connected by an edge have $a_{ij} = 0$ and hence do not appear in the sum.) For pairs connected by an edge, $q^{ij}_{rs}$ is by definition equal to [...] where the parameters $\gamma, \omega$ are assumed given in each probability and A denotes the set of elements of the adjacency matrix excluding $a_{ij}$ (which is specified separately). But each term in this expression is now straightforward to write in terms of quantities we already know.
The probability $P(a_{ij} = 1 \mid c_i = r, c_j = s, A)$ is just the likelihood of the edge from $i$ to $j$, which for our stochastic block model is [...]

[...] and women. Almost all messages on the web site between heterosexual users looking for romantic relationships are between a man and a woman, well over 99%. Very few are between two men or two women. Our algorithm readily perceives this structure, reliably dividing the network into men and women without the need for us to identify the sexes explicitly. This "disassortative" structure is characterized by a matrix $\omega_{rs}$ of probabilities that has almost all of its weight off the diagonal (most connections are between different groups) and virtually none on the diagonal (connections between members of the same group). In addition to this trivial structure, however, there is also the nontrivial group structure that we refer to as submarkets: the tendency of the population to break up into distinct communities of dating with relatively little message traffic between communities.
A practical upshot of this is that if we wish to divide our network into, say, four submarkets, we must actually instruct our algorithm to look for twice this number of communities (i.e., eight). If we do this, then it reliably finds four submarkets, each further divided into men and women.
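One simple way to read submarkets off such a fit is to pair each community with the community it connects to most strongly. The sketch below illustrates the idea on a hand-made parameter matrix for two submarkets; the helper `pair_groups`, the pairing rule, and the numerical values are hypothetical and are not described in the analysis itself.

```python
import numpy as np

def pair_groups(omega):
    """Pair communities that are each other's strongest connection.

    Returns a sorted list of (r, s) pairs with r < s."""
    partner = omega.argmax(axis=1)
    pairs = set()
    for r, s in enumerate(partner):
        s = int(s)
        if r != s and partner[s] == r:
            pairs.add((min(r, s), max(r, s)))
    return sorted(pairs)

# Hypothetical fitted matrix for 4 communities (2 submarkets x 2 genders):
# almost all weight lies off the diagonal, between the two halves of a submarket.
omega = np.array([
    [0.001, 0.900, 0.001, 0.050],
    [0.900, 0.001, 0.050, 0.001],
    [0.001, 0.050, 0.001, 0.800],
    [0.050, 0.001, 0.800, 0.001],
])
submarkets = pair_groups(omega)
```

With this matrix, communities 0 and 1 form one submarket and communities 2 and 3 the other.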
In the calculations presented in Section II B we chose to divide each city into four submarkets, but divisions into other numbers of submarkets would also be reasonable. To explore the effect of varying the number of submarkets we have performed divisions of the networks into various numbers of communities. Figure 5 shows the results of several possible divisions of the New York network. (Similar patterns are seen in the other three cities.) The panels of the figure show the age distribution (men and women combined) for divisions into three, four, five, and six submarkets (which means six, eight, ten, and twelve communities in total, once the trivial division between men and women is factored in). As we can see, the primary effect of increasing the number of submarkets is to divide the population into more closely spaced age ranges, so that divisions into larger numbers of groups give a finer, more granular, picture of the market structure but the same overall behavior. As with all statistical analyses in which data are divided into bins, there is a balance to be struck between larger numbers of bins, which gives finer detail in the analysis, and smaller numbers of bins, which gives better statistics. Our choice of four submarkets per city gives a good picture of the overall behavior while maintaining sufficient statistical power for accurate analysis of the population within submarkets.

FIG. 6: [...] Fig. 2B, in which the ratio of men to women becomes progressively more female-heavy as we move into the older submarkets, is duplicated in each case here, demonstrating that this is a general behavior, and is not particular to any one choice of the number of submarkets.
The systematic variation of the ratio of numbers of men and women among submarkets seen in Fig. 2B also extends to divisions into other numbers of submarkets, as shown in Fig. 6. As the figure shows, the pattern for the four-way division of Fig. 2B, whereby the sex ratio becomes progressively more female-heavy as we move into the older submarkets, is duplicated for divisions into three, five, and six submarkets as well.

Appendix C: Additional analyses and results
In Section II B we observed that minority women tend to be younger than white women in the same submarket, a trend that is particularly noticeable for black women. While the pattern holds across all of our four cities, it is most pronounced in Chicago. Here we provide additional details on the racial composition of Chicago users and insight into processes that give rise to the age differences we observe between white and black women in Chicago. We also examine whether the patterns observed in Chicago hold in New York, the other city with a sizable black population.

Figure 7 shows the mix of ethnicities for men and women in each Chicago submarket. The predominant group in all submarkets is whites, which reflects the overall composition of the Chicago user base. There is, however, systematic variation in the relative size of the minority population across submarkets. Black men and women are more prevalent in the oldest submarkets, which is surprising given that they are slightly younger, on average, than their white counterparts. One factor driving this is that the black women messaged by both black and white men are, on average, significantly younger than the white women messaged by men in the same submarket, and this phenomenon is most pronounced in the oldest submarkets. This tends to pull younger women into the older submarkets, and with them the men that they exchange messages with. This helps explain not only why there is a surplus of black women in the oldest submarket, but also why these women are significantly younger, on average, than white women in the same submarket.

FIG. 7: All submarkets are predominantly white, which is consistent with the overall composition of the Chicago market. However, despite the fact that whites are older, on average, than other racial groups, they are disproportionately concentrated in submarket 2. Black users, especially black women, are overrepresented in the older submarkets. Figure 3 suggests one mechanism driving these patterns.

Figure 8 extends our analysis of age differences in messaging by submarket and race (Fig. 3) to Boston and Seattle. The pattern is similar overall to that for New York and Chicago: age differences tend to be larger for first messages than for replies, and also larger in older submarkets. In submarket 4, for example, white men initiate contact with Asian women who are around 6 years younger than themselves on average, but receive replies from women who are only around 3.5 years younger. Also in line with the patterns for New York and Chicago, we see that within a given submarket non-white women tend to receive messages from older men than do white women; this is especially true in submarket 4.
There are, however, also some striking differences between the results for Seattle and Boston and those for New York and Chicago. In Boston and Seattle, women in submarket 4 (and for Seattle submarket 3 as well) display little tolerance for overtures from much older men. Note how in these cities women's replies are predominantly to men of similar age to themselves, despite the fact that men are messaging significantly younger women. Black women in Seattle, for example, are receiving overtures from black men about 3.5 years older than themselves on average, but reply primarily to men of about their own age. Notable exceptions to this behavior are messages from Asian men to Asian women, and from Hispanic men to Hispanic women, which appear to receive replies despite large average age differences.

FIG. 8: Mean difference in years between the age of men of varying races in Seattle and Boston (vertical axis) and the women they message, by race of women and submarket (horizontal axis). Race is coded as: A = Asian, B = black, H = Hispanic, and W = white. The first two rows show the average age difference for, respectively, all initial messages sent in Boston and those that received a reply. The bottom two rows show the same patterns for Seattle. We observe zero instances in Boston where black women receive messages from Asian men in submarket 2, so these cells are marked with an X.
[54] A CBSA is defined to be an urban center of at least 10 000 people plus adjacent areas that are socioeconomically tied to the urban center by commuting.