The study and analysis of public policy continues to mature in the United States and around the world. Public policy as a field of study has been the only new academic discipline added to the National Research Council in the past 25 years. Close to 300 programs in public policy, public affairs, and public administration are currently offered at American colleges and universities (NASPAA, 2012). The Association of Public Policy Analysis and Management (APPAM) at present has over 2,000 individual members from countries worldwide (Personal correspondence with APPAM Executive Director, 10/22/2013). In the past year, the contact authors for roughly one-third (32.5 percent) of over 500 submissions to the Journal of Policy Analysis and Management (JPAM), a journal focused on policy analysis and policy research, resided outside the United States; this share is sharply higher if the residences of co-authors are taken into account. The discipline of policy analysis is clearly recognized globally.
The scholarly dissemination of research is a critical aspect of public policy's development as a discipline. Much of the applied research on public policy falls under the broad category of policy analysis. Textbooks and "how to" guides on the topic of policy analysis have been used in the training of students and practitioners for decades. Academic journals such as the Policy Studies Journal and JPAM provide outlets for high-quality empirical policy analysis. Government agencies, foundations, nonprofits, and advocacy groups also generate a wealth of policy analysis although the quality varies widely.
A distinction exists between the empirically-oriented analyses conducted by scholars of public policy and management, which are typically published in academic journals and books, and the types of public policy analysis regularly conducted by practitioners and policy makers themselves (which are generally found in white papers, agency reports, and other organizational publications). Weimer (2009) distinguishes between "policy research," which has the "express purpose of informing… policy," and "policy analysis," the purpose of which is to "assess systematically the alternative policy choices that could be made to address any particular problem" (p. 93). In light of these definitions, the research we discuss in this review could be labeled "policy research." The focus of our analysis, however, is not on the research itself but on broader changes in how public policies are analyzed. Applied policy analysis and policy research are evolving in concert in terms of the sources and means of acquiring data and developing methodologies. Thus, since we intend to speak to new developments in the analysis of public policies, whether to test prevailing theory or to identify the best policy alternative, we hereafter use the term "policy analysis" to connote this broader field of both policy research and analysis.
Conventional discussions of policy analysis trends often discuss the emerging econometric techniques used to tease out causal relationships between a policy and its effect on an outcome of interest. We take a different approach here by first looking at how our understanding of randomization is evolving, and then at how new types of data used in policy analysis have the potential to disrupt and transform the status quo. In light of the diverse sources and types of non-experimental data we highlight in this paper, we base our analysis of data trends on the following question: What data are available to carry out empirical policy analysis and what are the tradeoffs of such data?
We consider the topic of randomized experiments relevant to any discussion about trends in policy analysis driven by the proliferation and development of data. The issue of experimentation and random assignment typically frames the discussion and understanding of experimental and non-experimental methods alike, and shapes how we assess the validity and usefulness of all data. Thus, any discussion of data is largely indivisible from that of social experimentation and random assignment. To that end we first summarize a recent debate waged in the pages of JPAM on the role of social experimentation and policy analysis. We then segue to our discussion on three types of data that are driving innovation in policy analysis: state and system-wide administrative data, spatial data (particularly related to the use of geographic information systems (GIS)), and Big Data. Finally, we conclude with a discussion of future directions and areas of growth in the field of policy analysis and policy research.
Experiments and random assignment are not new to policy analysis. What is new, however, is the increasingly heated debate over the extent to which randomized controlled trials (RCT) ought to take precedence over other research designs using either quasi-experimental or observational approaches. This debate is rather complex; most scholars agree that randomized controlled trials constitute the "gold standard" for identifying causal effects but take issue with one another over the extent to which: (1) scholars should prioritize the use of random assignment; (2) credence should be given to analyses based upon observational data; and (3) the collective focus on random assignment neglects important research questions for which randomization is not feasible.
The debate over random assignment in policy research was recently hashed out in "Point/Counterpoint" exchanges between Robinson Hollister and Richard Nathan and subsequent reader responses published in JPAM over two issues (27 and 28). The debate was framed by four questions posed at the outset of the "Point/Counterpoint" exchange.
Initially, Hollister argued in favor of random assignment while Nathan took a more nuanced approach, essentially concluding that random assignment has a "proper, but limited role" in policy research (Nathan, 2008a, p. 410). We summarize a few of the most salient points about randomization made in the "Point/Counterpoint" exchange here and encourage readers to review articles in the debate for a more in-depth discussion of the topic (e.g. Berlin & Solow, 2009; Cook & Steiner, 2009; Greenberg, 2009; Hollister, 2008, 2009; Nathan, 2008a, 2008b, 2009; Pirog et al., 2009; Wilson, 2009).
Fundamental disagreements emerged in the Hollister-Nathan debate on the use of random assignment and experiments. One of the main areas of disagreement revolved around the extent to which random assignment ought to take precedence over other methods of policy analysis. Hollister argued that random assignment experiments should be the "method of first resort" whenever "one wishes to assert that the estimated impact is caused by the policy or institution" (Nathan, 2008a, p. 402). Nathan (2008b) countered that an "over-reliance on randomization" has "crowded out valuable nonexperimental types of public policy research" including implementation research, process evaluation, and policy analysis concerned more with societal trends carried out by demographers, political scientists, historians, and sociologists (p. 608).
Another point of discord was the perceived cost of conducting randomized experiments. Of claims that experiments are more expensive than nonexperimental methods, Hollister notes "that this simply is not true" (Nathan, 2008a, p. 404, emphasis his). Nathan (2008b) responds to this contention with unease and points out that Institutional Review Board processes, managing treatment and control groups, and collecting in-program and post-program data are all processes which are "time consuming and expensive" (p. 609). Greenberg (2009) also weighs in on this question of costs associated with policy experiments and concludes that "the experimental approach would not usually be much more expensive than nonexperimental estimation" (p. 174) although he cites no evidence to support this point. These unsubstantiated claims about experiments' costs on both sides of the debate demonstrate that an empirically-driven study of experiments' costs and benefits would make a valuable contribution to our field.
Hollister and Nathan agree on little, but a few points emerge on which they concur. Both agree that the much-exalted Perry preschool experiment is one of the worst examples of random assignment due to design flaws and unreliable estimates (Nathan, 2008a, 2008b). The authors also agree that most statistical alternatives to random assignment are problematic with the exception of regression discontinuity designs, which they consider "a promising alternative to random assignment" (Nathan, 2008b, p. 608). Others (Cook & Steiner, 2009; Pirog et al., 2009) also respond to this question of viable alternatives to random assignment. Cook et al. (2008) and Cook and Steiner (2009) identify ways in which carefully designed analyses can furnish valid causal conclusions from observational data. However, in the Point/Counterpoint section, Cook and Steiner (2009) also note that the requisite conditions for such approaches "are not common" (p. 165), require "more numerous and more opaque assumptions" (p. 166), and that most observational analyses do not use such approaches. Pirog and colleagues (2009) find that "regression discontinuity designs can replicate some random assignment studies" but statistical corrections in general "do not uniformly and consistently reproduce experimental results" (p. 171). Finally, Hollister and Nathan agree that the Institute of Education Sciences' (IES) focus on empirical policy analysis (a topic discussed by Carlson, 2011) is valuable although Nathan (2008b) expresses concern that a focus on randomized experiments comes at the expense of nonexperimental research in education.
Berlin and Solow (2009) astutely point out that a debate on the merits of random assignment misses the key question, "What do policymakers need to know and what methods are most appropriate for answering their questions?" (p. 175). In other words, scholars designing and carrying out policy analysis should employ the best method available, whether random assignment or a statistical correction thought to closely approximate random assignment (see Cook et al. 2008; Shadish et al. 2008), to address important and interesting policy-relevant questions. We agree with Berlin and Solow's stance, but we would modify this question to instead ask, "What do policymakers need to know, what methods are most appropriate for answering their questions, and what data are available to carry out empirical analysis?" We focus on this third point, the question of what data are available for policy analysis, for the remainder of this article.
In this article we examine an ongoing shift in policy analysis from an emphasis on how we analyze data to a focus on how we create, gather, and manage data. The basic goal of policy analysis is to examine the "true" effect of a given policy intervention. The fundamental challenge, however, is that unlike laboratory test tubes or other traditional experimental media, no two individuals, households, wetlands, cities, schools, or any other object of policy analysis can be considered identical. Thus, policy analysis must always contend with unobserved counterfactuals: (1) what would have happened had the untreated subject received the policy intervention? and (2) what would have happened had the treated subject not received the policy intervention? Holland (1986) famously refers to this issue of counterfactuals as the fundamental problem of causal inference.
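This identification problem can be made concrete with a brief simulation (a minimal sketch in Python; all quantities are synthetic). Both potential outcomes are generated for every unit, but the analyst observes only one, and random assignment is what allows a simple group-mean comparison to recover the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulate BOTH potential outcomes for every unit -- something an
# analyst can never observe in practice.
y0 = rng.normal(10, 2, n)   # outcome if untreated
y1 = y0 + 1.5               # outcome if treated (true effect = 1.5)

# Random assignment to treatment.
treated = rng.integers(0, 2, n).astype(bool)

# The fundamental problem: only one potential outcome is ever revealed.
y_observed = np.where(treated, y1, y0)

# Under random assignment, a simple group-mean difference recovers
# the true effect (up to sampling error).
estimate = y_observed[treated].mean() - y_observed[~treated].mean()
```

Without randomization, the same group-mean difference would confound the treatment effect with pre-existing differences between the groups.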
In the traditional policy analysis paradigm, analysts and policy scholars have sought to answer these questions using methodological strategies intended to remedy the flaws and shortcomings of existing data. Thus, various econometric techniques and other complex statistical methodologies have proliferated in recent decades as means by which to deal with incomplete, insufficient, or potentially biased data. For instance, imputation techniques have been developed to address incomplete or missing observations (Allison 2001; Enders 2010; Rubin 2004). Likewise, econometricians have put forth methods such as instrumental variable (IV) models and propensity score matching (PSM) to deal with estimation problems related to selection bias, non-random treatment assignment, and other violated model assumptions. Policy analysts continue to hone and use such techniques, and the recent policy analysis literature continues to produce state-of-the-art examples (Dahan and Strawczynski 2013; Frumkin et al. 2009; Kim 2013; Yoruk 2012).
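To illustrate how one such correction works, the following is a minimal propensity score matching sketch (Python, with synthetic data; the estimator choices are ours for illustration, not drawn from the studies cited). A confounder drives both treatment take-up and the outcome, so the naive comparison is biased upward, while matching on estimated propensity scores moves the estimate toward the true effect:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000

# Synthetic data: confounder x drives both treatment take-up and the outcome.
x = rng.normal(0, 1, n)
treat = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)  # selection on x
y = 2 * x + 3 * treat + rng.normal(0, 1, n)                 # true effect = 3

# Naive comparison is biased upward: treated units tend to have higher x.
naive = y[treat == 1].mean() - y[treat == 0].mean()

# Step 1: estimate propensity scores P(treat = 1 | x).
X = x.reshape(-1, 1)
ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the control with the closest score.
t_idx = np.where(treat == 1)[0]
c_idx = np.where(treat == 0)[0]
matches = c_idx[np.abs(ps[c_idx][None, :] - ps[t_idx][:, None]).argmin(axis=1)]

# Step 3: average treated-minus-matched-control difference (the ATT).
att = (y[t_idx] - y[matches]).mean()
```

Note that this correction succeeds only because the confounder is observed; as discussed below, no such method can repair selection on unobservables.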
However, while IV models, PSM, and other techniques attempt to compensate for non-random selection, there is little evidence that these advanced methods, for all of their complexity, are able to compensate for flawed research designs (Couch & Bifulco 2012; Pirog 2009; Pirog et al. 2009). It is important to distinguish in this case between issues pertaining to data and issues pertaining to research design. Imputation techniques, for instance, are intended to account for problematic data (i.e., data that are missing, incomplete, or inaccurate). In the case of methods such as PSM, the data are not necessarily flawed in the sense of being incomplete or inaccurate; rather, these research methods are not necessarily able to mitigate potential endogeneity between treatment and response variables even with the most accurate and complete data. In this way, we observe that data and research design are intimately related, as policies and programs are increasingly implemented with a careful focus on the data that will be produced and how it will be analyzed. Scholars are also striving to leverage data more effectively by making them more accessible and more compatible, and using new types of data to measure policy outcomes of interest. As we describe in this paper, these efforts take many forms.
First, many scholars and analysts are taking advantage of advances in data storage and accessibility in ways that even a few years ago were unfeasible. Researchers from across the country and the world are able to remotely access data held by state agencies and even county and city governments. In many cases, these data contain identifiers that allow analysts to combine data from different sources. Further, analysts and administrators alike increasingly design and implement policies and programs with policy analysis in mind. Thus, policies and programs are producing data that are more extensive, accurate, and accessible.
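A hedged illustration of this kind of record linkage (Python; all agency names, identifiers, and values are hypothetical) might merge de-identified extracts from two agencies on a shared person ID:

```python
import pandas as pd

# Hypothetical de-identified extracts from two state agencies; all
# field names and values are illustrative, not drawn from any real system.
wages = pd.DataFrame({
    "person_id": [101, 102, 103, 104],
    "quarterly_wage": [8200, 9100, 7650, 10400],
})
training = pd.DataFrame({
    "person_id": [102, 103],
    "enrolled_in_training": [True, True],
})

# A left join keeps every wage record and flags which individuals
# also appear in the training agency's records.
linked = wages.merge(training, on="person_id", how="left")
linked["enrolled_in_training"] = linked["enrolled_in_training"].fillna(False)
```

In practice such linkage raises confidentiality and matching-quality issues that the statewide systems discussed below are designed to manage.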
The policy analysis community has also begun to recognize that many – if not all – policies have a significant spatial component. In other words, the distribution and location of subsidized housing, schools, or animal preserves matters greatly; simply treating each observation as independent of one another can thus prove misleading. In particular, neighboring schools, jurisdictions, property owners, or even neighborhoods themselves greatly affect each other's outcomes. Thus, policy analysts are increasingly using geographic information systems (GIS), spatial data, and spatial modeling approaches to analyze policies. These analyses are better able to account for effects related to space and place that elude traditional statistical analysis. In what follows, we discuss these trends in greater detail. We also discuss a new type of data, Big Data, as well as its associated methods, which we believe will become a critical tool for policy analysts in future years.
In this section we first examine the use of state and system-wide administrative data in policy analysis. State efforts to collect detailed records for all program participants in areas such as healthcare, education, and social services provide researchers with a vast store of data. Second, we examine the use of spatial data and geographic information systems (GIS), which enable analysts to link diverse data sets and better account for spatial relationships. Third, we tackle the topic of Big Data, the U.S. government's initiatives to promote Big Data, and its potential role in policy analysis. We present these three types of data in the order in which we find the greatest amount of existing policy analysis. In other words, we find the most examples of policy analysis using administrative data, far fewer examples of policy analysis using geographic information systems, and no policy analysis that yet leverages Big Data. We include Big Data, however, because we are certain that policy analysts are on the cusp of harnessing its significant potential.
The use of administrative data is not new per se. Administrative records from state and federal agencies have been a data source of policy analysis for decades. What is new, however, is the growing effort by states to collect vast amounts of data, link records across multiple data sources, and provide these data to researchers for policy analysis. This innovation in administrative data collection answers the call made long ago by Jabine & Scheuren (1985) for public agencies to design internal "data systems [that] facilitate links with other systems" (p. 387).
Statewide longitudinal data linked across multiple agencies remedy a primary limitation of national data sets for policy analysis. Whereas national data sets rely on a representative sample of the population, a statewide longitudinal data set typically contains records for either: (1) all residents in the state where program participation is the variable of interest; or (2) all participants in a program where program characteristics are the variables of interest. Longitudinal data provide a significant advantage in that they track outcomes for a given set of observations over time, allowing researchers to identify long-term effects and thus provide a more comprehensive understanding of policy impacts. Current efforts to create and maintain longitudinal datasets are focused at the state level; in fact, federal funding is often provided to state agencies to support development of administrative datasets at the state level. This is why, though Ha and Meyer (2010) acknowledge that their study of child care credits in Wisconsin is limited by using data from only one state, they argue that "state administrative data are the most reliable type of data to examine the dynamics of subsidy use and factors associated with it" and go on to note that, "no national survey data exists that include detailed individual-level information about families' monthly receipt of subsidies, employment patterns, and earnings in a longitudinal form" (p. 348).
Efforts to link individual-level records are increasingly common in education and workforce-related research. This trend is largely attributable to the Statewide Longitudinal Data Systems (SLDS) Grant Program authorized by the Educational Technical Assistance Act of 2002 (ETA) and later bolstered by the American Recovery and Reinvestment Act of 2009 (ARRA). In 2005 the Institute of Education Sciences (IES) began awarding competitive grants to states. By 2013, every state except Wyoming, New Mexico, and Alabama had received an SLDS grant (NCES, 2013a). A similar effort to encourage states to collect administrative data across agencies was introduced in 2009 by the U.S. Department of Labor. The Workforce Data Quality Initiative (WDQI) focuses primarily on workforce longitudinal data systems although efforts include matching individual records to education data. WDQI is a smaller program than SLDS: By 2013, 31 states had received a total of $31 million in WDQI grants whereas more than $300 million in SLDS grants had been awarded to 47 states (NCES, 2013c). Additionally, the U.S. Department of Health and Human Services' Administration for Children and Families (ACF) has provided funding to enhance states' TANF and related administrative data for operational, administrative, policy development and research purposes (Wheaton, Durham and Loprest, 2012).
Washington State provides a representative case study in examining how data are linked across agencies. Washington received approximately $23 million in SLDS funds in 2009 and a $1 million WDQI grant in 2012 to design and implement a data sharing system across the state's agencies (NCES, 2013b). Housed in the state's newly created Education Research and Data Center, these data link records across eleven state agencies (Table 1). The resulting data from this collaboration across agencies provide a rich set of variables for analysts to use in measuring policy effects in Washington.
Unfortunately, Washington's relatively recent receipt of federal support for longitudinal data systems means there is little published policy analysis using these data.
Policy analysis using administrative data is common (Mueser, Troske & Gorislavsky, 2007), but to highlight the extent to which researchers leverage the potential of states' administrative records we turn to recent policy analysis conducted by Colman and Joyce (2011). In their study of the effects of a Texas law (the Woman's Right to Know Act) on abortion rates after 16 weeks' gestation, Colman and Joyce collected individual-level abortion records from 2001 to 2006 from the Texas Department of State Health Services and from the states surrounding Texas. The data involved in this analysis were staggering; the authors collected 451,174 records of Texas residents and to those data added information on the number of abortions obtained by Texas residents in Arkansas, Kansas, New Mexico, Oklahoma, Colorado, Mississippi, Missouri, and Tennessee (p. 781). Census data were also used to provide population estimates by year, state, age, and gender. The authors concluded that the law led to an 88 percent drop in abortions at or after 16 weeks and a quadrupling of the number of residents who left Texas to obtain the procedure in neighboring states (p. 775). These findings, with clear implications for state and federal policymaking, demonstrate the value of empirical policy analysis based on vast stores of administrative data.
Geo-spatial data are another way in which increased data accessibility and capabilities are motivating the evolution of policy analysis. Analysts and policy-makers increasingly recognize that location – both place-specific context and spatial relationships amongst places – plays a highly significant role in policy effects and outcomes (Wise & Craglia, 2008). Perhaps more importantly, analysts increasingly have access to data, resources, and tools that can address such issues. While geographic information systems, or GIS, are certainly not new, mobile technology, web applications, and data storage advances have served to make GIS a tool that policy analysts can leverage without needing access to certified GIS technicians or expensive personal software licenses. Further, increased spatial data production and availability enables analysts to conduct spatial analyses such as geographically weighted regression and use geocoding to spatially link observations and covariates. This latter capability is where we observe recent growth. Many policy analyses take advantage of geocoding to incorporate disparate databases and leverage this additional information. This follows the wider trend noted by Masser (2010), that current emphasis on spatial data within the realm of policy analysis is primarily related to making spatial data accessible to a wider range of users and enabling linkage between different databases and different types of data.
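As a simple sketch of what such geocoded linkage involves (Python; all coordinates and covariates are illustrative, and planar distance is assumed adequate at neighborhood scale), observations can be assigned to the nearest neighborhood centroid and thereby inherit neighborhood-level covariates:

```python
import numpy as np

# Hypothetical geocoded observations (longitude, latitude) and the
# centroids of three neighborhoods -- coordinates are illustrative.
homes = np.array([
    [-117.15, 32.72],
    [-117.10, 32.75],
    [-117.20, 32.71],
])
centroids = np.array([
    [-117.16, 32.72],   # neighborhood 0
    [-117.11, 32.76],   # neighborhood 1
    [-117.25, 32.70],   # neighborhood 2
])

# Assign each observation to its nearest neighborhood centroid.
# (Planar distance is a fair approximation at this small spatial scale.)
d = np.linalg.norm(homes[:, None, :] - centroids[None, :, :], axis=2)
neighborhood = d.argmin(axis=1)

# Neighborhood-level covariates (e.g., median income) can now be attached.
median_income = np.array([54000, 61000, 48000])
home_income_context = median_income[neighborhood]
```

Production GIS tools perform this kind of spatial join at scale, but the underlying logic is the same: geographic coordinates, not nominal identifiers, mediate the linkage.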
As an example, consider Nelson's (2010) analysis of the relationship between credit scores and residential sorting. The residential sorting model used in this study requires that individual household and property data be linked to local neighborhood attributes. Nelson (2010) accesses a mortgage data set of 16,805 observations from a Southern California bank. Traditionally, Nelson might have used the census tract code associated with each mortgage and simply linked observations to tract-level attributes. However, tract-level data would be suboptimal considering the kinds of highly localized, neighborhood-specific attributes that are hypothesized to influence residential sorting.
Armed with the actual address of each house, however, Nelson uses GIS software to link each mortgage to neighborhood-specific socio-economic data. These covariates in turn tell the story of how spatial data (and GIS software) have transformed the way we do policy analysis. Nelson (2010) links each mortgage to: (1) local school quality metrics from the California Board of Education Datasets (CBEDS); (2) private school availability data from the California Private School Directory; (3) school-district specific home data from the School District Data Book; (4) aggregate property-type data from observed mortgages; (5) racial composition data from CBEDS; (6) land use data from the California Fire and Resource Assessment Program; (7) crime incidence data from the California Crime Index; and (8) weather and air quality data from RAND Corporation. In the end, Nelson (2010) is able to demonstrate that credit scores are a major driver of residential sorting "over and above income" (64). Credit scoring practices "disparately impact racial minorities" (64), thus heightening residential segregation. This level of analysis is only feasible due to the geo-code identifiers associated with each observation. These data enable Nelson to link each observation to a specific on-the-ground place and to account for the relative geographic location of each observation. (We also note that these data are another salient example of the extent to which researchers leverage administrative data from multiple sources.)
Enabling a finer spatial scale is not the inherent value presented by GIS data; in fact, many of the datasets from which Nelson draws covariates are of a scale greater than census tracts. Rather, the benefit of this approach is in using the spatial location of each observation to account for the spatial distribution and relative location of each observation (as opposed to linking each data point to a unique nominal identifier of any scale, such as by school district or street corner). In particular, given that residential sorting is itself a spatial process in that the compositions of adjoining neighborhoods are not independent of one another, analyzing these data in a spatial framework is key to understanding policy effects.
The ability to model the spatial distribution of observations and use geocoding to link disparate datasets is a powerful development, but the type of data contained in GIS is an additional tool which enhances the capacity of policy analysis. Whereas traditional datasets represent information using numbers (e.g., age, test score, tons of carbon released) or perhaps character strings (e.g., city, college, species), GIS store data as points, lines, and polygons. For instance, houses or high schools are stored as points (each with a geocode reflecting its physical location) with associated attributes (e.g., population, number of students) attached. Rivers, highways, or public transit routes are stored as lines. States, watersheds, or school districts are then stored as polygons, representing their spatial extent. This is a completely different way of representing the world and thus adds new dimensions to policy analysis.
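The following sketch (Python; geometries are illustrative) shows this data model in miniature: a school district stored as a polygon and schools stored as points, with a standard ray-casting test determining which points fall inside the polygon:

```python
# A minimal sketch of the GIS data model: geometry plus attributes.
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon
    given as a list of (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a ray extending to the right of the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A hypothetical school district stored as a polygon (its spatial
# extent), and two schools stored as points.
district = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
school_a = (1.0, 1.0)   # falls inside the district
school_b = (5.0, 1.0)   # falls outside the district
```

GIS software implements these geometric operations natively and at scale; the point is that spatial containment, adjacency, and overlap become computable relations rather than hand-coded labels.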
Andersson and Gibson (2007) represent this new dimension in spatial policy analysis by using GIS to analyze how decentralized governance impacts deforestation in Bolivia. Andersson and Gibson (2007) compare satellite land cover and land use imagery from different time periods, overlaying these images on a polygonal representation of local jurisdictions to examine how municipal governance characteristics (property rights, technical capacity, field presence) drive – or prevent – deforestation. Since the Landsat data are images, it simply would not be feasible to conduct this analysis using a traditional municipal dataset, which might simply contain an indicator for each municipality and associated covariates such as land size, governance type, etc.; thus, GIS – and their ability to represent the spatial extent of various phenomena – are critical. GIS is used to develop several different dependent variables for this analysis, including a measure of all deforestation between 1993 and 2000, a measure of deforestation on lands specifically designated for other allowed usages (using an overlay of land ownership and land use zoning), and a measure of deforestation on protected lands (using a similar overlay). Since GIS enables Andersson and Gibson to treat municipalities as polygons instead of simple observations, they are able to examine how municipalities perform on each of these measures. This allows Andersson and Gibson to conclude that local institutional structure and performance affect unauthorized deforestation, but have little effect on either permitted (i.e., authorized) or total deforestation (Andersson and Gibson 2007, 99). These findings highlight the critical role that GIS plays in this analysis.
Since total deforestation was largely invariant across localities and unauthorized deforestation obviously does not show up in standard administrative data, Andersson and Gibson would not have been able to analyze the role of local institutions without the ability to map and account for deforestation using remote sensing and GIS analysis. In this vein, it is important to note that using GIS does not mean that analyses are limited to presenting maps and figures; GIS enables analysts to conduct robust statistical analyses. In fact, the Andersson and Gibson (2007) example above does not present a single figure, instead using statistical features available within GIS to produce quantitative metrics representing spatial outcomes, which Andersson and Gibson then use within a series of two-stage least-squares regressions.
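The overlay logic behind such measures can be sketched with toy rasters (Python; grids and values are illustrative, standing in for classified satellite imagery): boolean grids for forest cover at two dates and a protected-area mask yield cell counts of total and unauthorized deforestation:

```python
import numpy as np

# Toy classified "rasters" standing in for satellite imagery: forest
# cover at two dates plus a protected-area mask (all values illustrative).
forest_1993 = np.array([[1, 1, 1, 0],
                        [1, 1, 1, 0],
                        [1, 1, 0, 0],
                        [0, 0, 0, 0]], dtype=bool)
forest_2000 = np.array([[1, 0, 1, 0],
                        [1, 0, 1, 0],
                        [1, 0, 0, 0],
                        [0, 0, 0, 0]], dtype=bool)
protected   = np.array([[0, 1, 0, 0],
                        [0, 1, 0, 0],
                        [0, 1, 0, 0],
                        [0, 1, 0, 0]], dtype=bool)

# Overlay: cells forested in 1993 but not in 2000 were deforested ...
deforested = forest_1993 & ~forest_2000
# ... and intersecting with the protected mask isolates unauthorized loss.
unauthorized = deforested & protected
unauthorized_share = unauthorized.sum() / protected.sum()
```

The scalar outcomes produced this way, aggregated per municipal polygon, are exactly the kind of quantitative metrics that can then feed a conventional regression.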
These examples demonstrate that GIS is enabling policy analysts to address questions using analyses that previously were not feasible. We may not think of public policy as "spatial" but in reality virtually all policy interventions are affected both by specific place-related contextual factors as well as factors related to neighboring places. Thus, we observe similar analyses related to land-use regulation and rental housing (Schuetz 2009), job creation and business enterprise zones (Kolko and Neumark 2010), traffic congestion pricing (Harsman and Quigley 2010), mapping jobs to urban ethnic enclaves (Liu 2010), spatial regression of mortgage assistance program effects (Di et al. 2010), and hazardous waste enforcement in low-income and higher minority locations (Konisky 2009).
The use of GIS and spatial data is highly prevalent in disciplinary journals. GIS-based policy analyses are published frequently in geography-related journals, such as the Journal of Transport Geography; for instance, Macharis and Pekin (2009) analyze how transportation regulations and related policies in Belgium affect the intermodal transportation market. Such analyses are by no means limited to geographically oriented journals, however. In the Journal of Forest Policy and Economics, Gaither et al. (2011) use GIS to mesh spatial data concerning social vulnerability and wildland fire vulnerability in the southeastern United States to identify "hot spots" of vulnerability and assess how mitigation programs compare inside and outside of these particularly vulnerable areas.
Also worth noting is the development of journals specifically oriented towards the use of spatial data in policy analysis, such as the Journal of Applied Spatial Analysis and Policy, first published in 2008. In a recent article, Buckner et al. (2013) examine how demographic change in Northern England will impact existing infrastructure for housing and health care (such as long-term care facilities and assisted living homes). A more traditional analysis might assess the number of "beds" or "rooms" available in comparison to projected demographic numbers. However, the location of services also plays a considerable role in determining policy sufficiency and effectiveness. The use of spatial data allows Buckner et al. (2013) to compare the spatial distribution of social services and care facilities with the spatial distribution of people who need – or are projected to need – such services, thus providing a more nuanced picture of capacity and infrastructure challenges.
The term "Big Data" represents multidimensional information, often on a massive scale, that is difficult to process or analyze using conventional empirical and statistical tools (Eaton et al., 2012). Two factors have driven the exponential growth of Big Data in recent years. The first factor is digitization. The digitization of images, video, environmental sensor readings, online behavior, purchases, social media, and smart phones (to name a few) generates billions of data points each day. The development of cloud computing is the second factor driving Big Data. When data exist "on the cloud," the data are housed across networks instead of in a single physical location. Cloud computing has expanded the storage capacity of data beyond conventional measures. Indeed, Bollier (2010) observes that much of the enthusiasm around Big Data has centered on storage capabilities rather than on "superior ways to ascertain useful knowledge" (p. 14).
Eaton and colleagues (2012) present three characteristics of Big Data – volume, variety, and velocity – that are useful in distinguishing Big Data from the conventional data used in extant policy analysis. We briefly discuss each of Big Data's characteristics and then turn to a discussion of Big Data and policy analysis.
The volume of Big Data verges on incomprehensible. Consider that kilobytes (10³ bytes) describe the size of a typical word-processing document and a CD-ROM holds 600 megabytes (10⁶ bytes) of data. Big Data are measured in terabytes (10¹²), petabytes (10¹⁵), exabytes (10¹⁸), and yottabytes (10²⁴). To contextualize these figures, in early 1997 the entire World Wide Web contained approximately two terabytes of textual data (Lesk, 1997, 218); in 2013, Twitter users generated more than three times this amount of data (7 TB) each day (Eaton et al., 2012, 5).
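To make these orders of magnitude concrete, the brief Python sketch below encodes the decimal byte prefixes and reproduces the comparison between the 1997 web and a single day of Twitter traffic; the figures are simply those cited above.

```python
# Decimal byte prefixes (SI), from kilobyte through yottabyte.
UNITS = {
    "kilobyte": 10**3,
    "megabyte": 10**6,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte":  10**18,
    "yottabyte": 10**24,
}

# The entire web circa early 1997 (~2 TB of text) versus one day of
# Twitter output in 2013 (~7 TB), per the sources cited in the text.
web_1997 = 2 * UNITS["terabyte"]
twitter_daily_2013 = 7 * UNITS["terabyte"]

print(twitter_daily_2013 / web_1997)  # 3.5 (more than three times)
```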
Big Data's structure, or lack thereof, is another distinctive characteristic. Standard quantitative data used in policy analysis and econometrics are relational. Relational data are information stored in tables as rows and columns, where values in certain cells may link to additional tables. Non-relational data do not fit this model. Instead, the term non-relational captures the text, environmental sensor, video and audio, and transactional data that are dynamic and not easily categorized in rows and columns. This variety presents obvious empirical challenges for traditional policy analysis.
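The contrast can be illustrated with a minimal sketch. The records below are entirely hypothetical: the first two tables follow the relational row-and-column model, linked by a key, while the third is a schema-free, non-relational record of the kind Big Data sources produce.

```python
# Relational: fixed rows and columns; a key links one table to another.
households = [
    {"household_id": 1, "state": "WI", "income": 52000},
]
members = [
    {"household_id": 1, "age": 34},  # foreign key into `households`
]

# Non-relational: a nested, schema-free record, e.g. a social-media post
# mixing free text, coordinates, and attachments of varying shape.
post = {
    "text": "Road flooding on Main St.",
    "geo": {"lat": 43.07, "lon": -89.40},
    "media": ["photo1.jpg"],  # may be absent or vary in length per record
}

# The relational link is a simple key match; the non-relational record
# has no predetermined set of columns to match against.
linked = [m for m in members
          if m["household_id"] == households[0]["household_id"]]
print(len(linked))  # 1
```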
"Velocity" is the third characteristic that separates Big Data from conventional policy analysis data. In a traditional data collection effort, a researcher would likely focus on the accumulation of data: reaching a certain amount (i.e., a predetermined sample size n) matters because the researcher needs a minimum amount of data to carry out a particular statistical analysis. With Big Data, on the other hand, researchers are interested in the flow of data. In other words, the speed at which data are created, not the accumulation of data, matters. The focus on the velocity of data is largely driven by Big Data's short shelf-life, a topic we discuss later.
Turning to policy analysis applications of Big Data, after a thorough search of the literature we find no policy analysis that uses Big Data. Like others (e.g., Choi & Varian, 2012), we are optimistic, however, that policy analysis will soon leverage the opportunities found in Big Data. We surmise that two sources of data in particular – Google search data and data from federal agencies – will potentially inform future policy analysis.
In an analysis of internet search trends, Ripberger (2011) notes that Google controls 86% of the worldwide market and 70% of the U.S. market for internet searches. This market share allows Google to aggregate immense amounts of data from the more than one billion searches carried out daily around the world (Google, n.d.). Among sources of Big Data for policy analysis, Google is unique because much of the search data collected is publicly available (at least at a highly aggregated level) through Google Trends. Google Trends indexes, at both daily and weekly intervals, the volume of queries that users submit through Google.
Using Google data for "now-casting" is increasingly common, with implications for policy analysis. Now-casting uses present-day data to predict the present (Choi & Varian, 2012) rather than to forecast future policy-relevant outcomes. Google Flu Trends is a salient example of how "predicting the present" has potential as an effective policy analysis tool. Google Flu Trends aggregates search terms such as "flu" and "cold/flu remedy" by geography to track influenza rates in the U.S. (Ginsberg et al., 2008). Flu trend data detect regional outbreaks of influenza 7-10 days before the Centers for Disease Control and Prevention is able to do so with conventional surveillance systems (Carneiro & Mylonakis, 2009). City-level search trends also correlate highly with emergency room visits and waiting times (Dugas et al., 2012).
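The now-casting logic can be sketched in a few lines. The weekly figures below are hypothetical (the search index mimics the 0-100 scale Google Trends reports), and the Pearson correlation is computed from scratch; the point is simply that a freely available search-volume series can track an outcome series in near real time.

```python
# Hypothetical weekly data: a search-volume index (0-100, in the style
# of Google Trends) and reported flu cases for the same seven weeks.
search_index = [12, 18, 35, 60, 88, 70, 40]
flu_cases    = [110, 150, 300, 520, 790, 640, 360]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# A correlation near 1 suggests the search series could serve as a
# timely proxy for the (slower-arriving) surveillance series.
print(round(pearson(search_index, flu_cases), 3))
```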
We see a clear opportunity for policy analysts: if a healthcare policy, for example, were implemented in certain cities or areas of the country, Google Trends could provide a massive amount of internet search data (i.e., Big Data) offering insight into the policy's effects on individual information-seeking behavior and other associated behavioral outcomes. Since Choi and Varian (2012) note that job-seeking behavior on Google also correlates closely with unemployment measures, the effects of workforce development policies are another area that may be empirically explored with internet search data.
A 2010 report to the president and Congress argued that every federal agency needs a Big Data strategy (President's Council of Advisors on Science and Technology, 2010). In 2012 the Obama administration embraced this recommendation by announcing the "Big Data Research and Development Initiative" (White House, 2012). The hallmark of the initiative is the dedication of $200 million by six federal departments and agencies toward efforts that advance the technologies needed for federal agencies to manage, analyze, and share Big Data. The departments and agencies involved with the Big Data initiative are the National Science Foundation (NSF), Health and Human Services/National Institutes of Health (HHS/NIH), the Department of Energy (DoE), the Department of Defense (DoD), the United States Geological Survey (USGS), and the Defense Advanced Research Projects Agency (DARPA).
The potential uses of Big Data provided by federal agencies are numerous; for illustrative purposes we outline two here. The digitization of health records presents an obvious opportunity for the evaluation of healthcare policy at the federal and state level. Aggregating healthcare data and tracking policy-relevant health outcomes can extend HHS/NIH research into populations that would otherwise be difficult to reach using conventional data collection efforts. Data collected from environmental sensors are another source that can potentially provide a wealth of information to policy researchers. With a network of sensors around the globe constantly feeding real-time data to scientists, the USGS has the capacity to measure policy effects ranging from global warming initiatives to how local watershed policy affects the organic material of streams and rivers (USGS, 2012).
We close this discussion of Big Data with a promising trend and a cautionary observation. The promising trend we observe is the development of courses and programs at U.S. universities focused on Big Data. The National Science Foundation, for example, has earmarked a $2 million award to support undergraduate, graduate, and postdoctoral training on graphical and visualization techniques using Big Data. More than a dozen universities also presently offer courses on Big Data (Schutt, 2012). This academic focus on Big Data will provide the training for a next generation of policy analysts.
Our cautionary observation is that policy analysts must be aware of the many ethical concerns that accompany the use of Big Data. The use of Big Data by policy analysts, government officials, and the private sector presents noteworthy issues related to privacy, civil liberties, and consumer freedom (Bollier, 2010). An individual's physical location can be easily tracked through Twitter use (Azmandian et al., 2013), and personal information collected online is increasingly easy to identify even when researchers purport to use anonymized data (Ohm, 2010; Tene & Polonetsky, 2012). Although any type of sensitive data requires conventional precautions to protect individuals' privacy, the constellations of variables in Big Data require added safeguards against confidentiality abuses.
Conventional wisdom suggests that increasingly complex statistical techniques will continue to drive policy analysis innovation. We argue, conversely, that the types of data discussed in this article present researchers and analysts with novel opportunities to analyze public policy in ways that have been previously impossible (or at least highly impractical) given empirical constraints. These new opportunities for empirical policy analysis are constrained not by statistics but rather by access to data granted by state and local governments and private companies (e.g. Google).
We believe our field would be better served by discussing and clarifying the distinctions and commonalities among these burgeoning sources of data. Administrative data, GIS data, and Big Data are not mutually exclusive, but several distinctions deserve attention. Administrative data are curated and maintained in a repository from which researchers can query or download data, making these data often applicable to numerous research questions. Conversely, Big Data typically require an a priori determination of what data are needed for the research question at hand. For instance, researchers interested in the use of Twitter for disaster response cannot record and save all global Twitter activity, but must instead specify search terms or values, for instance "#OilSpill", that will cause a given tweet to be stored and saved. This clearly changes the way in which projects must be conceived and designed, and likewise changes the types of questions that can be asked, since the researcher is essentially sampling from a giant data stream in real time.
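The sampling problem described above can be sketched as a simple stream filter. The messages and search terms below are hypothetical, but the logic mirrors real-time collection: any message that fails the pre-specified filter at the moment it arrives is discarded and cannot be recovered later.

```python
# Terms specified a priori; only matching messages are retained.
KEEP_TERMS = {"#oilspill", "#spill"}

def filter_stream(stream):
    """Yield only messages containing at least one pre-specified term."""
    for message in stream:
        tokens = set(message.lower().split())
        if tokens & KEEP_TERMS:
            yield message  # stored and saved
        # non-matching messages pass by unrecorded

# A simulated slice of the live stream.
simulated_stream = [
    "Cleanup crews deployed #OilSpill",
    "Lunch was great today",
    "Volunteers needed on the coast #oilspill",
]
kept = list(filter_stream(simulated_stream))
print(len(kept))  # 2
```

Because the filter runs as the data arrive, the research question must be fixed before collection begins, which is precisely the design constraint noted above.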
GIS data are also not mutually exclusive from either Big Data or administrative data but key distinctions exist. Namely, GIS data must contain data about spatial location that can be analyzed via geographic information systems. Many administrative datasets available from agencies such as the United States Geological Survey are GIS data related to topics such as land cover or hydrography. Spatial data are stored as points, lines, polygons, and rasters (e.g., a matrix of cells analogous to the pixels on a screen) (UWCGIA 2013). Similar to administrative data and Big Data, the overlap between GIS data and Big Data increases as technological capabilities grow (ArcNews 2013).
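As a minimal illustration of the four spatial data types just named, the sketch below represents each with plain Python structures and adds a trivial bounding-box containment test; real GIS analysis would of course rely on dedicated spatial software, and the coordinates here are hypothetical.

```python
# The four spatial data types, in plain Python form.
point   = (43.07, -89.40)                    # a single location
line    = [(0, 0), (1, 1), (2, 1)]           # an ordered path, e.g. a road
polygon = [(0, 0), (4, 0), (4, 4), (0, 4)]   # a closed boundary, e.g. a parcel
raster  = [[0, 1, 1],                        # a grid of cells, e.g. land cover
           [0, 0, 1],                        # (each cell holds a class code,
           [1, 0, 0]]                        #  analogous to pixels on a screen)

def in_bounding_box(pt, poly):
    """Crude containment test: is pt inside poly's bounding rectangle?"""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs) <= pt[0] <= max(xs) and min(ys) <= pt[1] <= max(ys)

print(in_bounding_box((2, 2), polygon))  # True
print(in_bounding_box((5, 5), polygon))  # False
```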
Finally, the use of administrative, geographic, and Big Data for policy analysis does not address the role of random assignment in policy research. As we noted, the debate on randomization is multifaceted and will undoubtedly endure. We agree with Berlin and Solow (2009) that the questions posed by policymakers ought to determine the method of analysis. Randomized controlled trials will likely remain the "gold standard" in many cases. In other cases, when a research question lends itself to nonexperimental methods, innovations in data represent a new frontier for enterprising policy researchers and analysts.
1 The Landsat program is a joint initiative by the United States Geological Survey (USGS) and the National Aeronautics and Space Administration (NASA), initiated in 1972, that uses a system of orbiting satellites to take imagery of various aspects and attributes of land use and land change.