Measuring Happiness Is Harder (But Maybe Also Easier) Than You Think

tl;dr: Experiential measures of subjective well-being solve, in a clear and obvious way, one possible threat to validity that affects global self-reports. The obviousness of this advantage, however, sometimes leads people to overlook the fact that experiential measures may have their own unique threats to validity. Specifically, the fact that experiential measures require respondents to answer the same questions over and over again may pull for responses that deviate from what the researcher expects. Because we don’t know the relative impact of these distinct threats to validity, it is possible that, on balance, global reports provide a better way of assessing well-being.

I once heard a talk where someone used a reaction-time based measure to assess a psychological process. After hearing the claims about the psychological process, I asked what evidence the speaker had that this particular reaction-time based measure was valid. The speaker seemed surprised by the question. Based on his response, I think he assumed that reaction times were so obviously and inherently linked to the underlying process that no validity evidence was even needed. In fact, given the way he talked about the phenomenon, he seemed to think that the reaction times he collected weren’t measures of some underlying process as much as the process itself.

This example stuck with me because I see less extreme versions of this error somewhat frequently in research on subjective well-being. Specifically, when researchers compare global measures of well-being (measures like life satisfaction, which ask respondents to reflect on their life and make an overall judgment about how that life is going) to more experiential measures (where affect is assessed in the moment as it is happening), they sometimes take for granted that the experiential measures are somehow closer to the underlying process or construct and thus require less evidence for validity. This is a problem because if the validity of a measure is assumed, then people may not bother to actually test it.

Testing the Validity of the Day Reconstruction Method

As I pointed out in my last post, the day reconstruction method (DRM) is an approach to measuring subjective well-being that is supposed to capture the richness of more traditional experiential measures like the experience sampling method (ESM), without the high burden that accompanies these alternatives. In the DRM, respondents are asked to reconstruct the experiences of a single prior day, dividing that day into specific episodes that are then rated for activities, experiences, and emotions. Because the time lag between experience and report is not long, and because the focus of the report is relatively constrained and concrete, it has been assumed that the DRM captures information that is quite similar to the ESM (Kahneman et al. 2004; see Robinson and Clore 2002 for a discussion of why this might just work). As that post also showed, however, the correspondence between the two approaches can vary pretty dramatically depending on the specific question that is being asked, and especially for within-person associations, convergence can be quite low (take a look at the preprint for way more information than you’ll get in either of these two blog posts).

ESM Versus DRM

In addition to comparing ESM and DRM to one another, it is also useful to see how strongly they correlate with additional criteria. Unfortunately, deciding what measure to use as an external criterion is not an easy task, as there is no clear gold-standard measure of well-being that one can use to validate a new technique. In a series of papers, my colleagues and I have tried to examine the links between experiential measures and a wide range of alternative forms of assessment and external criteria (not all of which will be covered in this post).

For instance, the table below shows the correlation between ESM and DRM measures of affect and four self-report measures of well-being (a single-item measure of life satisfaction; the Satisfaction With Life Scale; and global, retrospective measures of experienced positive and negative affect), along with two informant-report measures of life satisfaction. A few patterns are worth pointing out. First, at least for experiential positive affect, correlations between the global self-reports and ESM measures tend to be stronger than the correlations with the DRM measures. For negative affect, on the other hand, the differences are very small (and nonsignificant); and for the informant-rated criteria, correlations with the two types of measures are virtually identical. So far, at least for these limited criteria, there may be a slight advantage for ESM versus DRM, but the differences are not large.¹

Table 1: Comparing the DRM and ESM to global self-report and informant-report measures of well-being

	LS	SWLS	Global PA	Global NA	Informant LS	Informant SWLS
ESM PA	0.47	0.47	0.51	-0.27	0.26	0.26
ESM NA	-0.33	-0.28	-0.33	0.40	-0.12	-0.11
DRM PA	0.38	0.36	0.38	-0.23	0.26	0.25
DRM NA	-0.31	-0.25	-0.32	0.36	-0.08	-0.09

Note. DRM = Day Reconstruction Method; ESM = Experience Sampling Method; PA = Positive Affect; NA = Negative affect; SWLS = Satisfaction With Life Scale; LS = Single Item Life Satisfaction.
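
To make the size of these gaps concrete, here is a minimal Python sketch that subtracts the absolute DRM correlations from the absolute ESM correlations in Table 1. The correlations are copied from the table. The function name `esm_advantage` and the choice to compare absolute magnitudes are assumptions made for this illustration, not part of the original analysis, and the sketch says nothing about statistical significance.

```python
# Correlations copied from Table 1 (rows: experiential measure; columns: criterion).
criteria = ["LS", "SWLS", "Global PA", "Global NA", "Informant LS", "Informant SWLS"]
table1 = {
    "ESM PA": [0.47, 0.47, 0.51, -0.27, 0.26, 0.26],
    "ESM NA": [-0.33, -0.28, -0.33, 0.40, -0.12, -0.11],
    "DRM PA": [0.38, 0.36, 0.38, -0.23, 0.26, 0.25],
    "DRM NA": [-0.31, -0.25, -0.32, 0.36, -0.08, -0.09],
}


def esm_advantage(affect):
    """Return |r_ESM| - |r_DRM| per criterion (positive means ESM is stronger).

    `affect` is "PA" or "NA". Comparing absolute values is an illustrative
    choice so that positive and negative correlations are on the same footing.
    """
    esm = table1[f"ESM {affect}"]
    drm = table1[f"DRM {affect}"]
    return {c: round(abs(e) - abs(d), 2) for c, e, d in zip(criteria, esm, drm)}


for affect in ("PA", "NA"):
    print(affect, esm_advantage(affect))
```

With the values in Table 1, the largest gap is 0.13 (experiential positive affect with global positive affect), none of the negative affect gaps exceeds 0.04, and the gaps for the two informant criteria are 0.00 and 0.01 for positive affect.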

ESM and DRM Versus Global Reports

Once we start thinking about comparing the validity of the DRM to that of the ESM, it is also possible to think about how both compare to the validity of global self-reports. Specifically, if we focus on informant reports as a useful criterion, we can test whether correlations are stronger with the DRM versus the frequently criticized global self-report measures. And in our studies, we consistently find that correlations with global self-report measures are at least as strong as, if not stronger than, correlations with the DRM. For instance, the correlations with the two informant-reported life satisfaction measures are r = .37 and .35 for global positive affect and -.27 and -.25 for negative affect, values that are consistently (though not substantially) larger than those from the table above. Our study is not especially well-powered to detect whether these are significantly larger than the correlations with the DRM and ESM; but the same general pattern emerges in a different paper we recently published. The table below shows correlations from two studies that included global, informant, and DRM measures of positive and negative affect (Hudson et al. 2019).²

Table 2: Comparing the DRM and Global Reports to Informant Reports of Affect

	Study 1 PA	Study 1 NA	Study 2 PA	Study 2 NA
Global PA	0.35	-0.13	0.27	-0.33
Global NA	-0.17	0.30	-0.20	0.27
DRM PA	0.25	-0.09	0.19	-0.31
DRM NA	-0.15	0.21	-0.18	0.11

Again, although the differences are not large, for each construct, global self-reports did a bit better in predicting informant reports than did DRM-based measures. I don’t think informant reports provide the perfect criterion, and it is important to note that all of these correlations are pretty far from an r of 1.00, but the fact that global measures correlate more strongly with informant reports (i.e., have greater evidence of validity using this criterion) than do experiential measures is worth noting.
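
The comparison in this paragraph can be sketched in a few lines of Python. The snippet below pulls the same-construct cells out of Table 2 (for example, global PA with informant-rated PA) and subtracts the DRM value from the global value. It assumes that the columns of Table 2 are informant-rated PA and NA in each study, as the table title suggests. The dictionary layout and the function name `global_minus_drm` are hypothetical, chosen only for illustration.

```python
# Same-construct ("convergent") correlations with informant reports,
# copied from Table 2. Keys are (study, construct).
# Assumption: Table 2's columns are informant-rated PA and NA.
convergent = {
    "Global": {(1, "PA"): 0.35, (1, "NA"): 0.30, (2, "PA"): 0.27, (2, "NA"): 0.27},
    "DRM": {(1, "PA"): 0.25, (1, "NA"): 0.21, (2, "PA"): 0.19, (2, "NA"): 0.11},
}


def global_minus_drm():
    """Return the global-minus-DRM difference for each study and construct."""
    return {
        key: round(convergent["Global"][key] - convergent["DRM"][key], 2)
        for key in convergent["Global"]
    }


for (study, construct), diff in global_minus_drm().items():
    print(f"Study {study}, {construct}: global - DRM = {diff:+.2f}")
```

All four differences are positive, ranging from 0.08 to 0.16, which is the "a bit better" pattern described above. The sketch only does the subtraction and does not test whether any difference is statistically reliable.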

Could Experiential Measures Be Less Valid Than Global Measures?

So could experiential measures actually be less valid than global measures, despite their intuitively appealing features? I think that some additional results from our studies comparing the DRM to ESM, when combined with our prior work on within-person assessments, hint at some reasons why this might be the case.

Some People Report More Within-Person Variability Than Others

The first piece of evidence comes from a seemingly unrelated set of studies: Studies on within-person variability in personality. The idea that some people are more variable—in their affect, their levels of energy, or even their personality—seems pretty intuitive. Many studies have examined within-person variability, assuming that it is a meaningful individual difference in its own right. However, my former student, Brendan Baird, showed that we should be concerned about what we get when we assess within-person variability using repeated self-report measures.

The problem is that variability in self-reports of one type of construct correlates too strongly with variability in ratings of completely unrelated phenomena. For instance, in our study on within-person variability in personality, we asked people about their personality in different roles (e.g., what are you like as a friend? What are you like as a worker?). We found, however, that cross-role variability in ratings of one’s own personality correlated .64 with cross-role variability in ratings of one’s friend’s personality, even though people didn’t see themselves to be that similar to their friends in terms of their average personality (Baird, Lucas, and Donnellan 2017). That seems too strong to me. Even worse, cross-role variability in one’s own personality correlated moderately (around r = .45) with variability in ratings of the personality of various Simpsons characters and even with variability in the desirability of various “neutral objects” like the local public transportation system, local speed limits, and 8 1/2 by 11 inch paper (for an open-access version of this paper, see here).³

This could be a problem if we are looking at the effects of within-person changes in situations on variables like personality and affect, as those individuals with more variability may appear to be more strongly affected. In addition, depending on the nature of the situational factors that are being examined, within-person variability could affect the overall mean levels that people report. Importantly, the associations were also moderate in size (though weaker) when actual experiential measures were used (daily-diary methods, in this case), and even ESM-based measures of variability correlate pretty strongly with the cross-role measures that were the primary focus (Baird, Le, and Lucas 2006). It does seem to be the case, though, that the effects are exacerbated when the situations or roles across which variability occurs are assessed in a single sitting (as is true with the DRM).
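
One way to see how variability scores for unrelated targets could end up correlated is a small simulation. The Python sketch below is not the analysis from Baird, Lucas, and Donnellan (2017). It is a toy model built on one assumption: each respondent has a personal "response style" that scales how much their ratings spread out, whatever they are rating. Every name and parameter here (`simulate`, the 0.3 to 1.7 style range, five roles, 1,000 simulated people) is hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_people=1000, n_roles=5, shared_style=True, seed=1):
    """Correlate cross-role SDs of self-ratings with those of friend-ratings.

    Toy model (an assumption, not the published method): each rating is a
    constant plus random noise, and the noise is scaled by a person-specific
    response-style factor. With shared_style=True the same factor scales both
    sets of ratings; otherwise each set gets its own independent factor.
    """
    rng = random.Random(seed)
    sd_self, sd_friend = [], []
    for _ in range(n_people):
        style_self = rng.uniform(0.3, 1.7)
        style_friend = style_self if shared_style else rng.uniform(0.3, 1.7)
        self_ratings = [3 + style_self * rng.gauss(0, 1) for _ in range(n_roles)]
        friend_ratings = [3 + style_friend * rng.gauss(0, 1) for _ in range(n_roles)]
        # Within-person variability = SD of one person's ratings across roles.
        sd_self.append(statistics.stdev(self_ratings))
        sd_friend.append(statistics.stdev(friend_ratings))
    return pearson(sd_self, sd_friend)


r_shared = simulate(shared_style=True)
r_independent = simulate(shared_style=False)
print(f"shared response style:      r = {r_shared:.2f}")
print(f"independent response style: r = {r_independent:.2f}")
```

In this toy model, the self and friend ratings share no substantive content at all, yet the two variability scores should correlate clearly above zero whenever the response style is shared and hover near zero when it is not. That is the sense in which a strong correlation between variability in ratings of unrelated targets is a warning sign.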

Simply Asking About Variability Causes More Variability

A second concern, and one that is probably more relevant to our assessment of the validity of experiential measures, comes from a different study that Brendan conducted, one in which he experimentally manipulated whether people were asked about their personality in multiple roles versus just one (Baird and Lucas 2011). The trick was figuring out how to assess the effect of this manipulation on reports of variability, because there is no variability when only a single role is assessed. So before participants came to the lab for the study, we administered a short personality questionnaire asking them what their personality was like “in general.” Then, once they got to the lab, half were asked to report on their personality in a single role (e.g., as a friend) and half were asked to complete the same questionnaire multiple times for multiple roles. We then compared the role-specific ratings to the general ratings that had been provided a few days before.

The results were pretty striking. Those people who reported on multiple roles showed greater discrepancies and weaker correlations with their ratings of what they were like in general. In addition, the effect of specific roles was inflated in the multiple role condition in a direction that, in our interpretation, was consistent with stereotypes about how people are supposed to behave. For example, the effect of the friend role on extraversion was more positive in the multiple-role condition than the single-role condition. It seemed as though participants in the single-role condition interpreted our questions about their role-specific personality as just another question about what they were like in general, whereas those in the multiple-role condition interpreted the same question as one about how they differ from what they’re typically like when in that role. This is consistent with a broader literature on how the questions we ask shape the answers that respondents provide (Schwarz 1999).
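
The three comparisons described above (discrepancy from the general ratings, correlation with the general ratings, and the size of the role effect) can be sketched as follows. The ratings are invented purely to show the computation; they are not data from Baird and Lucas (2011), and the function name `compare_to_general` is hypothetical.

```python
from statistics import fmean


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def compare_to_general(general, role):
    """Summarize how role-specific ratings relate to earlier general ratings."""
    diffs = [r - g for g, r in zip(general, role)]
    return {
        "mean_abs_discrepancy": fmean([abs(d) for d in diffs]),
        "role_effect": fmean(diffs),  # signed shift relative to "in general"
        "correlation": pearson(general, role),
    }


# Invented extraversion ratings (1-5 scale) for six participants per condition.
general_single = [2, 3, 3, 4, 4, 5]  # pre-lab "in general" ratings
friend_single = [2, 3, 4, 4, 4, 5]   # "as a friend", single-role condition
general_multi = [2, 3, 3, 4, 4, 5]
friend_multi = [4, 3, 5, 4, 5, 4]    # "as a friend", multiple-role condition

single = compare_to_general(general_single, friend_single)
multiple = compare_to_general(general_multi, friend_multi)
for name, result in (("single-role", single), ("multiple-role", multiple)):
    print(name, {k: round(v, 2) for k, v in result.items()})
```

In these made-up numbers, the multiple-role condition shows a larger discrepancy, a more positive friend-role effect, and a weaker correlation with the general ratings, mirroring the direction of the findings described in the text.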

Back to the DRM

Why does this matter for experiential measures of subjective well-being? Well, let’s look at a typical set of findings that emerge when these measures are used. In general, research like that reported in Kahneman et al. (2004) usually shows that DRM measures correlate less strongly with stable, person-level factors (like income, health, etc.) and more strongly with other situational variables that are measured at the same time that experienced well-being is assessed. This pattern is often interpreted to mean that people rely on their beliefs about what factors should matter for happiness (factors like income and health) when answering global measures, and that the experiential measures reveal that what really matters is how we spend our time. The problem is that the DRM looks a lot like our experimentally manipulated multiple-role condition: a condition that pulled for reports that differed from what one is like in general, as well as reports that exaggerated the impact of the specific roles that were made salient.

In fact, when we compared the DRM to the ESM in our new paper, the pattern looked pretty consistent with the effects of the multiple-role condition in Baird and Lucas (2011). I won’t go into detail here (this post is already long enough), but as I noted above, the DRM was often less strongly correlated with global self-reports (i.e., what one thinks they are like in general) than was the ESM. In addition, the DRM was more strongly affected by situational factors and more closely linked with global stereotypes about how people typically feel in various situations than was the ESM. By repeatedly asking “what were you doing and how were you feeling”—especially in a single sitting—researchers may be subtly communicating to participants that they are less interested in what the respondents are like in general and are more focused on participants’ opinions about the situations in which they find themselves. And although I think that this type of effect is likely to be most pronounced when using methods like the DRM (which are completed in a single sitting), it is possible that it occurs any time respondents are required to answer the same questions over and over again (including ESM).

Conclusion

Years ago, when I started doing research on subjective well-being, the story that you used to hear was that life circumstances don’t matter for well-being: How happy you are, it was said, is almost completely unrelated to the income you have, the health problems from which you suffer, or the number of positive and negative life events you’ve experienced. The very fact that these associations were so low was used as evidence that the measures simply could not be right (Schwarz and Strack 1999). This idea played into the counterintuitive narrative that people simply didn’t know, or at the very least couldn’t report, how happy they really were.

Over the years, the field started to recognize that this narrative resulted from weak studies with small sample sizes, combined with problems regarding the interpretation of effect sizes. Now we know that life circumstances do matter, and that global self-reports correlate in pretty intuitive ways with many life circumstance variables (e.g., Lucas 2007).

But around this time, another counterintuitive narrative began to take hold. As the number of studies that included the DRM (or other similar measures) grew, people began to find that these new measures tended to correlate very weakly with the stable life circumstance variables that had been the focus of so much well-being research. So now the narrative was that yes, life circumstances correlate with global well-being measures, but only because people believe that they should. When we assess actual well-being with experiential measures, we find the real association, which is pretty close to zero. Now the same evidence that was once used against global self-reports (their weak associations with life circumstance variables that simply had to matter) didn’t seem relevant when applied to the new class of measures. Moreover, the opposite set of results (the fact that global measures do correlate with life circumstance variables) was now being used as evidence against their validity.

I want to be clear that I do not think that global self-reports of subjective well-being are flawless. Indeed, I think there is still a lot to be concerned about and a lot more research we need to do. The point of my two posts is this: Although experiential measures seem to be able to fix a few obvious threats to the validity from which global self-reports might suffer, there may be additional threats to the validity of these alternative measures that are rarely considered. And because we don’t know the relative importance of these various threats to validity, it is still quite possible that the often-maligned global self-report measures of well-being—measures that are inexpensive, that are simple to understand, and that can easily be incorporated into just about any type of study—are our best bet for assessing people’s subjective sense of how well their lives are going.

References

Baird, Brendan M., Kimdy Le, and Richard E. Lucas. 2006. “On the Nature of Intraindividual Personality Variability: Reliability, Validity, and Associations with Well-Being.” Journal of Personality and Social Psychology 90 (3): 512–27.

Baird, Brendan M., and Richard E. Lucas. 2011. “‘ . . . And How About Now?’: Effects of Item Redundancy on Contextualized Self-Reports of Personality.” Journal of Personality 79 (5): 1081–1112.

Baird, Brendan M., Richard E. Lucas, and M. Brent Donnellan. 2017. “The Role of Response Styles in the Assessment of Intraindividual Personality Variability.” Journal of Research in Personality 69: 170–79. https://doi.org/10.1016/j.jrp.2016.06.015.

Hudson, Nathan W., Ivana Anusic, Richard E. Lucas, and M. Brent Donnellan. 2019. “Comparing the Reliability and Validity of Global Self-Report Measures of Subjective Well-Being with Experiential Day Reconstruction Measures.” Assessment. https://doi.org/10.1177/1073191117744660.

Kahneman, Daniel, Alan B. Krueger, David A. Schkade, Norbert Schwarz, and Arthur A. Stone. 2004. “A Survey Method for Characterizing Daily Life Experience: The Day Reconstruction Method.” Science 306 (5702): 1776–80. https://doi.org/10.1126/science.1103572.

Lucas, Richard E. 2007. “Long-Term Disability Is Associated with Lasting Changes in Subjective Well-Being: Evidence from Two Nationally Representative Longitudinal Studies.” Journal of Personality and Social Psychology 92 (4): 717–30. https://doi.org/10.1037/0022-3514.92.4.717.

Robinson, Michael D., and Gerald L. Clore. 2002. “Belief and Feeling: Evidence for an Accessibility Model of Emotional Self-Report.” Psychological Bulletin 128 (6): 934–60. https://doi.org/10.1037/0033-2909.128.6.934.

Schwarz, Norbert. 1999. “Self-Reports: How the Questions Shape the Answers.” American Psychologist 54 (2): 93–105.

Schwarz, Norbert, and Fritz Strack. 1999. “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications.” In Well-Being: The Foundations of Hedonic Psychology, edited by Daniel Kahneman, Ed Diener, and Norbert Schwarz, 61–84. New York, NY: Russell Sage Foundation.

¹ Of course, it is worth pointing out that, as the prior post showed, for this type of question (estimating between-person differences in affect) the DRM and ESM seem to provide similar info, so the fact that the correlations are close is not surprising.
² This study is also important because we administered the DRM three times over a two-month period and then modeled associations between stable latent traits. So the weaker associations with the DRM are not due to the fact that they assess only a single day’s worth of experiences.
³ I take the term “neutral objects” from the name of the scale we used for this task. But really, not all of these are actually neutral. They are, however, consistent across all participants in our study, and it is difficult to come up with explanations as to why systematic differences in variability across these items would be linked with variability in personality. Regardless, I think the comparisons with Simpsons ratings are stronger, but it’s hard to find appropriately licensed Simpsons images to use in this post.

Rich Lucas