Christina R. Studts, Jodi Polaha, Michiel A. van Zyl, Identifying Unbiased Items for Screening Preschoolers for Disruptive Behavior Problems, Journal of Pediatric Psychology, Volume 42, Issue 4, May 2017, Pages 476–486, https://doi.org/10.1093/jpepsy/jsw090
Abstract
Objective: Efficient identification and referral to behavioral services are crucial in addressing early-onset disruptive behavior problems. Existing screening instruments for preschoolers are not ideal for pediatric primary care settings serving diverse populations. Eighteen candidate items for a new brief screening instrument were examined to identify those exhibiting measurement bias (i.e., differential item functioning, DIF) by child characteristics.
Method: Parents/guardians of preschool-aged children (N = 900) from four primary care settings completed two full-length behavioral rating scales. Items measuring disruptive behavior problems were tested for DIF by child race, sex, and socioeconomic status using two approaches: item response theory-based likelihood ratio tests and ordinal logistic regression.
Results: Of 18 items, eight were identified with statistically significant DIF by at least one method.
Conclusions: The bias observed in 8 of 18 items made them undesirable for screening diverse populations of children. These items were excluded from the new brief screening tool.
Introduction
Preschool-aged children with early-onset disruptive behavior problems are at risk for future negative outcomes, including antisocial behaviors, drug use, and school failure (Moffitt & Caspi, 2001; Shaw, Gilliom, Ingoldsby, & Nagin, 2003). Epidemiological data show that >20% of children in the United States exhibit subthreshold to clinical levels of symptoms (Egger & Angold, 2006), with males, non-Hispanic Black children, and children from low socioeconomic status (SES) households at highest risk (Perou et al., 2013). Most affected children do not receive behavioral services, particularly those in vulnerable and underserved groups (Le Cook, Barry, & Busch, 2013).
In this article, disruptive behavior problems include behaviors observed in oppositional defiant and conduct disorders, and subclinical presentations approaching their diagnostic thresholds. This definition distinguishes aggressive and oppositional behaviors from the inattentive and/or hyperactive symptoms of attention deficit/hyperactivity disorder (ADHD; Forehand, Jones, & Parent, 2013; Pelham & Fabiano, 2008). Disruptive behavior disorders are among the most impairing of child mental health issues, and they frequently co-occur with ADHD, anxiety disorders, and mood disorders (Egger & Angold, 2006). A specific focus on early-emerging disruptive behavior disorders in the primary care setting is warranted because they constitute one of the most prevalent psychosocial presenting problems in pediatric primary care, as well as the most common reason for referrals to integrated and specialty mental health services (Kolko & Perrin, 2014).
A growing literature demonstrates the feasibility and effectiveness of behavioral interventions delivered in pediatric primary care (Asarnow, Rozenman, Wiblin, & Zeltzer, 2015; Stancin & Perrin, 2014), suggesting that services in this setting could have a significant positive impact on disruptive behavior problems. Pediatric primary care is an optimal setting for screening and referral because (1) children are seen at regular intervals for well-child examinations before contact with the school system; (2) ∼25% of children presenting in primary care have significant psychosocial concerns (Williams, Klinepeter, Palmes, Pulley, & Foy, 2004); and (3) parents are receptive to seeking behavioral services in primary care, reporting less perceived stigma in this setting compared with traditional mental health settings (Kolko, Campo, Kilbourne, & Kelleher, 2012; Polaha, Williams, Heflinger, & Studts, 2015). Indeed, there are a variety of initiatives aimed at leveraging pediatricians’ opportunities to identify and respond to early-onset behavioral and developmental concerns (e.g., Bright Futures, 2014). For such services to reach those who traditionally have not accessed treatment, formal screening methods are needed that improve the accuracy of pediatricians’ identification of children and families in need of assessment and intervention (Sheldrick, Merchant, & Perrin, 2011).
To that end, standardized screening instruments such as the 35-item Pediatric Symptom Checklist (PSC; Jellinek et al., 1988) and the 100- to 120-item Child Behavior Checklist (CBCL; Achenbach & Edelbrock, 1981) have been disseminated to pediatricians to facilitate identification of psychosocial problems in primary care; Bright Futures (2014) recommends the PSC, while the American Academy of Pediatrics (2012) identifies the CBCL as one of several valuable tools for primary care. These instruments or their short-form versions—Gardner and colleagues’ (1999) PSC-17 and Zill’s (1990) Behavior Problems Index (BPI)—can be used in conjunction with effective on-site assessment and intervention services (Kolko & Perrin, 2014).
Each of these instruments is multidimensional, including subscales measuring behaviors relevant to disruptive behavior disorders, ADHD, and internalizing disorders such as depression and anxiety. A strong argument can be made for focused screening for disruptive behavior problems in the preschool age group, however. Compared with disruptive behavior problems, underidentification of ADHD is less problematic in pediatric primary care because a high proportion of pediatricians have incorporated appropriate screening protocols using ADHD-specific tools into their practices (Gardner, Kelleher, Pager, & Campo, 2004). Further, the treatment approaches for ADHD often rely on medication and increasingly fall under pediatricians’ usual practice (Subcommittee on Attention-Deficit/Hyperactivity Disorder, Steering Committee on Quality Improvement and Management, 2011). Parent-reported screening tools for internalizing problems such as depression and anxiety, on the other hand, are challenged by the intrapersonal nature of symptoms (e.g., worry, sadness, and anhedonia) that are often not recognized by parent informants (Luby, 2010). In contrast, a brief parent-report screening tool focused solely on early-emerging disruptive behavior problems could have significant impact, given the prevalence, impairment, amenability to intervention, comorbidities, and negative outcomes associated with this issue.
Given the need to improve identification of early-onset disruptive behavior problems, existing brief screening tools such as the PSC-17 and BPI are not optimal. Several items in these instruments are not clinically informative within the developmental context of preschool-aged children (Studts & van Zyl, 2013), meaning that these items contribute to error in measuring disruptive behaviors in this age group. Beyond these concerns about developmental appropriateness, several studies have reported disparities in results obtained from these instruments by sex (Jellinek et al., 1999; Parcel & Menaghan, 1988), race/ethnicity (Jutte, Burgos, Mendoza, Ford, & Huffman, 2003; Simonian & Tarnowski, 2001; Spencer, Fitch, Grogan-Kaylor, & McBeath, 2005), and SES (Jellinek et al., 1999). To prevent disparities in identification of children with early-emerging disruptive behavior problems, screening measures should demonstrate robust results across diverse samples (Kraemer, 1992; Ramirez, Ford, Stewart, & Teresi, 2005).
Finally, from an implementation perspective, even these short instruments may be too long for adoption in primary care settings. Primary care is notoriously fast paced, with multiple issues vying to be addressed in short visits (Cooper, Valleley, Polaha, Begeny, & Evans, 2006). Recent changes in health care policy have put a premium on improved assessment in primary care (Berwick, Nolan, & Whittington, 2008; Patient Protection and Affordable Care Act, 2010), placing an added burden on patients’ and providers’ time. Thus, there is demand for ultra-brief (i.e., ≤ 5 items) screening tools, as illustrated by advances in adult primary care screening for anxiety (Berle et al., 2011), depression (Arroll, Goodyear-Smith, Kerse, Fishman, & Gunn, 2005), and drug and alcohol misuse (Bradley et al., 2007). Indeed, there is evidence that four to six items may be sufficient for measuring most constructs (Hinkin, 1998), contrary to guidance based on classical test theory (CTT), which emphasizes the need for high numbers of items for reliability.
The overarching goal of this study was to advance the field toward an ultra-brief screening tool for the detection of early-onset disruptive behavior problems in preschool-aged children, one of the most prevalent and treatable psychosocial issues in pediatric primary care. The selection of items for such tools typically draws from pools of items used in existing instruments with careful attention to item-level measurement performance. Essential to this strategy are (a) the use of strong analytic methods and (b) consideration of item- and scale-level performance of the measure across critical sample characteristics (e.g., sex, race, and SES). Previous evaluations of the PSC-17 and the BPI have relied on CTT-based traditional psychometric analyses that were limited in their capacity to assess item-level measurement performance (Nunnally & Bernstein, 1994). In contrast, item response theory (IRT) offers applications and information unattainable with CTT-based methods (Hambleton, Swaminathan, & Rogers, 1991).
One valuable application of IRT methodology is the identification of items exhibiting differential item functioning (DIF), or item bias, in which responses to an item are affected not only by the level of the underlying construct but also by extraneous characteristics, such as sex, race, or SES (Teresi, 2006). If an item exhibits DIF, then systematic measurement error is present: the use of that item produces biased results influenced by factors other than the construct being measured. If items in a screening tool are biased by sex, race, or SES, then screening results may be related to demographic characteristics rather than the condition being screened for, potentially leading to disparities in identification across demographic groups. Use of IRT-based methods to detect and remove biased items has been reported for many instruments developed under the Patient-Reported Outcomes Measurement Information System initiative (e.g., Coster et al., 2016; Pilkonis et al., 2013; Teresi et al., 2009) and others (e.g., Engelhard et al., 2016; Van Nispen, Knol, Langelaan, & van Rens, 2011).
In a previous study, disruptive behavior problem items from the PSC-17 and BPI were examined for their utility within the developmental context of preschool-aged children (Studts & van Zyl, 2013). To further inform selection of a brief set of items from these instruments, the current study used IRT-based methods to test PSC-17 and BPI items for DIF by child sex, race, and SES. The study aim was to build on previous results and identify a subset of items that (a) are developmentally appropriate for preschool-aged children and (b) perform consistently across diverse sociodemographic subgroups seen in pediatric primary care. An ultra-brief, developmentally appropriate, unbiased screening tool could improve identification of, and intervention with, children with early-onset disruptive behavior problems in the primary care setting.
Method
Participants
Parents (or other guardians, including grandparents and other primary caregivers) of preschool-aged children (N = 900; 96% response rate) were recruited to participate from four sociodemographically diverse primary care settings. Eligible participants were ≥18 years; parents/guardians of a child between the ages of 3 and 5 years; and in attendance at pediatric primary care appointments. Exclusion criteria included already having responded to the survey and presenting for an emergency appointment.
Participants provided demographic and behavioral health information about themselves and the target children, summarized in Table I. Parent/guardian ages ranged from 18 to 78 years, with a mean of 31 years (SD = 8 years), while approximately equal numbers of children were 3 years (32%), 4 years (38%), and 5 years (29%) old. The prevalence of concern regarding child behavior problems (n = 232, 26%) and reported receipt of mental health services (n = 85, 10%) was comparable with rates in community or nonreferred samples reported in previous studies (Keenan & Wakschlag, 2004; Lavigne, Lebailly, Hopkins, Gouze, & Binns, 2009).
Variable | Frequency | (%) |
---|---|---|
Parent/guardian sex | | |
Male | 118 | (13) |
Female | 776 | (87) |
Parent/guardian race | | |
White | 491 | (55) |
Black | 375 | (42) |
Other | 32 | (3) |
Child sex | | |
Male | 472 | (53) |
Female | 424 | (47) |
Child race | | |
White | 450 | (50) |
Black | 362 | (40) |
Other<sup>a</sup> | 88 | (10) |
Child age | | |
3 years | 288 | (32) |
4 years | 342 | (38) |
5 years | 261 | (29) |
Socioeconomic status (SES) | | |
Low | 371 | (43) |
Medium | 285 | (33) |
High | 216 | (25) |
Parent/guardian believes child has behavior problems | 232 | (26) |
Child has seen a mental health provider | 85 | (10) |
Child has been prescribed medication(s) for behavior | 42 | (5) |
Note. Percentages do not include missing data and may not sum to 100% because of rounding. SES was operationalized by creating an index combining responses regarding household income level, parent education level, and type of health insurance. Parent/guardian ages ranged from 18 to 78 years, with a mean of 31 years (SD = 8 years).
<sup>a</sup> Two parents did not report their child’s race; 2 reported their child was American Indian; 6 reported their child was Asian; 17 reported their child was of Hispanic or Latino ethnicity but did not report race; and 61 parents described their child as multiracial.
Procedure
All study procedures were approved by the University of Louisville institutional review board. Recruitment occurred at various times of day and on various days of the week during 8 months of the school year. All participants provided informed consent, completed the survey in privacy, and were entered in a drawing to win one of five $100 gift cards.
Measures
The survey included a sociodemographic questionnaire, the PSC-17, and the BPI. The order of the behavior rating scales was counterbalanced to avoid response set or order bias.
Sociodemographic Questionnaire
Participant- and child-level data were obtained, including age, sex, race, level of household income, years of education completed, relationship to the child, and child’s type of health insurance. The child’s history of behavioral concerns and treatment was also assessed via parent report. For the purposes of DIF analyses, child SES was operationalized by summing ordinal responses regarding household income level, parent education level, and child’s type of health insurance, resulting in an SES index (possible scores = 0–6). In the low SES group (index scores from 0 to 2), 100% of parents/guardians reported having no more than a high school education or General Education Diploma (GED), 98% reported that their child was covered by Medicaid or K-CHIP, and 93% reported annual household incomes <$20,000. In the high SES group (index scores from 5 to 6), 89% of parents/guardians reported education beyond high school, 95% reported that their child was covered by private health insurance, and 71% reported annual household incomes >$50,000 (for reference, the median annual household income in Kentucky at the time of data collection was ∼$41,500; U.S. Census Bureau, 2009).
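The SES index described above can be sketched in a few lines of Python. This is only an illustration: the article does not report how each component was coded, so the 0/1/2 coding per component is an assumption, the function names are hypothetical, and the medium range (index scores 3 to 4) is inferred from the reported low (0 to 2) and high (5 to 6) ranges.

```python
def ses_index(income_level, education_level, insurance_level):
    """Sum three ordinal components into an index with possible scores 0-6.

    Assumption: each component is coded 0, 1, or 2 (not stated in the article).
    """
    for value in (income_level, education_level, insurance_level):
        if value not in (0, 1, 2):
            raise ValueError("each component is assumed to be coded 0, 1, or 2")
    return income_level + education_level + insurance_level


def ses_group(index):
    """Map an index score to the low / medium / high grouping."""
    if not 0 <= index <= 6:
        raise ValueError("index must be between 0 and 6")
    if index <= 2:
        return "low"
    if index <= 4:  # inferred: scores between the reported low and high ranges
        return "medium"
    return "high"


def dif_ses_group(index):
    """Collapse to the two groups compared in the DIF analyses."""
    return "low" if ses_group(index) == "low" else "medium/high"


# Hypothetical respondent: lowest income band, middle education and insurance codes
example_index = ses_index(0, 1, 1)
example_group = ses_group(example_index)
```

Keeping the three-level grouping separate from the two-level DIF contrast mirrors how the article reports SES descriptively in three groups but tests DIF for low versus combined medium/high SES.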
Pediatric Symptom Checklist-17
This brief version of the PSC was developed for use in pediatric clinics to screen children ages 4–16 years for psychosocial problems (Gardner et al., 1999). Parents/guardians rate their child on 17 items using a 3-point Likert-type scale (0 = never, 1 = sometimes, and 2 = often). Item responses are summed to compute a total score, with higher scores indicating higher levels of dysfunction. The PSC-17 includes a 7-item externalizing subscale comprising items measuring disruptive behavior problems (i.e., not ADHD items).
Behavior Problems Index
The BPI (Peterson & Zill, 1986; Zill, 1990) was developed for use in national longitudinal surveys to measure behavioral problems in children and was standardized on a random sample of 6,000 children (Baker, Keck, Mott, & Quinlan, 1993). Its items were derived from the CBCL (Achenbach & Edelbrock, 1981). Parents/guardians rate their preschool-aged child on 26 items using a 3-point Likert-type scale (0 = not true, 1 = sometimes true, and 2 = often true). Total scores are computed by summing item responses. Higher scores indicate higher levels of dysfunction. For the current study, 11 items measuring disruptive behaviors were used, comprising the headstrong subscale (five items), the antisocial subscale (four items), and the peer problems subscale (two items after excluding “withdrawn from peers”).
Statistical Analyses
The seven externalizing subscale items of the PSC-17 and the 11 disruptive behaviors items of the BPI were combined into a single set of candidate items for a new screening tool so that patterns of responses to all 18 items could be considered in IRT analyses. Steps in the IRT analyses included (a) testing IRT model assumptions; (b) fitting an IRT model to the data to obtain estimates of item and scale characteristics; and (c) individually testing each item for DIF by child sex, race, and SES, using two complementary procedures.
Samejima’s (1969) two-parameter graded response model (GRM) was fit to the responses to the 18 candidate items using IRTPRO software (Scientific Software International, Inc.). The GRM is appropriate for use with items with ordinal response categories, as used in the PSC-17 and BPI. For each item, the GRM generated two difficulty parameter estimates (i.e., what levels of disruptive behavior problems are needed for a respondent to select sometimes vs. not true/never and to select often/often true vs. sometimes) and one discrimination parameter estimate (i.e., how well the item can discriminate between respondents reporting similar levels of disruptive behavior problems). Difficulty parameters are on a scale in which the mean level of disruptive behavior problems is set at 0, and levels typically range from about 3 SDs below the mean to 3 SDs above the mean (i.e., a difficulty parameter of 1.5 reflects a level of disruptive behavior problems 1.5 SDs above the mean) (Hambleton et al., 1991).
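To make the roles of the discrimination and difficulty parameters concrete, the following minimal sketch computes GRM response-category probabilities for a single three-category item. It assumes the standard logistic form of the model with no 1.7 scaling constant; the function name is hypothetical, and the parameter values are the male-group estimates for "refuses to share" reported in Table III, used purely for illustration.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under a graded response model.

    theta      -- latent level of disruptive behavior problems (SD units)
    a          -- discrimination parameter
    thresholds -- increasing difficulty parameters, e.g. [b1, b2]
    """
    # Boundary curves P(X >= k | theta), padded with 1 (k = 0) and 0 (k = K).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent boundary curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Probabilities of "never", "sometimes", "often" at the sample mean (theta = 0)
probs = grm_category_probs(theta=0.0, a=1.30, thresholds=[-0.90, 1.95])
```

At theta equal to the first difficulty parameter, the probability of choosing "sometimes" or higher is exactly one half, which is the sense in which a difficulty parameter marks the level needed to move up a response category.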
Model fit was evaluated in IRTPRO using two standard approaches: (a) item-level S-χ² diagnostic statistics, which test whether observed and model-based expected proportions of responses to each item’s response options differ significantly; and (b) the root mean square error of approximation (RMSEA), a conservative absolute measure of overall model fit that indicates how far the model is from perfect model–data fit (MacCallum, Browne, & Sugawara, 1996). For the item-level S-χ² statistics, the majority of items should show no significant difference between observed and expected proportions (i.e., most p-values > .05); for RMSEA, lower values suggest better overall model–data fit, with RMSEA < .05 indicating good fit.
Finally, two broad types of DIF can bias item-level measurement (see Table II): uniform DIF (consistent group differences in item difficulty parameters, controlling for level of behavior problems) and nonuniform DIF (group differences in item discrimination parameters, with or without differences in item difficulty parameters, controlling for level of behavior problems). Because more than one DIF detection method can be used to identify each type of DIF, but different methods can produce different results, two established procedures were used and the results from each method were compared (Teresi, 2006). Comparisons of interest were for male children versus female children; for White children versus minority children; and for low SES children versus the group of combined medium/high SES children. Analyses of DIF by child race controlled for child SES, and vice versa. For each DIF detection approach, a Bonferroni correction for conducting comparisons with 18 items was used to preserve overall α at .05 with p < .0027 for significance (i.e., .05/18 ≈ .0027).
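The Bonferroni-adjusted threshold can be checked with a short calculation. This is a sketch with a hypothetical function name; note that .05/18 is .00278 to five decimal places, so the reported cutoff of .0027 corresponds to truncating rather than rounding that value.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance level that keeps the family-wise alpha at or below family_alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


# 18 candidate items tested within each DIF detection approach
per_item_alpha = bonferroni_alpha(0.05, 18)
print(f"{per_item_alpha:.5f}")  # prints 0.00278
```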
Type | Brief description | Example |
---|---|---|
Uniform DIF | Controlling for level of disruptive behavior problems, groups differ significantly on a specific item’s difficulty parameters, but not on discrimination parameters. This difference in difficulty parameters is constant and in the same direction along the spectrum of levels of disruptive behavior problems. | Higher levels of disruptive behavior problems are needed for parents/guardians of minority children to select higher item response options, compared with parents/guardians of White children. |
Nonuniform DIF | Controlling for level of disruptive behavior problems, groups differ significantly on a specific item’s discrimination parameter (noncrossing nonuniform DIF) or on both the discrimination parameter and the difficulty parameters (crossing nonuniform DIF). | At low levels of disruptive behavior problems, lower levels are needed for parents/guardians of minority children versus White children to select higher item response options. At high levels of disruptive behavior problems, this difference is in the opposite direction: higher levels are needed for parents/guardians of minority children versus White children to select higher item response options. |
Note. DIF = differential item functioning.
The first method for DIF detection was the IRT-based likelihood ratio test (IRT-LR; Thissen, 2001). The IRT-LR test facilitates identification of both uniform and nonuniform DIF in items yielding different parameter estimates for reference and focal groups. Statistically significant likelihood ratio statistics indicated improved model fit when a given item’s parameters (difficulty, discrimination, or both) were permitted to vary between groups. The IRT-LR DIF detection method was implemented using IRTLRDIF software (Thissen, 2001).
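The logic of a likelihood ratio comparison between a constrained model (item parameters equal across groups) and a freer model (parameters allowed to differ) can be sketched as follows. This is not the IRTLRDIF implementation: the function names and log-likelihood values are hypothetical, and the degrees of freedom are an assumption based on counting freed parameters (3 when a, b1, and b2 are all freed for a three-category item). The chi-square tail probability uses closed forms that hold only for 1, 2, or 3 degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square probability (closed forms for df = 1, 2, 3 only)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        tail = math.erfc(math.sqrt(x / 2.0))
        return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("this sketch only covers df = 1, 2, or 3")


def likelihood_ratio_test(loglik_constrained, loglik_free, df):
    """Return the LR statistic and p-value for two nested models."""
    statistic = -2.0 * (loglik_constrained - loglik_free)
    return statistic, chi2_sf(statistic, df)


# Hypothetical log-likelihoods for one studied item (not taken from the article)
statistic, p_value = likelihood_ratio_test(-8450.2, -8441.0, df=3)
flagged = p_value < 0.05 / 18  # Bonferroni-adjusted threshold for 18 items
```

A significant statistic means that letting the studied item's parameters differ between the reference and focal groups improves fit, which is the signal of DIF described above.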
The second method for detecting DIF was the ordinal logistic regression approach (OLR), developed by Crane and colleagues (2004). For this approach, three nested OLR models were fit for each item: (a) a model including the main effect of level of disruptive behavior problems only; (b) a model including the main effects of level of disruptive behavior problems as well as of group membership; and (c) a model including both main effects plus their interaction effect. A statistically significant main effect of group indicated uniform DIF, and a statistically significant interaction between group and level of disruptive behavior problems indicated nonuniform DIF. The OLR analyses were conducted using IBM SPSS for Windows, Version 20.0.
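The three nested models can be written out in a standard proportional-odds form. The notation here is an assumption introduced for illustration and does not appear in the article: Y is the item response, k = 1, 2 indexes the boundaries between the three response categories, θ is the level of disruptive behavior problems, and G is a 0/1 group indicator. Sign conventions for the intercepts differ across software packages.

```latex
\begin{align*}
\text{(a)}\quad \operatorname{logit} P(Y \ge k \mid \theta, G) &= \alpha_k + \beta_1 \theta \\
\text{(b)}\quad \operatorname{logit} P(Y \ge k \mid \theta, G) &= \alpha_k + \beta_1 \theta + \beta_2 G \\
\text{(c)}\quad \operatorname{logit} P(Y \ge k \mid \theta, G) &= \alpha_k + \beta_1 \theta + \beta_2 G + \beta_3 (\theta \times G)
\end{align*}
```

In this notation, uniform DIF corresponds to β₂ ≠ 0 (a constant shift between groups at every level of θ), and nonuniform DIF corresponds to β₃ ≠ 0 (a group difference that changes with θ).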
Items identified with statistically significant DIF at the adjusted level of significance by either method were flagged. Item parameters were reestimated for all 18 items, allowing the parameters of items with DIF to vary by group membership. For each flagged item, mean group differences in parameter reestimates were assessed to characterize the amount of DIF observed and determine its possible impact on measurement of disruptive behavior problems in each subgroup (Steinberg & Thissen, 2006).
Results
Unidimensionality of the 18 combined candidate items measuring disruptive behaviors was supported by (a) strong internal consistency reliability (α = .89); (b) principal axis factoring resulting in a first factor eigenvalue (6.53) that was 5.05 times the second factor eigenvalue (1.29), exceeding the criterion of 5 times suggested by Hambleton and colleagues (1991) for a dominant single factor; and (c) single factor structure coefficients ranging from .54 to .80. While 2 of the 18 S-χ² values had p < .05, the RMSEA value (0.04) suggested good overall model fit.
Results of DIF analyses by child sex, race, and SES are summarized in Table III. The IRT-LR method identified no items with DIF by child sex. However, the OLR approach identified uniform DIF by child sex in two items: “refuses to share” (PSC-17 4) and “breaks/destroys things” (BPI 22). Parents/guardians of female children had significantly higher odds of selecting higher response options for “refuses to share” than parents/guardians of male children, controlling for underlying level of disruptive behavior problems. In contrast, parents/guardians of female children had significantly lower odds of selecting higher response options for “breaks/destroys things” than those of male children, controlling for underlying level of disruptive behavior problems.
Item | Short wording | Male a (se) | Male b1 (se) | Male b2 (se) | Female a (se) | Female b1 (se) | Female b2 (se) |
---|---|---|---|---|---|---|---|
PSC-17 4<sup>a</sup> | Refuses to share | 1.30 (0.08) | −0.90 (0.12) | 1.95 (0.17) | 1.30 (0.08) | −1.35 (0.13) | 1.81 (0.18) |
BPI 22<sup>a</sup> | Breaks/destroys things | 1.90 (0.14) | 0.47 (0.08) | 1.75 (0.14) | 1.90 (0.14) | 0.78 (0.10) | 2.07 (0.18) |

Item | Short wording | White a (se) | White b1 (se) | White b2 (se) | Minority a (se) | Minority b1 (se) | Minority b2 (se) |
---|---|---|---|---|---|---|---|
PSC-17 10<sup>b</sup> | Blames others | 1.24 (0.18) | −0.18 (0.12) | 2.30 (0.30) | 1.75 (0.21) | 0.03 (0.09) | 1.58 (0.18) |
PSC-17 14<sup>a,c</sup> | Teases others | 1.38 (0.10) | 0.32 (0.11) | 2.86 (0.28) | 1.38 (0.10) | −0.10 (0.10) | 2.20 (0.21) |
BPI 3<sup>a,c</sup> | High strung | 1.15 (0.07) | 0.33 (0.13) | 2.20 (0.20) | 1.15 (0.07) | 0.95 (0.14) | 2.68 (0.23) |
BPI 6<sup>a,c</sup> | Argues too much | 1.50 (0.09) | −0.84 (0.11) | 1.10 (0.12) | 1.50 (0.09) | −0.22 (0.10) | 1.60 (0.15) |

Item | Short wording | Low SES a (se) | Low SES b1 (se) | Low SES b2 (se) | Medium/high SES a (se) | Medium/high SES b1 (se) | Medium/high SES b2 (se) |
---|---|---|---|---|---|---|---|
BPI 3<sup>c</sup> | High strung | 1.24 (0.08) | 0.84 (0.14) | 2.54 (0.22) | 1.24 (0.08) | 0.40 (0.11) | 2.05 (0.19) |
BPI 4<sup>c</sup> | Cheats/lies | 1.26 (0.08) | −0.38 (0.13) | 1.79 (0.18) | 1.26 (0.08) | −0.38 (0.11) | 2.50 (0.24) |
BPI 18<sup>c</sup> | Stubborn, sullen, or irritable | 1.55 (0.09) | −0.69 (0.11) | 1.48 (0.15) | 1.55 (0.09) | −1.03 (0.10) | 1.21 (0.13) |
. | . | Male . | Female . | ||||
---|---|---|---|---|---|---|---|
Item . | Short wording . | a (se) . | b1 (se) . | b2 (se) . | a (se) . | b1 (se) . | b2 (se) . |
PSC-17 4a | Refuses to share | 1.30 (0.08) | −0.90 (0.12) | 1.95 (0.17) | 1.30 (0.08) | –1.35 (0.13) | 1.81 (0.18) |
BPI 22a | Breaks/destroys things | 1.90 (0.14) | 0.47 (0.08) | 1.75 (0.14) | 1.90 (0.14) | 0.78 (0.10) | 2.07 (0.18) |
White | Minority | ||||||
a (se) | b1 (se) | b2 (se) | a (se) | b1 (se) | b2 (se) | ||
PSC-17 10b | Blames others | 1.24 (0.18) | −0.18 (0.12) | 2.30 (0.30) | 1.75 (0.21) | 0.03 (0.09) | 1.58 (0.18) |
PSC-17 14a,c | Teases others | 1.38 (0.10) | 0.32 (0.11) | 2.86 (0.28) | 1.38 (0.10) | –0.10 (0.10) | 2.20 (0.21) |
BPI 3a,c | High strung | 1.15 (0.07) | 0.33 (0.13) | 2.20 (0.20) | 1.15 (0.07) | 0.95 (0.14) | 2.68 (0.23) |
BPI 6a,c | Argues too much | 1.50 (0.09) | −0.84 (0.11) | 1.10 (0.12) | 1.50 (0.09) | –0.22 (0.10) | 1.60 (0.15) |
Low SES | Medium/high SES | ||||||
a (se) | b1 (se) | b2 (se) | a (se) | b1 (se) | b2 (se) | ||
BPI 3c | High strung | 1.24 (0.08) | 0.84 (0.14) | 2.54 (0.22) | 1.24 (0.08) | 0.40 (0.11) | 2.05 (0.19) |
BPI 4c | Cheats/lies | 1.26 (0.08) | −0.38 (0.13) | 1.79 (0.18) | 1.26 (0.08) | –0.38 (0.11) | 2.50 (0.24) |
BPI 18c | Stubborn, sullen, or irritable | 1.55 (0.09) | −0.69 (0.11) | 1.48 (0.15) | 1.55 (0.09) | –1.03 (0.10) | 1.21 (0.13) |
Note. DIF = differential item functioning; SES = socioeconomic status; a = item discrimination parameter; b1 = lower item difficulty threshold parameter; b2 = upper item difficulty threshold parameter; se = standard error; IRT-LR = item response theory-based likelihood ratio test method. OLR = ordinal logistic regression method; BPI = Behavior Problems Index; PSC = Pediatric Symptom Checklist. The significance level for DIF using each method was set using a Bonferroni correction, adjusted for analyses of 18 items (p < .0027). OLR analyses investigating race controlled for SES, and those investigating SES controlled for race. Item difficulty parameters are on a scale where the mean level of disruptive behavior problems is set at 0 and difficulty levels range from −3 SDs to +3 SDs.
Uniform DIF detected by OLR.
Nonuniform DIF detected by OLR.
Uniform DIF detected by IRT-LR.
Three items were identified with significant uniform DIF by child race (controlling for SES) using both the IRT-LR and OLR methods: “teases others” (PSC-17 14), “high strung” (BPI 3), and “argues too much” (BPI 6). Each exhibited DIF in the difficulty parameters between White children and minority children. Compared with parents/guardians of White children and controlling for underlying level of disruptive behavior problems and SES, parents/guardians of minority children had significantly higher odds of selecting higher response options for “teases others,” but significantly lower odds of selecting higher response options for “high strung” and “argues too much.” In addition, the OLR approach detected one item with significant nonuniform DIF by child race: “blames others” (PSC-17 10), suggesting that differences in responses between parents/guardians of White and minority children varied significantly along the continuum of disruptive behavior problems.
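The contrast between uniform and nonuniform DIF can be illustrated with the Table III estimates. The sketch below assumes a graded-response-style logistic form for the probability of responding at or above the lower threshold; that functional form, the function names, and the chosen trait levels are assumptions made for illustration, not the authors' analysis code.

```python
import math

def p_at_or_above(theta, a, b):
    """Cumulative logistic probability of responding at or above a threshold b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# (a, b1) estimates from Table III for each group
teases = {"White": (1.38, 0.32), "Minority": (1.38, -0.10)}  # uniform DIF
blames = {"White": (1.24, -0.18), "Minority": (1.75, 0.03)}  # nonuniform DIF

def gap(item, theta):
    """Minority minus White probability of selecting at least the middle option."""
    a_w, b_w = item["White"]
    a_m, b_m = item["Minority"]
    return p_at_or_above(theta, a_m, b_m) - p_at_or_above(theta, a_w, b_w)

# Illustrative trait levels in SD units (0 = mean level of disruptive behavior)
for theta in (-1.0, 0.0, 1.0, 2.0):
    print(f"theta={theta:+.1f}  teases gap={gap(teases, theta):+.3f}  "
          f"blames gap={gap(blames, theta):+.3f}")
```

Under these assumptions, the gap for “teases others” keeps the same sign at every trait level (equal discrimination, shifted threshold), whereas the gap for “blames others” changes sign as the trait level increases, because the discrimination parameters differ between groups.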
The IRT-LR method identified three items with significant uniform DIF by child SES (controlling for race): “high strung” (BPI 3), “cheats/lies” (BPI 4), and “stubborn, sullen, or irritable” (BPI 18). Each demonstrated significant differences in difficulty parameters between low SES and medium/high SES participants. No significant uniform or nonuniform DIF was detected by SES using OLR.
Overall, 8 of 18 items were identified with at least one type of statistically significant DIF at the Bonferroni-corrected level of significance: five items by a single DIF detection method (“refuses to share,” “blames others,” “cheats/lies,” “stubborn, sullen, or irritable,” and “breaks/destroys things”) and three items by both methods (“teases others,” “high strung,” and “argues too much”). Table III presents the item parameter estimates obtained after recalibrating items allowing for differences between groups. The magnitude of differences between group parameter estimates ranged from small (0.14 SD) to large and clinically significant (0.72 SD; Teresi et al., 2009).
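The arithmetic behind the corrected significance level and the reported range of group differences can be sketched as follows. The familywise level of .05 is an assumption (it is implied by, but not stated in, the reported cutoff of p < .0027), and comparing only the threshold (b) parameters while setting aside thresholds that are identical across groups is likewise an assumption about how the range was obtained.

```python
# Bonferroni adjustment for 18 item-level tests
n_items = 18
alpha = 0.05                  # assumed familywise level
alpha_adj = alpha / n_items   # about 0.0028; reported as p < .0027

# (b1, b2) thresholds per group, copied from Table III
thresholds = {
    "PSC-17 4 (sex)":   ((-0.90, 1.95), (-1.35, 1.81)),
    "BPI 22 (sex)":     ((0.47, 1.75), (0.78, 2.07)),
    "PSC-17 10 (race)": ((-0.18, 2.30), (0.03, 1.58)),
    "PSC-17 14 (race)": ((0.32, 2.86), (-0.10, 2.20)),
    "BPI 3 (race)":     ((0.33, 2.20), (0.95, 2.68)),
    "BPI 6 (race)":     ((-0.84, 1.10), (-0.22, 1.60)),
    "BPI 3 (SES)":      ((0.84, 2.54), (0.40, 2.05)),
    "BPI 4 (SES)":      ((-0.38, 1.79), (-0.38, 2.50)),
    "BPI 18 (SES)":     ((-0.69, 1.48), (-1.03, 1.21)),
}

# Absolute between-group difference for each threshold, in SD units
diffs = {
    name: [round(abs(x - y), 2) for x, y in zip(g1, g2)]
    for name, (g1, g2) in thresholds.items()
}
all_diffs = [d for ds in diffs.values() for d in ds]
nonzero = [d for d in all_diffs if d > 0]

print(f"adjusted alpha = {alpha_adj:.5f}")
print(f"smallest nonzero difference = {min(nonzero):.2f} SD")
print(f"largest difference = {max(all_diffs):.2f} SD")
```

With the Table III values, the smallest nonzero and largest threshold differences computed this way match the 0.14 SD and 0.72 SD reported above.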
Discussion
The goal of this study was to inform the development of an ultra-brief screening tool for early-onset behavior problems by identifying item-level measurement bias by child sex, race, and SES among candidate items. Item-level measurement bias, or DIF, is problematic because it means that a given item functions differently for different subgroups, independent of level of the construct being measured. DIF is especially concerning in brief measures, where the number of items is small, and the relative impact of each item’s score on the total is large. An unbiased, developmentally appropriate ultra-brief screening tool could improve the sensitivity and specificity of pediatricians’ identification of disruptive behavior problems among diverse preschool-aged children in primary care, which are currently suboptimal (Sheldrick et al., 2011).
Item and Scale-Level DIF
Avoiding DIF altogether in the construction of screening instruments is important because various combinations of items demonstrating bias may have differing effects on scale-level measurement for affected groups of respondents (Wainer, 1995). In total, 8 of 18 investigated disruptive behavior items from the PSC-17 and the BPI exhibited statistically significant DIF by child sex, race, or SES, with magnitudes ranging from clinically nonsignificant to clinically meaningful. Notably, several of the items exhibiting DIF by child sex, race, and SES did so in the same directions. Especially in brief instruments, scores may be increased or decreased if multiple items exhibit DIF by a given characteristic in the same direction, leading to sociodemographically driven over- or underidentification of children in need of assessment and intervention (Millsap, 2007).
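To make the scale-level concern concrete, the following sketch sums expected item scores for two items that showed race DIF in the same direction (“high strung” and “argues too much”), using the Table III estimates. It assumes a graded-response-style logistic form and 0/1/2 item scoring; both are illustrative assumptions, as are the function names.

```python
import math

def expected_item_score(theta, a, b1, b2):
    """Expected score on a 0/1/2 item: P(X >= 1) + P(X >= 2)."""
    def p_at_or_above(b):
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return p_at_or_above(b1) + p_at_or_above(b2)

# (a, b1, b2) estimates from Table III
items = {
    "High strung (BPI 3)": {
        "White": (1.15, 0.33, 2.20), "Minority": (1.15, 0.95, 2.68)},
    "Argues too much (BPI 6)": {
        "White": (1.50, -0.84, 1.10), "Minority": (1.50, -0.22, 1.60)},
}

def expected_total(group, theta):
    """Expected summed score across both items for one group at a given trait level."""
    return sum(expected_item_score(theta, *params[group])
               for params in items.values())

theta = 0.0  # same underlying level of disruptive behavior for both groups
white = expected_total("White", theta)
minority = expected_total("Minority", theta)
print(f"White: {white:.2f}  Minority: {minority:.2f}  gap: {white - minority:.2f}")
```

Under these assumptions, two children at the same average trait level differ by roughly half a point in expected score on a 0 to 4 two-item total, which is the kind of cumulative shift that can move a child across a screening cutoff.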
Selection of DIF-Free Items for Screening
Of 18 candidate items, 10 performed consistently across child sex, race, and SES when analyzed using two different DIF detection methods. For these 10 items, results suggested that the relationship between the latent construct of disruptive behavior problems and the content of each DIF-free item was not biased by the investigated sociodemographic characteristics.
The results of the present study augment prior research identifying promising items for an ultra-brief screening tool for early-emerging behavioral problems. Previously, 8 of the 18 candidate items measuring disruptive behaviors were identified as providing high levels of measurement information within the developmental context of preschool-aged children (see Table IV; Studts & van Zyl, 2013). These items required subclinical to clinical levels of disruptive behavior problems for caregivers to select “often” rather than “sometimes” in describing their child. In addition to these eight promising screening items, “does not understand other people’s feelings” (PSC-17 5), while slightly less informative than the others, is relevant to the Callous and Unemotional specifier for conduct disorder (American Psychiatric Association, 2013). These nine items were identified as promising candidates for a brief screening tool for early identification of disruptive behavior problems in preschool-aged children.
Item | Wording | DIF: Sex | DIF: Race | DIF: SES | Included in final tool |
---|---|---|---|---|---|
PSC-17 5 | Does not understand other people’s feelings | | | | X |
PSC-17 8 | Fights with other children | | | | X |
PSC-17 10 | Blames others | | X | | |
PSC-17 16 | Takes things that do not belong to him/her | | | | X |
BPI 9 | Bullies/is cruel to others | | | | X |
BPI 11 | Lack of remorse after misbehavior | | | | X |
BPI 12 | Difficulty getting along with other children | | | | X |
BPI 15 | Not liked by other children | | | | X |
BPI 22 | Deliberately breaks/destroys things | X | | | |

Note. DIF = differential item functioning; SES = socioeconomic status; BPI = Behavior Problems Index; PSC = Pediatric Symptom Checklist. Developmentally appropriate candidate screening items were identified in Studts & van Zyl, 2013. Tests for DIF compared item parameters by sex (male vs. female children), race (White vs. minority children), and SES (low vs. medium/high SES combined). Items were flagged with statistically significant DIF after implementation of a Bonferroni correction for each method, adjusted for analyses of 18 items (p < .0027).
However, while these nine items may target clinically significant disruptive behavior problems among children ages 3–5 years, two of them exhibited statistically significant DIF in the current study: “deliberately breaking/destroying things” (BPI 22) and “blaming others” (PSC-17 10). When used with diverse populations of young children seen in pediatric primary care, inclusion of items with DIF has implications for scoring, development of norms, and validity of results; thus, DIF-free items are preferable. Results of the current study suggest seven developmentally appropriate, DIF-free items for use in screening, highlighted in Table IV. These items are currently being assessed in a validity study involving parents/guardians of preschool-aged children followed in pediatric primary care.
Limitations
Results should be interpreted in the context of study limitations. First, all items relied on parent or guardian report. Sole reliance on parent report is inadequate in diagnostic assessment, but is routine and standard in behavioral screening with young children; its use in this study is congruent with real-world practice. Second, the 18 assessed items were drawn from only two existing instruments and may not include all disruptive behaviors salient to screening preschool-aged children. Those analyzed, however, include developmentally and clinically informative items (Studts & van Zyl, 2013) and were drawn from previously established, publicly available instruments. Third, child race was coarsely categorized in this study: DIF analyses compared non-Hispanic White with minority children. Minority children were mostly Black but also included Asian, American Indian, and multiracial children, as well as children of Hispanic or Latino ethnicity. While this decision prevents specific conclusions regarding item performance within individual racial or ethnic minority subgroups, these analyses demonstrate that even when using a somewhat heterogeneous comparison group, DIF is still problematic in items used in standard full-length behavioral screening measures.
Fourth, the study’s large sample size (N = 900) may have increased the likelihood of type I error. However, a Bonferroni correction (a conservative approach compared with other methods of adjusting for multiple comparisons) was used for each DIF detection approach, and effect sizes were considered in addition to significance testing. Finally, no external diagnostic criterion (e.g., full behavioral assessment) was used in this study, prohibiting conclusions regarding criterion-related validity of the investigated items. Additional psychometric assessment of the seven selected items is needed to establish their reliability, validity, and classification accuracy in screening diverse populations of young children, including racial, ethnic, and SES subgroups.
Conclusions and Future Directions
In summary, 8 of 18 items measuring disruptive behavior problems in preschool-aged children exhibited small to large degrees of statistically significant DIF by child sex, race, or SES. Differential parent/guardian responses to particular items could be related to sociodemographic characteristics of the child or family; for example, group norms, cultural issues, or societal expectations may influence the perceived acceptability of target behaviors, leading to over- or underreporting within subgroups (Kagan, Snidman, McManis, Woodward, & Hardway, 2002; Simonian & Tarnowski, 2001). In contrast, actual differences in child behaviors or attributes could exist between certain groups, captured by disparate responses to items measuring such behaviors. Other contributing factors, such as idiosyncratic item wording, participant literacy, or other unmeasured child or family characteristics, are possible as well.
While this study was not designed to investigate possible explanations for DIF by child sex, race, and SES, future studies could explore the etiology of the observed disparities in item performance. For example, higher response options to “refuses to share” were selected by parents/guardians of girls versus boys, controlling for underlying level of disruptive behavior problems. Sharing may be a social behavior expected more of girls than of boys (Maccoby, 1988), leading to heightened sensitivity among girls’ parents when difficulty sharing is observed in otherwise behaviorally typical children. Alternatively, the frequency and ease of girls’ and boys’ sharing may vary and be differentially related to levels of disruptive behavior problems. Similar questions regarding observed DIF by child race and SES could be explored in future research.
Regardless of the causes of the DIF identified in these analyses, biased items are particularly undesirable in a brief screening instrument targeting the diverse population of young children seen in pediatric primary care settings. The reliability and validity of the set of seven DIF-free, developmentally appropriate items identified in this study are currently being assessed, and future studies should investigate its psychometric properties in specific sociodemographic subgroups, including groups underrepresented in this study (e.g., children of Hispanic or Latino ethnicity). If its psychometric properties are strong, future work will address the feasibility of implementing use of this instrument in pediatric primary care. Answering these critical questions constitutes the next step in advancing screening and referral practice for early-emerging disruptive behavior problems in pediatric primary care.
Acknowledgments
The authors wish to thank the physicians, staff, and patients of the University of Louisville Department of Pediatrics and Oldham County Pediatrics, as well as Gerard M. Barber, PhD; James Clark, PhD; Andrew J. Frey, PhD; V. Faye Jones, MD; John V. Lavigne, PhD; Carl G. Leukefeld, DSW; and Richard Milich, PhD, for their support. The authors also acknowledge the invaluable data collection assistance of Judith Friedrich, PhD, Demeka Campbell, MD, and Cynthia Bowman-Stroud, MD.
Funding
This work was supported by the National Center for Advancing Translational Sciences via an award (8KL2TR000116 to C.R.S.) from the University of Kentucky CTSA (UL1TR000117). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Conflicts of interest: None declared.
References