Validation of a novel online depression symptom severity rating scale: the R8 Depression

Background An automated web-based assessment and monitoring system (www.psynary.com) has been developed to assist non-specialist clinicians in managing common mood and anxiety disorders. Psynary promotes the use of standardised outcome measures to assess symptom severity and optimise treatments with the aim of improving outcomes and enabling faster recovery. This paper analyses the results from two parallel studies in New Zealand and Japan (OptiMA-1 NZ and Japan) to assess the validity of the R8 Depression scale, one of the system’s core outcome measures. Methods Clinical samples were recruited from a public secondary care and a private psychiatry clinic. Participants completed the outcome measures for the study via the online Psynary system. The R8 Depression scale is a 30-item questionnaire which includes all symptom domains covered in the ICD-10 classification of depression. The Patient Health Questionnaire (PHQ-9) was completed at the same time points as the R8 Depression, with a smaller sample also completing a paper-based Quick Inventory of Depressive Symptomatology (QIDS-SR16). Internal validity was quantified via Cronbach’s alpha and Guttman lower bounds method. External validation against the PHQ-9 and QIDS used the Pearson’s and Kendall’s correlation coefficients. Severity categories were set using a multivariate regression model. Results 270 patients participated in the study and completed a maximum of 1 baseline and 5 reviews within a 90-day period, giving a total of 1124 assessments with the PHQ-9 also being completed in 1053 of these assessments. R8 Depression normative data was also collected from 204 non-clinical volunteers with 187 of these also completing the PHQ9. Internal reliability scores were all higher than 0.9 (n = 1328). There was overall good external validity when comparing the R8 Depression to the PHQ-9, with a correlation of 0.91 for the combined normative and clinical samples (n = 1240). Conclusions The R8 Depression has been developed as a patient-rated outcome measure for depression for administration on an online system called “Psynary”. It has high internal and external validity against current widely used scales. Further work is underway to determine the sensitivity to change of the R8 Depression.


Background
Depression is often a recurrent or chronic condition across the lifespan [1]. As well as the direct impact on health, depression has enormous direct and indirect costs for the individual, their families and society. Depression is the leading cause of sickness absence from work in industrialised regions such as Europe [2], and the annual economic burden in the US in 2010 estimated at $210 billion, 50% of which was workplace-related [3]. The prevalence of common mood and anxiety disorders in primary care far exceeds the availability of mental health specialists, and there is growing awareness of the need to look beyond the mental health workforce to meet treatment demands [4].
The goal of treatment for depression is complete remission from depressive symptoms. Achieving remission is crucial as residual depressive symptoms are the strongest predictor of early relapse and are strongly associated with poorer functional outcomes [5]. Achieving earlier remission from a depressive episode may be associated with reductions in the enormous indirect economic costs of the condition [4,6]. However, STAR*D, the largest clinical trial examining outcomes for treatments of depression, highlighted that only a third of patients achieved remission on first-line treatments and up to four successive trials of different regimes were required to double the remission rate [7]. Unfortunately, clinicians often treat depression sub-optimally [8]. Where treatment is initiated, clinicians often wait until at least 6 weeks prior to attempting optimisation of dosages or changing medications, with patients sometimes left on ineffective or even harmful treatments for longer [9]. Optimising treatment may take many months, with the lost opportunity of potentially achieving earlier remission and functional recovery, and a failure to realise the potential indirect economic savings for society.
A way of improving the selection of modalities of treatment is by measuring symptom severity using standardised clinical measures [10]. Measuring detailed symptom outcomes enables detection of response to treatment as early as 1 week after initiation [11]. If a 20% improvement in depressive symptoms were not detected within 2 weeks from the start of the treatment, only 11% would respond to that treatment if it were maintained for longer [12]. Providing clear feedback to patients on their outcomes may also enhance treatment response [13]. Therefore, the routine use of detailed outcome measures offers the potential for prompt optimisation and personalisation of treatments for depression and faster functional recovery. Nevertheless, the adoption of such measures in busy clinics remains low [14]. This could be due to time constraints, limited awareness or knowledge of the clinician, and/or availability of the scales (including licensing restrictions, costs).
To address these barriers to translating insights from clinical trials into routine clinical practice, an international collaboration of specialist psychiatrists and IT engineers have developed an automated online assessment and outcome monitoring system (www. psyna ry. com), which enables routine collection of detailed patient-rated outcomes to accelerate the cycle of treatment optimisation [15]. Such a platform is readily scalable to extend specialist expertise to primary healthcare settings, particularly to support non-specialist clinicians to meet the enormous clinical need associated with common mood and anxiety disorders. The system has been developed and piloted in public and private specialist clinics in New Zealand and Japan. It is free to access and use.
This paper describes the outcomes from two parallel studies designed to validate the primary depression outcome measure, the R8 Depression, developed specifically for the Psynary system. This novel patient-rated outcome measure was designed to fulfil the following key requirements: 1. Items encompass symptoms commonly attributed to clinical domains of: mood; psychomotor changes; vegetative symptoms (sleep and appetite); cognitive symptoms (e.g. concentration and forgetfulness); and anxiety; 2. Items must be easy to understand for patients and there must be clear reference points for scoring each item; 3. Items must have clinical utility for clinicians and cover the range of clinical questions typically addressed when assessing depression severity; 4. Cut-offs for the total score must align with National Institute of Clinical Excellence (NICE) definitions of depression severity, enabling the use of this metric to stage interventions for depression in accordance with the NICE guidelines [10]; 5. The total scores from the repeated completion of the novel questionnaire must be sensitive to the clinical effects of treatment and, therefore, must reliably detect a significant response to treatment at an early stage.
This study focuses on the internal validation, external validation and factor analysis of the R8 Depression scale.

Study design
OptiMA1-NZ and OptiMA1-Japan are parallel studies adopting the same methodology in New Zealand and Japan respectively to validate the primary outcome measures used by an online system, Psynary. This paper describes the validation of its primary depression outcome measure: the R8 Depression. Data from both OptiMA1-NZ and OptiMA1-Japan studies were combined for analysis in this paper. The two studies were approved by the clinical research ethics committees of University of Otago (New Zealand) and Asai Hifuka Institutional Review Board (Japan).
Participants were recruited from patients registered on the online Psynary system by the public community mental health clinic at Nelson Marlborough District Health Board (NMDHB) in New Zealand and by the private clinic serving the Tokyo English-speaking expat population, American Clinic Tokyo (ACT) in Japan. Patients with probable mood or anxiety disorders who registered to the online Psynary system between March 24th 2016 and October 25th 2018 were invited to complete either an online or written consent process prior to participating in the study. Inclusion criteria included completing psynary in the English language, being over 18 years of age for NZ, or 20 years of age for Japan, and having an ICD-10 (International Classification of Diseases) [16] diagnosis of a current depressive episode (unipolar or bipolar) or anxiety disorder (ICD-10 F31.3, F31.4, F31.81, F32.1, F32.2, F33.1, F33.2, F40-F43) confirmed by the treating clinician at their initial appointment.
As part of the Psynary assessments, participants were guided through and asked to complete the R8 Depression as well as the Patient Health Questionnaire (PHQ-9) [17]. A maximum of 6 assessments were included for each participant; one baseline assessment and up to five follow-up/review assessments if completed within 90 days from baseline. Review data were included to ensure the full range of depressive symptom severities was captured as patients progressed through their recovery.
The PHQ-9 was selected as the primary comparison measure for the external validation due to its widespread international use in primary care, which is the target clinical environment for use of Psynary. There were a priori concerns that the restricted number of items of the PHQ-9 may limit its sensitivity, so, where possible, Quick Inventory of Depressive Symptomatology (QIDS-SR16) questionnaires, which consists of 16 items, were also collected in a smaller sub-sample at the Tokyo clinic, thus providing a further opportunity to triangulate external validation between the questionnaires. Licensing restrictions prevented QIDS from being incorporated into the electronic platform.
The Psynary system is completely anonymous with patients being allocated a username (a colour and a number) on registration and a temporary password which they change after their first login. All patients complete a generic consent process when using Psynary for the first time. Additional consent forms for the OptiMA1 studies were completed. It is worth noting that Psynary does not collect any personal identifiable information. For instance, Psynary collects approximate age (to the nearest year) but does not collect the date of birth. Participating clinics keep their own records linking Psynary usernames with patient identification, which is held on their clinical information systems and not shared with Psynary.
The clinics using Psynary initiated and optimised treatments for mood and anxiety disorders based on established local and international guidelines [18]. Patients were encouraged to complete Psynary reviews every 1 to 2 weeks or prior to clinic appointments. The baseline assessment takes 40 min to complete on average while review assessments take 10 min.
In addition to the clinical sample, a normative sample was recruited, comprising friends, relatives and work colleagues of the international research teams. Personal e-mail invitations were cascaded as per the research ethics framework of the studies. A description of the study and a link to the online R8 Depression and PHQ-9 questionnaires were provided in the e-mail, which was entirely anonymous. In addition to the R8 Depression and PHQ-9, information regarding gender, age, nationality, first language and educational background was also collected for the normative sample. Finally, participants in the normative arm of the study were asked about current and past treatment for depression and family history of depression.

Description of clinical metrics
The R8 Depression is a 30-item-questionnaire that covers all the symptom domains included in the ICD-10 classification of depression, as well as commonly associated symptoms, e.g. anxiety, and symptoms associated with melancholia and atypical depression. Each symptom item is scored on a 0 to 3 scale. For items covering appetite increase or reduction and weight gain or loss, the highest scores on either item are used. Therefore, 28 items are summed to give the total score, the maximum score being 84. To ease interpretation on Psynary the R8 Depression score is calculated as a percentage of this total score. The development of the R8 Depression is described below, and the questionnaire is reproduced in "Appendix".
The PHQ-9 is a widely used international measure of depressive symptoms used to screen for depression and measure outcomes to treatment [19]. It is a 9-item-scale with each item rated from 0 to 3 and individual items summed to generate a total score. There are well-established thresholds for remission and mild, moderate, moderately severe and severe depression.
The QIDS-SR16 is a 16-item-scale developed from the larger 30-item Inventory of Depressive Symptomatology (IDS-SR30). It is a self-reported questionnaire which is also widely used in practice and has strong psychometric properties with appropriate sensitivity to change [20].

Development of R8 Depression
The starting point for development of the R8 Depression was identifying a sufficient number of items to cover all the clinical domains included in the ICD-10 classification of depression: low mood (items 1 and 11); anhedonia (item 2); fatigue (item 6); poor concentration (item 28) and forgetfulness (item 10); reduced self-esteem/confidence; excessive guilt (item 8) and unworthiness (item 4); hopelessness (item 3); suicidal ideation (item 22); disturbed sleep; disturbed appetite. The somatic domain qualifier in ICD 10 equates to melancholic depression and includes symptoms of: loss of emotional reactivity, early morning awakening; psychomotor agitation (item 13) or retardation (item 9); weight loss (item 7); loss libido (item 15). Further items were then added to encompass important symptoms or problems routinely enquired about in specialist psychiatry clinics when assessing presentations of depression. For instance, specific reference to health anxiety (item 14) was included due to the prevalence of this symptom in depression, particularly amongst older patients [21]. Due to the prevalence of somatic symptoms as proxy-presentations for depression across many cultures item 20 refers to experience of unpleasant physical symptoms [22]. As anxiety symptoms are reported as occurring in up to 90% of patients presenting with depression [23], it is important to include an item relating to this (item 16). The identified subgroup of atypical depression tends to present with increased sleep and appetite, and hence the scale needed to determine both abnormal increases and decreases in vegetative symptoms and weight. Stemming from this, items relating to appetite (items 25 and 27) and weight (items 7 and 12) change were split into appetite reduction and increase, and weight loss and gain items. A logic operation was then introduced into the scoring of the R8 Depression to include only the highest rated of either of these pairs of items.
Due to sleep architecture being differentially affected in depression, it was important clinically to delineate falling asleep (item 30), sleep disturbance (item 26), early morning awakening (item 23), and increased sleep (item 18). This required four separate items. Aspects of social and daily functioning are commonly affected by depression and are an important focus of psychiatric assessment. Hence items of socialising (item 5), irritability (item 21), indecisiveness (item 29), motivation (item 24), daily activities (item 19), and sensitivity to criticism (item 17) have been included.
A four point Likert scale for items was already established as a standard amongst existing depression rating scales and it was decided to adhere to this standard. The anchor points are specifically defined for each item with the intent of mirroring the types of questioning used in psychiatric assessment. The extreme anchor points for each item were defined to indicate absence of that particular symptom or problem through to the most extreme clinical presentation. In this respect, it was important to draw upon clinical psychiatric expertise of the types of severe depression often requiring inpatient admissions and treatments such as Electroconvulsive Therapy (ECT). An example is the upper anchor point for the item on guilt which refers to delusions of guilt. One expectation of the R8 Depression was its ability to capture all gradations of severity of depression seen across the clinical spectrum from primary care to secondary care settings, without encountering a ceiling effect for the ratings.
The wording of all anchor points for items went through many reviews and revisions. There was an initial attempt to use plain English and avoid convoluted sentences, often seen in other scales. There were multiple reviews of wording by non-specialist, non-clinicians in a normative sample and patients.
In particular, previous work on translating another widely used depression rating scale [24] had identified the importance of subtle phrase variations in distinguishing between the absence of a symptom or problem and the threshold for indicating its mild occurrence (i.e. scoring between 0 and 1 on an item). Statistical analysis of responses in the initial normative population field testing revealed outlier items with significantly increased frequency of scores greater than 0 where subtle adjustments to the phrasing of the second anchor point had to be made.
The process of translating the R8 Depression into Japanese, a language that is very precise and specific, revealed ambiguities in the initial English phrasing for certain item anchor points. Considering international translation early in a clinical scale's development is an important tool in further refining the clarity of item anchor points and the scale's generalizability globally.
Valid criticisms have been raised against current depression outcome measures employing simple summation of symptom item ratings to generate a total severity score, arguing that different symptoms may contribute differentially to illness burden, that there is evidence of differential variation if symptoms change over time during recovery, and that current unitary models of depression are likely to conceal important sub-syndromes or entirely separate conditions [25,26]. The Psynary platform will allow for the development of nuanced latent scoring approaches for the R8 Depression, that will map to important sub-syndromes, accurately and sensitively capture response to treatment and ultimately contribute to treatment response prediction. This is an expected stage of development once the database has reached a greater level of maturity. At this stage in the system development, however, it is important for the R8 Depression to reflect existing norms for outcome measures and to accurately map total scores onto existing definitions of remission and depression severity and hence to integrate within widely used clinical guidelines that used such severity categories to stage treatments for depression [10].

Analysis
The distributions of the R8 Depression and PHQ-9 scores were calculated across the various samples, to assess the performance of these metrics across varying presentations of depression severity, particularly to identify potential ceiling or floor effects associated with the measures, and to evaluate the suitability of either parametric or non-parametric analyses.
The Cronbach's alpha coefficient was calculated to assess the internal reliability for the completed assessments, which enabled direct comparison with other clinical measures that have been published, including depression questionnaires [27,28]. Given the non-Gaussian data distributions when the baseline data was combined with the reviews and normative data, the Guttman's lower bounds method [29] was also computed for further means of internal validation.
To explore the underlying factors of the R8 Depression and to understand the distribution of questions per factor, a conjunction of Principal Components Analysis (PCA) and Exploratory Factor Analysis (EFA) was implemented. Baseline data was used and included all variables except libido due to the low loadings in factors and lowering of the explained variance. Weight gain/weight loss and increased appetite/loss of appetite variables were reduced to one variable each called "weight change" and "appetite change". PCA was first used to ascertain the validity of the component reduction procedure and to quantify the number of factors that are underlying in the data. Due to the highly correlated factors, the Direct Oblimin (Delta = 0) rotation method was applied, obtaining results that satisfy the assumption of sampling adequacy of the whole dataset via the Kaiser-Meyer-Olkin (KMO) test. To further ascertain the contents of these factors, EFA was run with the same parameters as the PCA. The Kaiser criterion of including factors with Eigenvalues greater than 1 was set as the method for determining factor solutions prior to the analysis.
While external reliability of the R8 Depression was assessed by calculating the Pearson's product-moment correlation coefficients in line with most other depression rating scale validation studies, the non-normal distribution of the data set should preference the use of Kendall's tau [30,31]. Both these tests were used to examine correlations between: (a) R8 Depression scores and PHQ-9 scores; and (b) R8 Depression scores and QIDS scores. Tests were two-tailed and the p significance value was 0.05. The external validity was tested for three progressively larger datasets: baseline assessments only for the clinical sample, both baseline and review assessments for the clinical sample, and clinical (baseline and reviews) and normative samples. These datasets were expected to represent different distributions of depression severity. The baseline clinical sample was expected to represent the more severe range of depression. The baseline and review samples, i.e. the total clinical sample, was expected to include patients who had experienced various degrees of recovery and, therefore, to be skewed towards mild and moderate depression. The largest sample including the normative data was expected to show a distribution of depression symptom severity skewed more towards remission. These different samples were analysed to quantify the effect of varying distributions of depression severity on the scores of external validities. The QIDS sample enabled a triangulation to assess external validity between the R8 Depression, PHQ-9 and QIDS.
For the combined clinical and normative sample used to establish severity categories for the R8 Depression, there was sufficient improvement in the uniformity of the data for an assumption of normality to be fulfilled in regards to the use of the linear regression model. Several comparisons were made between the R8 Depression, PHQ-9 and QIDS with sub-samples represented as a linear regression equation.
All statistical analyses were performed using SPSS version 24 (IBM Corp. Released 2016. IBM SPSS Statistics for Windows, Version 24.0. Armonk, NY: IBM Corp.).

Results
The study included data from 270 patients, 62 of which were from the New Zealand clinic and 208 participants from the Tokyo clinic. Data from 854 follow-up assessments were included, accounting for a total of 1124 clinical Psynary assessments. A total of 193 QIDS questionnaires were completed to assess a second further validation against current gold standard clinical measures.
A normative sample of 204 participants completed the R8 Depression and of these 187 also completed the PHQ-9.

Sample characteristics
53.8% (n = 144) of the 270 patients were female. The mean age was 34.3 years and Table 1 shows the distribution of patients in different age groups (range 18-72 years). The mean duration of episode of mood or anxiety disorder prior to presentation to the clinics was 20.6 months. Nearly a third of patients (31.5%) had more than one treatment change prior to visiting the clinic. Their employment status is shown in Table 1, with a total of 70.4% reporting being in employment or self-employed (both part-time and full), which shows retainment of a reasonable degree of functioning amongst the sample.
For the normative sample (n = 185), 71% of the patients were female, with a median age range of 30 to 39 years. 6% of the normative sample reported to be currently receiving treatment for depression, whilst 21% had a past history of treatment for depression and 46% had a family history of depression. This is expected given the prevalence of depression in the community [32].
The clinical diagnoses determined by the treating clinician are shown in Table 2. 34.9% were diagnosed with a moderate depressive episode and 52.6% were diagnosed with a severe depressive episode. Those who were determined not to have a depressive episode (7.8%) or mild episode (12.4%) had an anxiety disorder as their primary diagnosis.
For the 270 baseline clinical Psynary assessments completed, the mean R8 Depression score was 38.6 (± 16.7) (maximum score of 84) and the mean PHQ-9 score was 13.8 (± 7.7). Table 3 shows how the means of the total scores changed from baseline through consecutive reviews for both the R8 Depression and PHQ-9.
The distribution of the R8 Depression and PHQ-9 scores at baseline (Fig. 1) and distribution of baseline plus review scores (Fig. 2) are shown. Figure 1 shows that at baseline, the R8 Depression shows a clearer normal distribution without the ceiling effect apparent in the PHQ-9 distribution. Figure 2 shows the total clinical sample including the reviews, with a clear negative skew, expected as patients recover from their depression. The R8 Depression appears to show less of a floor effect compared to the PHQ-9. Table 4 shows the Cronbach's alpha coefficient was 0.91 for the R8 Depression compared to 0.88 for the PHQ-9 for the baseline sample, which is high in comparison to published studies [20,27,28].

Internal validity
The Guttman's lower bounds reliability scores, appropriate for the data distributions of interest, across the baseline plus reviews sample was high at 0.92 for the R8 Depression compared to 0.88 for the PHQ-9, as well as for the whole clinical sample (baseline plus reviews  Post-traumatic stress disorder 4.1 (11) plus normative) with scores of 0.93 compared to 0.88, respectively. Further detail relating to internal validity of the R8 Depression and differences between normative and clinical samples is included in the Additional file 1. This includes: the internal validity results for the R8 Depression and PHQ-9 separated for the normative and clinical samples; and significance testing of mean differences  between the normative and combined clinical samples for total R8 Depression scores for the identified six subdomains of the R8 Depression, and also for the total PHQ9 scores. Further to satisfying the assumption of sampling adequacy of the whole dataset via the Kaiser-Meyer-Olkin (KMO) test (KMO = 0.906, Table 3), the results from a factor analysis performed via Principal Components Analysis [33] showed that a six factor or component solution accounts for 58.6% of the variance in the data. When Exploratory Factor Analysis was conducted with the same parameters as the PCA (Table 5), the following factors were obtained: 1. Low mood (32.7%); 2. Sleep disturbance (6.7%); 3. Low energy (5.7%); 4. Appetite and weight change (5.1%); 5. Poor cognition (4.4%), and 6. Anxiety (4.0%) Further information relating to the factor analyses is included in the Additional file 2. This includes: correlations between extracted R8 Depression factors and PHQ-9 scores for the whole sample; mean extracted R8 Depression factors and PHQ-9 scores at baseline and subsequent reviews; and the pattern matrices for factor analyses of the R8 Depression in the normative sample and baseline clinical samples. Table 6 shows the Pearson's correlation coefficients between the R8 Depression and the PHQ-9 for baseline sample (0.83, p < 0.001), baseline plus reviews sample (0.90, p < 0.001), and the clinical plus normative data samples (0.91, p < 0.001). For the opportunistic QIDS sample (n = 193), Pearson's correlation coefficient was 0.90 between R8 Depression and PHQ-9, 0.84 between R8 Depression and QIDS, and 0.80 between PHQ-9 and QIDS. The more appropriate and rigorous Kendall's tau coefficient was lower but still highly significant (< 0.001) at 0.76 for the entire combined clinical and normative sample.

External validity
Further information relating to external validity is included in the Additional file 3. This includes: a table of correlation coefficients between total R8 Depression and PHQ-9 scores; and scatter plots between the total scores of the R8 Depression and the PHQ-9, for the normative sample, baseline clinical sample and the baseline plus review clinical sample. Figure 3 shows a plot of the R8 Depression scores against the PHQ-9 scores for the whole clinical and normative datasets. This plot was used to help determine the threshold values between the four categories of "no depression", "mild depression", "moderate depression" and "severe depression" corresponding to the NICE treatment guidelines, as shown in Table 7.

Discussion
The R8 Depression was developed specifically for use as part of an automated online assessment and monitoring platform, Psynary, to assist clinicians optimising treatments for common mood and anxiety disorders. The scale was designed to cover the full range of severity presentations of depression encountered in both primary and secondary care settings. The items were informed by specialist clinical practice to encompass all the domains of depressive symptoms, and were worded to achieve optimal ease of use for patients, relevance to clinical assessments, and ease of translation to other languages. OptiMA1 has established the internal and external validation of this measure.
The parallel studies, sited in both private care and specialist public care settings helped to ensure a broad spectrum of presentations of depression and a representative clinical sample. The inclusion of baseline and review assessments captured the patient journey towards recovery and also ensured a wide distribution of severity scores.
A detailed comparison of the distributions of scores clearly highlights an important ceiling effect in the PHQ-9 (Figs. 1, 2), which limits this questionnaire's ability to discriminate the more severe presentations of depression. Conversely, the R8 Depression approximates a normal distribution of scores more closely at baseline  Fig. 1), capturing and delineating the more severe cases of depression and not exhibiting any ceiling effects. This property is likely to be important in accurately detecting subtle changes in depression severity early during treatment.
The internal validity of the R8 Depression is excellent, exceeding that of the PHQ-9. The Principal Component Analysis together with the Exploratory Factor Analysis suggests a sub-scale structure that aligns with a clinically meaningful separation of depressive symptoms into mood, psychomotor, neuro-vegetative, cognitive and anxiety domains. This in part replicates previous factor analyses of depressive questionnaires used in STAR*D and Genome-based Therapeutic Drugs for Depression (GENDEP) clinical studies. In the latter, the psychomotor domain encompassing symptoms of interest and activity,  appeared to be important in predicting poor response to antidepressant treatments [34]. Future analyses of naturalistic outcomes in the OptiMA1 will attempt to replicate these results. A highly significant degree of external validity was demonstrated, with the Pearson correlation coefficient between the R8 Depression and the PHQ-9 with the larger clinical and normative datasets exceeding 0.90 for the entire sample. When adopting the more robust test for the non-normal distribution of interest with significant outliers, i.e. Kendall's tau method [30], the correlation coefficient is lower (0.76) but still highly significant (p < 0.001). The sample that included paper-based administration of the QIDS allowed for a useful triangulation to further ensure external validity, with the R8 Depression showing higher correlation with the QIDS than the correlation between the PHQ-9 and the QIDS (Pearson's correlation coefficient = 0.84 vs 0.80).
While there has been an implicit acceptance of deviations from normality in the distribution of depression severity scores in the literature of scale validations, this paper has been explicit in its testing of assumptions of  normality for the dataset. While it is reasonable to continue the tradition and widespread use of parametric analyses such as Pearson's correlation and Cronbach's alpha, this paper would advocate for the inclusion of more rigorous non-parametric approaches as well. The use of large paper-based questionnaires in a clinical setting is problematic due to the constraints of time availability and complexity of calculating total scores, hence the widespread adoption of simpler tools such as the PHQ-9. The development of the R8 Depression within the automated web-based environment of Psynary obviates these constraints, enabling patients to rate their outcomes in their own time and at their own pace, away from the time-limited environment of the clinic. Importantly, this opens the opportunity of routine and detailed tests of patient-rated outcomes for depression in a clinical setting. The system captures and retains this information allowing for quantitative feedback of treatment response over time. This advantage has the potential of facilitating the creation of a framework to enable realtime routine measurement of patients' symptoms to aid early detection of treatment response and a faster optimisation of treatment regimes.
The online Psynary platform also enables a cost-effective means of conducting clinical studies, including automation of the consent process in those jurisdictions that allow. This opens up the potential for recruiting the large clinical samples that will be needed for future clinical studies in mental health, particularly to develop strategies to personalise treatments and achieve rapid optimisation.
Having validated the R8 Depression questionnaire, the next step will be establishing the measure's sensitivity to change. This is an essential prerequisite to extending the use of the Psynary system to aid early detection of treatment response in patients with depression.

Conclusion
The R8 Depression has been developed as a patient-rated outcome measure for depression that is automatically administered via an online system called "Psynary". It captures all the symptom domains of depression and, as validated in this study, it correlates well with current gold standard clinical scores, and has excellent internal and external reliability. It also has the potential for accurate measurement of early treatment response.
Future work is underway to assess the sensitivity to change and the predictive value for treatment optimisation.

R8 Depression items
In general, over the past week…