functional testing

But does it measure what I want it to?

While there are thousands of assessment tools available for various aspects of pain and function, one of the most important things to consider is content validity – does the assessment measure what I want it to measure? Reliability is all very well, and ensures consistency, but if the test doesn’t measure anything useful or important, then it’s not going to be very helpful!

This article, published in 2006, is one of the few that seeks to conduct a qualitative evaluation of the content of several questionnaires but base it on a reasonably sound theoretical framework with relatively solid methodology to ensure other researchers can conduct the same process. So far, however, I haven’t found much to compare it with – but it’s a helpful study in terms of helping clinicians define exactly what they want to include in an assessment battery, even if it concludes that there are gaps in the existing repertoire!

Sigl, Cieza, Brockow, Chatterji, Kostanjsek, and Stucki set about comparing three very common low back pain measures using the International Classification of Functioning, Disability and Health (ICF) approved by the World Health Assembly in May 2001. Their intention was twofold: to review whether three common instruments cover the areas in the ICF, and whether the ICF can function as a somewhat atheoretical framework for comparing different instruments.

To review: the ICF is a multipurpose classification belonging to the WHO family of international health classifications. Part 1 covers functioning and disability and includes the components ‘‘body functions’’ (b) and ‘‘structure’’ (s) and ‘‘activities and participation’’ (d). Part 2 covers contextual factors and includes the components ‘‘environmental factors’’ (e) and ‘‘personal factors.’’

To quote directly from the WHO, ‘The ICF puts the notions of ‘health’ and ‘disability’ in a new light. It acknowledges that every human being can experience a decrement in health and thereby experience some degree of disability. Disability is not something that only happens to a minority of humanity. The ICF thus ‘mainstreams’ the experience of disability and recognises it as a universal human experience. [my emphasis – BFT] By shifting the focus from cause to impact it places all health conditions on an equal footing allowing them to be compared using a common metric – the ruler of health and disability. Furthermore ICF takes into account the social aspects of disability and does not see disability only as ‘medical’ or ‘biological’ dysfunction. By including Contextual Factors, in which environmental factors are listed, ICF allows to record the impact of the environment on the person’s functioning.’

I quite like the ideal of ‘everyone’ having both limitations and abilities, and especially the idea that limitations are contextual. I’m not sure that this model has yet had an impact on the systems in which we usually work, however! I use the idea that everyone has abilities and everyone has limitations when working with people experiencing chronic pain – it has the effect of encouraging people to focus on their abilities rather than defining themselves by their limitations. The flow from conceptual ideals to measurement and implementation of these ideas takes time, and because it’s a nonmedical concept, it’s unlikely to have a significant impact on health delivery systems for many years yet.

Back to the article…
The methodology is well-described in the article – three clinicians already trained in the ICF were used. Two reviewed the content, and linked the items in the questionnaire to a content area in the ICF, applying 10 different linking rules to the items, and then compared the identified concepts and selected ICF categories to establish a Kappa statistic. If disagreement occurred, a third person trained in the ICF and in the linking rules was consulted, and independently determined how the item should be classified.

There were clear guidelines on how linkages were to be developed, although these are not provided in the article itself. Several examples, however, demonstrate how different items were allocated categories. For example: ‘If an item of a measure contains more than one concept, each concept has to be linked separately. For example, in the item of the ODI ‘‘Pain doesn’t prevent me from walking any distance,’’ the concepts ‘‘pain’’ and ‘‘walking’’ were linked to ‘‘b28013 pain in back’’ and ‘‘d450 walking,’’ respectively. The response options of an item are linked to the ICF if they refer to concepts other than those contained in the corresponding item. For example, in the item 14 ‘‘sleeping’’ of the NASS, in which two of the response categories of the item are ‘‘I sleep well’’ and ‘‘pain interrupts my sleep,’’ the concept ‘‘sleeping’’ was linked to the ICF category ‘‘b134 sleep functions,’’ the concept ‘‘sleep well’’ to ‘‘b1343 quality of sleep,’’ and the concept ‘‘interrupts my sleep’’ to ‘‘b1342 maintenance of sleep.’’ If an item/concept is not contained in the ICF classification, it is labeled ‘‘nc’’ (not covered by the ICF). ‘‘nc’’ does not differentiate between concepts relating to function not covered by the ICF, concepts relating to personal factors for which no categories currently exist, and other concepts relating to aspects like time and space.’

Although this sounds tedious to read here, I’m certain that the process ensures precision and enables the majority of items to be appropriately categorised.
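For anyone who likes to see a rule written down as a procedure, here is a minimal sketch of the ‘one concept, one link’ idea in Python. The lookup table is my own construction and contains only the examples quoted above; the function names are hypothetical, and in the study itself the linking was a matter of trained clinical judgement, not a table lookup.

```python
# Hypothetical lookup built only from the examples quoted from the article.
# In the study, trained raters made these links by judgement.
CONCEPT_TO_ICF = {
    "pain": "b28013",                 # pain in back (low back pain context)
    "walking": "d450",                # walking
    "sleeping": "b134",               # sleep functions
    "sleep well": "b1343",            # quality of sleep
    "interrupts my sleep": "b1342",   # maintenance of sleep
}

# The first letter of an ICF code names its component.
COMPONENTS = {
    "b": "body functions",
    "s": "body structures",
    "d": "activities and participation",
    "e": "environmental factors",
}


def link_concepts(concepts):
    """Link each concept separately; anything not in the table becomes 'nc'."""
    return [(concept, CONCEPT_TO_ICF.get(concept, "nc")) for concept in concepts]


def component_of(code):
    """Return the ICF component for a code, or a label for 'nc'."""
    if code == "nc":
        return "not covered by the ICF"
    return COMPONENTS[code[0]]


# The ODI item "Pain doesn't prevent me from walking any distance"
# contains two concepts, so it produces two links.
for concept, code in link_concepts(["pain", "walking", "time of day"]):
    print(concept, code, component_of(code))
```

The point of the sketch is only that each concept in an item gets its own link, and that anything without a matching category falls through to ‘nc’.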

Well, the first thing to establish is whether the two (and occasionally three) clinicians agreed on the categories to which they allocated items. The Kappa statistic was used to determine agreement, with adjustment made for the skewness of the sample (from high Kappa values and small sample size) by using a bootstrapping technique of sampling from percentiles based on the observed data. The results showed that the range of agreement was from 0.67 at the broadest level of category through to 1.0 (or total agreement) at the fourth level. To illustrate this, an example selected from the component ‘‘body functions’’ is presented below:
b2: Sensory functions and pain (first level) – at this level there was a small level of disagreement
b280: Sensation of pain (second level)
b2801: Pain in body part (third level)
b28013: Pain in back (fourth level)
b28018: Pain in body part, other specified (fourth level) – at this level, there was total agreement

This demonstrates very good inter-rater reliability, although it should be appreciated that there were only three individuals involved. A larger number of raters would have provided a much better determination of the accuracy of this approach to content validation – but would also increase the time required to do it!
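To make the agreement calculation a little more concrete, here is a rough sketch of how Kappa at each ICF level, with a bootstrapped interval, could be computed. I am assuming Cohen’s kappa for two raters and a simple percentile bootstrap; the article’s exact procedure may well differ. The ratings are invented purely for illustration and are not the study’s data.

```python
import random
from collections import Counter

# ICF codes are hierarchical: a letter plus 1, 3, 4 or 5 digits
# (b2 -> b280 -> b2801 -> b28013), so a level is just a prefix length.
LEVEL_LENGTH = {1: 2, 2: 4, 3: 5, 4: 6}


def truncate(code, level):
    """Cut an ICF code back to a level, e.g. b28013 -> b280 at level 2."""
    return code[:LEVEL_LENGTH[level]]


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' labels on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category throughout; kappa is
        # undefined here, and returning 1.0 is a convention I am assuming.
        return 1.0
    return (observed - expected) / (1 - expected)


def bootstrap_ci(a, b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Invented ratings for eight concepts (NOT data from the article).
rater_1 = ["b28013", "b28013", "d450", "b134", "b1343", "d430", "b28013", "d450"]
rater_2 = ["b28013", "b2801", "d450", "b134", "b1342", "d430", "b28018", "d455"]

for level in (1, 2, 3, 4):
    a = [truncate(code, level) for code in rater_1]
    b = [truncate(code, level) for code in rater_2]
    print(level, round(cohen_kappa(a, b), 2), bootstrap_ci(a, b))
```

With so few items the bootstrapped interval will be wide, which is the same caution as above about small numbers.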

Now, for the real work of this study: what areas were covered by the three assessment tools, and which areas were not well-covered?

  • The representation of body functions is similar in all three measures incorporating pain and sleep.
  • All three questionnaires contain a similar number of concepts representing the ICF component “activities and participation.’’
  • None of the selected instruments covered aspects of remunerative work (d850). ‘‘Domestic life, other specified’’ (d698), which had to be linked for carrying out household tasks (‘‘doing any of the jobs that I usually do around the house,’’ ‘‘heavy jobs around the house’’), is applicable only for the RMQ.

The two research questions were: whether three common instruments cover the areas in the ICF, and whether the framework was a useful way to determine content.

  1. It was found that yes, all three instruments cover aspects of the ICF – to varying extents. Only one looked at the psychological impact of pain, and none looked at factors such as fatigue that are well-known to be associated with poorer function. Interestingly, none of the measures looked at ‘context’ – for example, ‘attitudes of immediate family members or friends or society are important prognostic determinants for life satisfaction, work performance, and disability in patients with back pain. This also holds true for remunerative work, which is not covered by any of the measures.’
  2. The second question was whether the ICF could be helpful as a framework – one use of this type of comparison work is to create an item bank. Item banks consist of large sets of questions representing various levels of a latent variable that can be used to develop brief, efficient scales for measuring that latent variable. Using Rasch analysis, items that measure the variable of interest can be identified and selected to form a measurement tool that precisely assesses that specific level of function.
  3. The first finding alone is interesting – why have these very important areas of function been ignored? Does this reflect the western idea that ‘the person with the disability’ exists in isolation?
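As an aside on the item bank idea: the sketch below shows, under my own assumptions, how items might be picked from a bank to target a particular level of function. It uses the standard dichotomous Rasch model, where the chance of endorsing an item depends only on the gap between the person’s level (theta) and the item’s difficulty. The item names and difficulty values are hypothetical, and the article does not describe this selection step.

```python
import math


def rasch_probability(theta, difficulty):
    """Dichotomous Rasch model: P(endorse) = 1 / (1 + exp(-(theta - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def item_information(theta, difficulty):
    """Information an item gives at theta; largest when difficulty equals theta."""
    p = rasch_probability(theta, difficulty)
    return p * (1.0 - p)


def select_items(bank, theta, k):
    """Pick the k items from the bank that are most informative at theta."""
    ranked = sorted(bank, key=lambda name: item_information(theta, bank[name]), reverse=True)
    return ranked[:k]


# Hypothetical item bank: item name -> difficulty (in logits). Invented values.
bank = {
    "walk around the house": -2.0,
    "carry groceries from the car": 0.1,
    "hang out the washing": 0.0,
    "lift a heavy crate to a table": 3.0,
}

# A short scale aimed at someone of roughly average function (theta = 0).
print(select_items(bank, theta=0.0, k=2))
```

The design choice here is simply to prefer items whose difficulty sits close to the level being measured, because those are the items that discriminate best at that level.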

    The final comment I want to make is about the usefulness of this research from a clinical perspective. Key areas that are well-known to be important both to people with pain, and to funders of health care and compensation are not included in three commonly-used assessment tools. Perhaps if these agencies could see their way to fund this type of comparison, it might be possible to develop supplementary measures to ensure this information is available for use in clinical situations.

    Sigl, T., Cieza, A., Brockow, T., Chatterji, S., Kostanjsek, N., Stucki, G. (2006). Content Comparison of Low Back Pain-Specific Measures Based on the International Classification of Functioning, Disability and Health (ICF). Clinical Journal of Pain, 22(2), 147-153.

    World Health Organization. International Classification of Functioning, Disability and Health: ICF. Geneva: WHO, 2001.

    Schultz IZ, Crook JM, Berkowitz J, et al. Biopsychosocial multivariate predictive model of occupational low back disability. Spine. 2002;27:

    Takeyachi Y, Konno S, Otani K, et al. Correlation of low back pain with functional status, general health perception, social participation, subjective happiness, patient satisfaction. Spine. 2003;28:1461–1466.

‘it’s taken over my life’…

Each time I spend time listening to someone who is really finding it hard to cope with his or her pain, I hear the unspoken cry that pain has taken over everything. It can be heartbreaking to hear someone talk about their troubled sleep, poor concentration, difficult relationships, losing their job and ending up feeling out of control and at the mercy of the grim slave-driver we call chronic pain. The impact of pain can be all-pervasive, and it can be hard to work out what the key problems are.

To help break the areas down a little, I’ve been quite arbitrary really. I’m going to explore functional limitations in terms of the following:
1. Movement changes such as mobility (walking), manual handling, personal activities of daily living
2. Disability – participation in usual activities and roles such as grocery shopping, household management, parenting, relationships/intimacy/communication
3. Sleep – because it is such a common problem in pain
4. Work disability – mainly because this is such a complex area
5. Quality of life measures

The two following areas are ones I’ll discuss in a day or so – they’re associated with disability because they mediate the pain experience and disability…as I mentioned yesterday, they’re the ‘suffering’ component of the Loeser ‘rings’ model.
6. Affective impact – things like anxiety, fear, mood, anger that are influenced by thoughts and beliefs about pain and directly influence behaviour
7. Beliefs and attitudes – these mediate behaviour often through mood, but can directly influence behaviour also (especially treatment seeking)

There are so many other areas that could be included as well, but these are some that I think are important.
Before I discuss specific instruments, I want to spend yet more time looking at who and how – and the factors that may influence the usefulness of any assessment measure.

Who should assess these areas? Well, it’s perhaps not who ‘should’ but how these areas can be assessed in a clinical setting.

Most clinicians working in pain management (doctors, psychologists, occupational therapists, physiotherapists, nurses, social workers – have I missed anyone?) will want to know about these areas of disability but will interpret findings in slightly different ways, and perhaps assess by focusing on different aspects of these areas.

As I pointed out yesterday, there are many confounding factors when we start to look at pain assessment, and these need to be borne in mind throughout the assessment process.

How can the functional impact of pain be assessed?

  • Self report, eg interview, questionnaires – and the limitations of these approaches are reliability, validity threats as well as ‘motivation’ or expectancies
  • Observation, either in a ‘natural’ setting such as home or work, or a clinical setting
  • Functional testing, again either in a ‘natural’ setting such as home or work, or a clinical setting – and functional testing can include naturalistic procedures such as the AMPS assessment, formal and structured testing such as the 6 minute walk test, the sock test, or even certain functional capacity tests; or it may be clinical testing such as manual muscle testing or range of movement, or even Waddell’s signs

All self report measures, whether they’re verbal questions, interview or pen and paper measures are subject to the problem that they are simply the individual’s own perception of the degree of interference they attribute to pain. The accuracy of this perception can be called into question especially if the person hasn’t carried out a particular activity recently, but in the end, it is the person’s perception of their abilities.

All measures need to be evaluated in terms of their reliability and validity – how much can we depend on this measure to (1) assess current status (2) contribute to a useful diagnosis (or formulation) (3) provide a basis for treatment decisions (4) evaluate or measure function over time (Dworkin & Sherman, 2001).

Reliability refers to how consistently a measure performs across time, people and clinicians.

Validity refers to how well a test actually measures what it says it’s measuring. The best way to determine validity is if there is a ‘gold standard’ against which the test can be compared – of course in pain and functional performance, this is not easy, because there is no gold standard! The closest we can come to is a comparison between, for example, a self report in a clinic on a pen and paper test compared with a naturalistic observation in a person’s home or workplace – when they’re not being observed.

Probably one of the best chapters discussing these aspects of pain assessment is Chapter 32, written by Dworkin & Sherman, in the 2nd Edition of the Handbook of Pain Assessment 2001 (DC Turk & R Melzack, Eds), The Guilford Press.

Importantly for clinicians working in New Zealand, or outside of North America and the UK, the reference group against which the client’s performance is being compared, needs to be somewhat similar to the population the client comes from.  Unfortunately, there are very few assessment instruments that have normative data derived from a New Zealand or Australasian population – and we simply don’t know whether the people seeking treatment in New Zealand are the same on many dimensions as those in North America.

I’m also interested in how well any instruments, whether pen and paper, observation or performance-based assessment, translate into the everyday context of the person. This is a critical aspect of pain assessment validity that hasn’t really been examined well. For example, the predictive validity (which is what I’m talking about) of functional capacity tests such as Isernhagen, Blankenship or other systems has never been satisfactorily established, despite the extensive reliance on these tests by insurers.

Observation is almost always included in disability assessment. The main problems with observation are:
– there are relatively few formal observation assessments available for routine clinical use
– they do take time to carry out
– maintaining inter-rater reliability over time can be difficult (while people may initially maintain a high level of integrity with the original assessment process, it’s common to ‘drift’ over time, and ‘recalibration’ is rarely carried out)

While it’s tempting to think that observation, and even functional testing, is more ‘objective’ than self report, it’s also important to consider that these are tests of what a person will do rather than what a person can do (performance rather than capacity). As a result, these tests can’t be considered infallible or completely reliable indicators of actual performance in another setting or over a different time period.

Influences on observation or performance-based assessments include:
– the person’s beliefs about the purpose of the test
– the person’s beliefs about his or her pain (for example, the meaning of it such as hurt = harm, and whether they believe they can cope with fluctuations of intensity)
– the time of day, previous activities
– past experience of the testing process

And of course, all the usual validity and reliability issues.
More on this tomorrow. In the meantime, you really can’t go far past the 2nd Edition of the Handbook of Pain Assessment 2001 (DC Turk & R Melzack, Eds), The Guilford Press.

Here’s a review of the book when the 2nd Edition was published. And it’s still relevant.


There are some very weird and crazy measures out there in pain assessment land… some of them take a little stretch of the imagination to work out how they were selected and what they’re meant to mean in the real world.

Functional measures are especially challenging – given that they are about what a person will do on a given day in a given setting, they are inherently prone to performance variation (test-retest reliability) and can’t really be held up as gold standards in terms of objectivity. Nevertheless, most pain management programmes are asked to provide measures of performance, and over the years I’ve seen quite a few different ones. For example, the ‘how long can you stand on one leg’ timed measure…the ‘sock test’ measure…the ‘pick up a crate from the floor and put it on a table’ measure…the ‘timed 50 m walk test’…the ‘step up test’… – and I could go on.

Some of these tests have normative data against age and gender, some even have standardised instructions (and some of these instructions are even followed!), and some even have predictive validity – but all measures raise the question – ‘why?’

I’m not being deliberately contentious here, not really… I think we as clinicians should always ask ‘why’ of ourselves and what we do, and reflect on what we do in light of new evidence over time. At the same time I know that each of us will come up with slightly different answers to the question ‘why’ depending on our professional background, experience, the purpose of the measure, and even our knowledge of scientific methodology. So, given that I’m in a thinking sort of mood, I thought I’d spend a moment or two noting down some of the thoughts I have about measures of function in a pain management setting.

  1. The first thing I’d note is that functional performance is, at least in part, a measure of pain behaviour. That is, it’s about what a person is prepared to do, upon request, in a specific setting, at a certain time of day, for a certain purpose. And each person who is asked to carry out a functional task will bring a slightly different context to the functional performance task. For example, one person may want to demonstrate that their pain is ‘really bad’, another may want to ‘fake good’ because their job is on the line, another may be fearful of increased pain or harm and self-limit, while another may be keen to show ‘this new therapist just what it’s like for me with pain’. As a result, there will be variations in performance depending on the instructions given, the beliefs of the person about their pain – and about the way the assessment results will be used, and even on the gender, age and other characteristics of the therapist conducting the testing. And this is normal, and extremely difficult to control.
  2. The second is that the purpose of the functional performance testing must be clear to the therapist and the participant. Let’s look at the purpose of the test for the therapist – is it to act as a baseline before any intervention is undertaken? is it to be used diagnostically? (ie to help assess the performance style or approach to activity that the client has) is it to establish whether the participant meets certain performance criteria? (eg able to sustain manual handling safely in order to carry out a work task) is it to help the participant learn something about him or herself? (eg that this movement is safe, that this is the baseline and they are expected to improve over time etc).  And for the participant? Is this test to demonstrate that they are ‘faking’? (or do they think that’s what it’s about?) Is it to help them test out for themselves whether they are safe? Is it a baseline measure, something to improve on?  Is it something they’ve done before and know how to do, or is it something they’ve not done since before they hurt themselves? You see, I can go on!!
  3. Then the functional measures must be relevant to the purpose of the testing. It’s no use measuring ‘timed get up and go’, for example, if the purpose of the assessment is to determine whether this person with back pain can manage his or her job as a dock worker. Likewise, if it’s to help the person learn about his or her ability to approach a feared task, then it’s not helpful to have a standardised set of measures (unless this is a set that is taken pre-treatment and again at post-treatment). This means the selection of the measures should at least include consideration of predictive validity for the purpose of the test. For example, while a ‘timed get up and go’ may be predictive of falls risk in an elderly population, it may be an inappropriate measure in a young person who is being assessed for hand pain. It’s probably more useful to have a slightly inaccurate measure that measures something relevant than a highly accurate measure that measures something irrelevant. For example, we may know the normative data for (plucking something out of the air here…) ‘standing on one leg’, but unless this predicts something useful in the ‘real world’, then it may be a waste of time.
  4. Once we’ve determined a useful, hopefully predictive measure, then it’s critical that the assessment process is carried out in a standard way. That means the whole process, not just the task itself. What do I mean? Well, there are multiple influences on performance, such as time of day, presence or absence of other people, and even the way the test is measured (eg if it’s timed with a stop-watch, when is the button pushed to start? When is it pushed to stop? Is this documented so everyone carries it out exactly the same way?). There is a phenomenon known as assessment drift (well, that’s what I call it!) where the person carrying out the assessment drifts from the original measurement criteria over time. This happens for all of us as we get more experienced, and as we forget the original instructions. Essentially we are a bit like a set of scales – we need to be calibrated just as much as any other piece of equipment. So the entire assessment needs to be documented right down to the words used, and the exact criteria used for each judgement.
  5. And finally, probably for me a plea from the heart – that the measures are recorded, analysed, repeated appropriately, and returned to the participant, along with the interpretation of the findings. This means the person being assessed gains from the process, not just the clinician, or the funder or requester of the assessment.

So over the Easter break (have a good one!), take a moment or two to think about the validity and reliability of the functional assessments you take. Know the confounds that may influence the individual’s performance and try to take this into account when interpreting the findings. Consider why you are using these specific measures, and when you were last ‘calibrated’. Make a resolution: ask yourself ‘what will this measure mean in the real world?’ And if, as I suspect most of us know, your assessments don’t reflect the reality of carrying the groceries in from the boot of the car, or pushing a supermarket trolley around a busy supermarket, or squeezing the pegs above the head to hang out the washing – well, there might be a research project in it!!