Wednesday, May 9, 2018

Using SNQuery to test FHIR subsets

When defining SNOMED CT subsets, the most common approach is to define them extensionally, that is, as an explicit list of codes. This is also the case for subsets published by national bodies or defined in FHIR specifications. Extensional subset definitions have several potential problems, such as leaving relevant concepts out of the subset, or homographs (words with the same spelling but different meanings) that lead to the wrong terms being chosen. Not using SNOMED CT Expression Constraints for subset definitions is a missed opportunity. We will use SNQuery to demonstrate how to create new subsets or validate existing ones in an easy way.

Subsets with logical definition (is-a)


The first example covers subsets that also carry a logical definition based on "is-a" relationships, such as the medication codes value set or the body site value set.
For the medication codes value set, the definition is as follows:

This definition can be easily translated into Expression Constraint Language (ECL) like this:
<< 410942007 |Drug or medicament (substance)| OR << 373873005 |Pharmaceutical / biologic product (product)| OR << 106181007 |Immunologic substance (substance)|
Evaluated against the latest substrate available (20180131), this expression contains 28928 concepts.
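Outside SNQuery, one possible way to evaluate an expression like this is through a FHIR terminology server, because FHIR defines implicit SNOMED CT value sets of the form http://snomed.info/sct?fhir_vs=ecl/<expression>. The sketch below only builds the $expand request URL. The server address is a placeholder and the function name expand_url is made up for this example; whether a given server accepts ECL and honors count=0 (asking only for the size of the expansion) is an assumption to check against that server.

```python
from urllib.parse import urlencode

# Prefix of the implicit SNOMED CT value set URL for ECL, as defined by FHIR.
SNOMED_IMPLICIT_VS = "http://snomed.info/sct?fhir_vs=ecl/"


def expand_url(base_url: str, ecl: str, count: int = 0) -> str:
    """Build a ValueSet/$expand URL for an ECL expression.

    base_url is a hypothetical FHIR terminology server endpoint.
    """
    implicit_vs = SNOMED_IMPLICIT_VS + ecl
    # urlencode percent-encodes the whole value set URL, including the ECL.
    query = urlencode({"url": implicit_vs, "count": count})
    return f"{base_url.rstrip('/')}/ValueSet/$expand?{query}"


medication_ecl = (
    "<< 410942007 |Drug or medicament (substance)| "
    "OR << 373873005 |Pharmaceutical / biologic product (product)| "
    "OR << 106181007 |Immunologic substance (substance)|"
)

# https://example.org/fhir is a placeholder, not a real server.
print(expand_url("https://example.org/fhir", medication_ecl))
```

The number of concepts returned depends on the SNOMED CT edition and version loaded in the server, so it will not necessarily match the figure given above.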

Lists of codes


Several FHIR subsets are defined as plain sets of codes (such as bodysite relative location). The main problem with this approach is that the subset may be incomplete or simply wrong, because a hand-picked term can have several meanings (e.g. the same term can denote both a procedure and the tissue on which the procedure is performed). These errors are easy to spot by looking at some graphs. For bodysite relative location, the equivalent expression is the following:
419161000 or 419465000 or 51440002 or 261183002 or 261122009 or 255561001 or 49370004 or 264217000 or 261089000 or 255551008 or 351726001 or 352730000






This graph shows where most of the concepts fall. In this case, all bodysite relative locations are qualifier values, which seems correct.
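An expression such as the one for bodysite relative location can be generated mechanically from the code list. The following is a minimal sketch; the function name codes_to_ecl is made up for this example, and the only check it performs is that each code is numeric (it does not verify the SCTID check digit or that the concept exists).

```python
def codes_to_ecl(codes):
    """Join a list of SNOMED CT concept ids into an ECL disjunction."""
    unique = []
    for code in codes:
        code = str(code).strip()
        if not code.isdigit():
            raise ValueError(f"not a numeric concept id: {code!r}")
        if code not in unique:  # drop duplicates, keep the original order
            unique.append(code)
    return " OR ".join(unique)


# First codes of the bodysite relative location list shown above.
print(codes_to_ecl(["419161000", "419465000", "51440002"]))
```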

These kinds of visualizations become more useful the more concepts the subset has. An example is the facility codes subset, which contains 79 concepts.





The last graph can be interpreted as follows: of the 79 concepts contained in the subset, 94% (74 out of 79) are a 'site of care', 4 of the remaining ones are a 'community environment', and a single one is a 'hospital environment'. This kind of visualization raises some questions. Should that single code in the hospital environment subtree refer to itself and all of its allowed children? Could we get away with simplifying the subset to the expression "< 276339004 |Environment (environment)|", which contains all the environments known in SNOMED CT? Should the terms annotated with "--OTHER--NOT LISTED" in the original FHIR subset always be translated into a descendant-or-self operation?
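The percentages behind this reading are a simple tally of how many subset members fall under each branch. As a rough sketch (not how SNQuery computes its graphs, which is not described here), the breakdown can be reproduced from a list with one branch label per concept; the function name and the input list are made up for this example, with only the counts taken from the text.

```python
from collections import Counter


def branch_distribution(branch_labels):
    """Return (branch, count, rounded percentage) tuples, largest branch first."""
    counts = Counter(branch_labels)
    total = sum(counts.values())
    return [
        (branch, n, round(100 * n / total))
        for branch, n in counts.most_common()
    ]


# One label per concept in the facility codes subset (counts from the text).
labels = (
    ["site of care"] * 74
    + ["community environment"] * 4
    + ["hospital environment"] * 1
)

for branch, n, pct in branch_distribution(labels):
    print(f"{branch}: {n} of {len(labels)} ({pct}%)")
```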

Validating subsets

One of the advantages of this approach is that the graphical representation also allows a quick review of the quality of a proposed subset. Take, for example, the specimen collection method subset, which can be defined with the expression constraint:

119295008 |Specimen obtained by aspiration| OR 413651001 |Bioptics| OR 360020006 |Extirpation - action| OR 430823004 |Examination of midstream urine specimen| OR 16404004 |Induced| OR 67889009 |Irrigation| OR 29240004 |Autopsy examination| OR 45710003 |Sputum| OR 7800008 |Punctate| OR 258431006 |Scrapings| OR 20255002 |Blushing| OR 386147002 |Smear procedure| OR 278450005 |Finger stick|

This subset contains 13 concepts and produces the following graph:




One thing that can be quickly seen is that the focus concept (which can be interpreted as the minimum common ancestor) is the SNOMED CT root concept. This means that the concepts in the subset share no common ancestor other than the root. This is a sign that the subset definition has potential problems.
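The minimum common ancestor mentioned above (often called the lowest common ancestor) can be computed from the "is-a" relationships alone. The sketch below is one straightforward way to do it on a hierarchy where a concept may have several parents; it is not SNQuery's implementation, and the tiny hierarchy uses made-up labels rather than real SNOMED CT content.

```python
def ancestors_or_self(concept, parents):
    """Collect the concept and everything reachable through is-a parents."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def lowest_common_ancestors(concepts, parents):
    """Return the most specific ancestors shared by all concepts (non-empty list)."""
    common = set.intersection(
        *(ancestors_or_self(c, parents) for c in concepts)
    )
    # Discard any shared ancestor that lies above another shared ancestor.
    return {
        a for a in common
        if not any(a != b and a in ancestors_or_self(b, parents) for b in common)
    }


# Hypothetical toy hierarchy: child -> list of is-a parents.
toy_parents = {
    "procedure": ["root"],
    "specimen": ["root"],
    "biopsy": ["procedure"],
    "irrigation": ["procedure"],
    "sputum specimen": ["specimen"],
}

print(lowest_common_ancestors(["biopsy", "irrigation"], toy_parents))
print(lowest_common_ancestors(["biopsy", "sputum specimen"], toy_parents))
```

When the only shared ancestor is the root, as in the second call, the members come from unrelated hierarchies, which is the warning sign described above.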

A more detailed analysis of the 12 active concepts shows that:
  • 5 / 12 concepts are in the procedure hierarchy, which seems fitting as we are talking about collection methods.
  • 3 / 12 concepts are in the qualifier value hierarchy.
  • 2 / 12 concepts are in the specimen hierarchy.
  • 1 / 12 concepts is in the substance hierarchy.
  • 1 / 12 concepts is in the observable entity hierarchy.
  • The remaining concept (the 13th) is inactive and shouldn't be used.

Taking a careful look at the concepts outside the procedure hierarchy, some problems become apparent:
  • The concept "45710003 |Sputum (substance)|", which refers to the substance, is used to refer to the method of collection. In this case, the concept "37705003 |Collection of sputum (procedure)|" seems far better suited to the purpose of the subset.
  • The concept "20255002 |Blushing, function (observable entity)|" refers to an observable entity. It seems that the correct concept could be "225063006 | Flushing cannula (procedure) |".
  • Similarly, "258431006 |Scrapings (specimen)|" and "119295008 |Specimen obtained by aspiration (specimen)|" seem unfit for the purpose of the subset. Probably, "56757003 |Scraping (procedure)|" and "14766002 | Aspiration (procedure) |" should be used instead.
  • The concept "386147002 | Smear procedure (procedure) |" was made inactive because it was ambiguous. By reviewing the code in the SNOMED CT Browser (see the RefSet tab) we can see that the code should be "448895004 | Sampling for smear (procedure) |" or "448938001 | Preparation of smear (procedure) |" (or both).
  • Regarding the selected qualifier values, the correct approach is not to add the qualifiers themselves to the set, but to allow the procedures that use them. This can be expressed as the descendants of 71388002 |Procedure (procedure)| whose 260686004 |Method (attribute)| is any of the three qualifier values. Formally, the expression for these qualifier values would be:
< 71388002 |Procedure (procedure)| : 260686004 |Method (attribute)| = (16404004 |Induced (qualifier value)| OR 360020006 |Extirpation - action (qualifier value)| OR 7800008 |Punctate (qualifier value)|)
Note: Even though the Induced and Punctate terms are not currently used as the destination of any attribute in SNOMED CT, this expression constraint makes it possible to validate post-coordinated procedures that use these methods.
In addition to validating the subset, we should again ask ourselves whether some expression could be used to group terms and thereby simplify the definition.

Conclusion 

As the examples show, these subset visualizations can help clinical experts with subset definition, validation, and curation. Extensional subsets may seem easy to define, but they can lead to problems if the hierarchy is not taken into account when defining them. Even if (arguably) one of the advantages of defining extensional subsets is to limit the possible inputs in a form, officially provided subsets should always try to include every term useful for the subset, to ease interoperability. When implementing a subset in a given organization, it is always better to further refine the subset than to extend it with terms not originally included in it.