Standards Mania: 2018

Wednesday, May 9, 2018

Using SNQuery to test FHIR subsets

When defining Snomed CT subsets, the most common approach is to define them extensionally, that is, as an explicit list of codes. This is also the case for subsets published by national bodies and for those defined in FHIR specifications. Extensional subset definitions have several potential problems, such as leaving relevant concepts out of the subset, or homographs that lead to the wrong concept being chosen. The use of Snomed Expression Constraints for subset definitions is a missed opportunity. We will use SNQuery to demonstrate how subsets can be created, or existing ones validated, in an easy way.

Subsets with logical definition (is-a)

The first example covers subsets that already contain a logical definition based on "is-a" relationships, such as the medication codes value set or the body site value set.
For the medication codes value set, the definition is as follows:

Include codes from http://snomed.info/sct where concept is-a 410942007 (Drug or medicament)

Include codes from http://snomed.info/sct where concept is-a 373873005 (Pharmaceutical / biologic product)

Include codes from http://snomed.info/sct where concept is-a 106181007 (Immunologic substance)

This definition can be easily translated to the Expression Constraint Language like this:

<< 410942007 |Drug or medicament (substance)| OR << 373873005 |Pharmaceutical / biologic product (product)| OR << 106181007 |Immunologic substance (substance)|

Evaluated against the latest substrate available (20180131), this expression contains 28928 concepts.
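Because the translation is mechanical, it can be scripted. The following is a minimal Python sketch, assuming the FHIR "is-a" filters have already been extracted into (code, term) pairs. The function name isa_filters_to_ecl is hypothetical and is not part of SNQuery or of any FHIR library.

```python
def isa_filters_to_ecl(filters):
    """Join FHIR 'is-a' filters into a single ECL expression.

    Each filter is a (concept_id, term) pair. The ECL operator '<<'
    (descendant or self) is used because a FHIR 'is-a' filter
    includes the concept itself.
    """
    parts = [f"<< {code} |{term}|" for code, term in filters]
    return " OR ".join(parts)


# The three filters of the medication codes value set quoted above.
medication_filters = [
    ("410942007", "Drug or medicament (substance)"),
    ("373873005", "Pharmaceutical / biologic product (product)"),
    ("106181007", "Immunologic substance (substance)"),
]

ecl = isa_filters_to_ecl(medication_filters)
print(ecl)
```

The printed string is the same expression constraint shown above, ready to be pasted into a query tool.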

Lists of codes

Several FHIR subsets are defined as plain sets of codes (such as bodysite relative location). The main problem with this approach is that subsets may be incomplete or even wrong, as hand-picked terms can have several meanings (e.g. the same word can be used for a procedure and for the tissue on which the procedure is performed). These errors are easily revealed by looking at some graphs. In the case of the bodysite relative location, the equivalent expression is the following one:

419161000 or 419465000 or 51440002 or 261183002 or 261122009 or 255561001 or 49370004 or 264217000 or 261089000 or 255551008 or 351726001 or 352730000

This graph shows where most of the concepts fall. In this case, all bodysite relative locations are qualifier values, which seems correct.

These kinds of visualizations become more useful the more concepts the subset has. An example is the facility codes subset, which contains 79 concepts.

The last graph can be interpreted as follows: of the 79 concepts contained in the subset, 94% (74 out of 79) are a 'site of care', 4 of the remaining ones are a 'community environment' and a single one is a 'hospital environment'. This kind of visualization raises some questions. Should that single code in the hospital environment subtree refer to itself and all of its allowed children? Could we get away with simplifying the subset to the expression "< 276339004 |Environment (environment)|", which contains all the environments known in Snomed? Should the terms annotated in the original FHIR subset with "--OTHER--NOT LISTED" always be translated into a descendant-or-self operation?
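The figures behind such a graph are a simple tally. As a rough sketch in Python, assuming each concept in the subset has already been mapped to the ancestor it falls under (for example by a query tool), the breakdown could be computed as follows. The function name and the placeholder concept ids are made up for illustration.

```python
from collections import Counter


def hierarchy_breakdown(ancestor_by_concept):
    """Return {ancestor: (count, rounded percentage)} for a subset.

    ancestor_by_concept maps every concept in the subset to the
    ancestor it falls under; how that mapping is obtained is out of
    scope for this sketch.
    """
    counts = Counter(ancestor_by_concept.values())
    total = len(ancestor_by_concept)
    return {
        ancestor: (n, round(100 * n / total))
        for ancestor, n in counts.most_common()
    }


# Placeholder ids (c0..c78) standing in for the 79 facility codes.
labels = (
    ["site of care"] * 74
    + ["community environment"] * 4
    + ["hospital environment"]
)
facility_subset = {f"c{i}": label for i, label in enumerate(labels)}

breakdown = hierarchy_breakdown(facility_subset)
for ancestor, (n, pct) in breakdown.items():
    print(f"{ancestor}: {n} of {len(facility_subset)} ({pct}%)")
```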

Validating subsets

One of the advantages of this approach is that the graphical representation also allows for a quick review of the quality of the proposed subset. As an example, take the specimen collection method subset, which can be defined with the expression constraint

119295008 |Specimen obtained by aspiration| OR 413651001 |Bioptics| OR 360020006 |Extirpation - action| OR 430823004 |Examination of midstream urine specimen| OR 16404004 |Induced| OR 67889009 |Irrigation| OR 29240004 |Autopsy examination| OR 45710003 |Sputum| OR 7800008 |Punctate| OR 258431006 |Scrapings| OR 20255002 |Blushing| OR 386147002 |Smear procedure| OR 278450005 |Finger stick|

This subset contains 13 concepts, and shows the following graph:

One thing that can be quickly seen is that the focus concept (which can be interpreted as the lowest common ancestor) is the Snomed CT root concept. This means that the concepts in the subset have no common parent other than the Snomed root concept, which is a sign that the subset definition has potential problems.
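To illustrate the idea of a focus concept, here is a small Python sketch that computes the most specific common ancestors of a subset over a toy is-a graph. The graph uses simplified, made-up labels rather than real Snomed CT relationships, and the function names are hypothetical; how SNQuery actually computes its focus concept is not described here.

```python
def ancestors_or_self(concept, parents):
    """All concepts reachable by following is-a links upwards, plus the concept."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def focus_concepts(subset, parents):
    """Most specific ancestors shared by every concept in the subset."""
    common = set.intersection(
        *(ancestors_or_self(c, parents) for c in subset)
    )
    # Drop every shared ancestor that lies above another shared ancestor.
    return {
        c for c in common
        if not any(
            c != other and c in ancestors_or_self(other, parents)
            for other in common
        )
    }


# Toy is-a graph (child -> list of parents); a concept may have several parents.
parents = {
    "procedure": ["root"],
    "substance": ["root"],
    "specimen collection": ["procedure"],
    "collection of sputum": ["specimen collection"],
    "scraping": ["specimen collection"],
    "sputum": ["substance"],
}

same_hierarchy = focus_concepts({"collection of sputum", "scraping"}, parents)
mixed_hierarchies = focus_concepts({"collection of sputum", "sputum"}, parents)
print(same_hierarchy, mixed_hierarchies)
```

In this toy graph the two procedures share "specimen collection" as their focus, while mixing a procedure with a substance pushes the focus all the way up to "root", which is the warning sign described above.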

A more detailed analysis shows that:

5 / 12 concepts are in the procedure hierarchy, which seems fitting as we are talking about collection methods.
3 / 12 concepts are in the qualifier value hierarchy
2 / 12 concepts are in the specimen hierarchy
1 / 12 concept is in the substance hierarchy
1 / 12 concept is in the observable entity hierarchy
1 concept is inactive and shouldn't be used

Taking a careful look at the concepts that are not in the procedure hierarchy, some problems become apparent:

The concept "45710003 |Sputum (substance)|", which refers to the substance, is used to refer to the method of collection. In this case, the concept "37705003 |Collection of sputum (procedure)|" seems far more fitting to the purpose of the subset.
The concept "20255002 |Blushing, function (observable entity)|" refers to an observable entity. It seems that the correct concept could be "225063006 | Flushing cannula (procedure) |".
Similarly, "258431006 |Scrapings (specimen)|" and "119295008 |Specimen obtained by aspiration (specimen)|" seem unfit for the purpose of the subset. Probably, "56757003 |Scraping (procedure)|" and "14766002 | Aspiration (procedure) |" should be used instead.
The concept "386147002 | Smear procedure (procedure) |" was made inactive because it was ambiguous. By reviewing the code in the Snomed Browser (see the RefSet tab) we can see that the code should be "448895004 | Sampling for smear (procedure) |" or "448938001 | Preparation of smear (procedure) |" (or both).
Regarding the selected qualifier values, the correct approach is not to add these qualifiers to the set, but to allow the procedures that use them. This can be expressed as any 71388002|Procedure (procedure)| whose 260686004 |Method (attribute)| is any of the three qualifier values. Formally, the expression for these qualifier values would end up as:

<71388002|Procedure (procedure)|:260686004 |Method (attribute)|=(16404004 |Induced (qualifier value)| OR 360020006 |Extirpation - action (qualifier value)| OR 7800008 |Punctate (qualifier value)|)

Note: Even if the Induced and Punctate terms are not currently used as the destination of any attribute in Snomed CT, this expression constraint makes it possible to validate post-coordinated procedures that use these methods.
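The corrections discussed in this review can be assembled into a revised expression. The following Python sketch only illustrates that bookkeeping: it applies the replacements suggested above, keeps the codes that were not flagged, and swaps the three qualifier values for the Method refinement. The function name revise_subset is hypothetical, and the result is a starting point for discussion, not an official value set.

```python
# The 13 codes of the original specimen collection method subset.
original = [
    "119295008", "413651001", "360020006", "430823004", "16404004",
    "67889009", "29240004", "45710003", "7800008", "258431006",
    "20255002", "386147002", "278450005",
]

# Replacements suggested in the review (old code -> candidate procedures).
replacements = {
    "45710003": ["37705003"],    # Sputum -> Collection of sputum
    "20255002": ["225063006"],   # Blushing -> Flushing cannula
    "258431006": ["56757003"],   # Scrapings -> Scraping
    "119295008": ["14766002"],   # Specimen obtained by aspiration -> Aspiration
    "386147002": ["448895004", "448938001"],  # inactive Smear procedure
}

# Induced, Extirpation - action, Punctate.
qualifiers = ["16404004", "360020006", "7800008"]


def revise_subset(codes, replacements, qualifiers):
    """Apply the replacements and turn the qualifiers into a refinement."""
    revised = []
    for code in codes:
        if code in qualifiers:
            continue  # covered by the Method refinement below
        revised.extend(replacements.get(code, [code]))
    method_values = " OR ".join(qualifiers)
    refinement = (
        "(< 71388002 |Procedure (procedure)| : "
        f"260686004 |Method (attribute)| = ({method_values}))"
    )
    return " OR ".join(revised + [refinement])


revised_ecl = revise_subset(original, replacements, qualifiers)
print(revised_ecl)
```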

In addition to validating the subset we should also ask ourselves again if there is some expression that could be used to group terms, and by doing so, simplify the expression.

Conclusion

As the examples show, visualizing these subsets could help clinical experts in subset definition, validation, and curation. Extensional subsets may seem easy to define, but they can lead to problems if the hierarchy is not taken into account when defining them. Even if (arguably) one of the advantages of defining extensional subsets is to limit the possible inputs in a form, officially provided subsets should always try to include each and every term useful for the subset, in order to ease interoperability. When implementing a subset in a given organization, it is always better to further refine it than to extend it with terms not originally included in it.

Tuesday, January 23, 2018

Having fun with Snomed expression constraints (and learning something in the meantime)

This article is meant to be a fun introduction to the Snomed expression constraint language that shows off its capabilities. It assumes no prior knowledge of the expression constraint language, so it starts with a short introduction to it (if you already know the Snomed expression constraint language, it should be safe to jump to point 2). In this post both the IHTSDO Snomed browser and VeraTech SNQuery will be used.

Snomed Expression Constraint Language basics

In this section a few operators from the Snomed Expression Constraint Language will be explained. For a complete explanation of the Snomed Expression Constraint Language visit the official documentation.

Simple expression constraints

The following simple operators already provide great functionality for querying the Snomed hierarchy:

Descendant of: The constraint is satisfied by all the transitive descendants of a given Snomed concept. This is denoted by the 'less than' operator (<). For example, the expression
< 64572001 | Disease (disorder) | returns about 74k concepts, including concepts such as Anemia, Hematoma, or Inflammatory fibroid polyps of stomach.

Descendant or self: Similar to the 'descendant of' operator, the 'descendant or self' operator (denoted by two 'less than' symbols, <<) is satisfied by all the transitive descendants of a given Snomed concept plus the concept itself. For example, << 11466000 |Cesarean section (procedure)| includes both the descendants of cesarean section and the cesarean section concept itself.

There are more simple operators, but probably these two are the most used by far.
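The difference between the two operators is easy to see on a toy hierarchy. The Python sketch below uses made-up labels rather than actual Snomed CT content, and its function names are illustrative only.

```python
def descendants(concept, children):
    """Transitive descendants of a concept, like ECL '< concept'."""
    found = set()
    stack = [concept]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def descendants_or_self(concept, children):
    """Transitive descendants plus the concept itself, like ECL '<< concept'."""
    return descendants(concept, children) | {concept}


# Toy hierarchy (parent -> direct children).
children = {
    "disease": ["anemia", "hematoma"],
    "anemia": ["iron deficiency anemia"],
}

print(sorted(descendants("disease", children)))
print(sorted(descendants_or_self("disease", children)))
```

The first call leaves "disease" out of the result; the second one includes it.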

Refinements

A refinement in a Snomed expression allows filtering the resulting set using one or more attribute constraints.

One of the great things about Snomed is that terms themselves can be defined by refining existing terms (see the Snomed compositional grammar). E.g. Hepatitis A (40468003) is a disease (64572001) found (363698007) at the liver structure (10200004) with an inflammation (23583003) morphology (116676008) caused by (246075003) Hepatitis A virus (32452004). Note that these expressions contain not only clinical terms such as hepatitis A or disease, but also attributes such as associated morphology and finding site, which have their own Snomed codes. These attributes can be used to refine defined sets.

Attribute refinement restricts the set of clinical meanings to those satisfying the refinement condition. As in the Snomed compositional grammar, a colon (':') is used in the expression.

As an example, pulmonary diseases could be defined as <64572001 |Disease (disorder)| : 363698007 |Finding site| = 39607008 |Lung structure (body structure)|

Note that these attributes have a "direction": the above expression returns all the diseases whose finding site is the lung structure. There will be times when we want to select the target term of a relationship and constrain the source. We can achieve this by using the 'reverse' operator ('R'). E.g. with <64572001 |Disease (disorder)| : 246075003 |Causative agent| = 49872002 |Virus (organism)| we can obtain all the diseases caused by viruses, and by reversing the attribute with an expression such as <49872002 |Virus (organism)| : R 246075003 |Causative agent| = <64572001 |Disease (disorder)| the subset of the viruses that cause diseases can be obtained.
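The direction of a refinement can be mimicked with a handful of (source, attribute, destination) triples. The following Python sketch is a toy model with illustrative labels and hypothetical function names; it is not how SNQuery or any terminology server evaluates expressions.

```python
# Toy relationships as (source, attribute, destination) triples.
relationships = [
    ("hepatitis A", "causative agent", "hepatitis A virus"),
    ("influenza", "causative agent", "influenza virus"),
    ("bacterial pneumonia", "causative agent", "bacterium"),
]

diseases = {"hepatitis A", "influenza", "bacterial pneumonia"}
viruses = {"hepatitis A virus", "influenza virus", "unassociated virus"}


def refine(focus, attribute, values, relationships):
    """focus : attribute = values
    Keeps the members of focus that are the SOURCE of a matching relationship."""
    return {
        s for s, a, d in relationships
        if s in focus and a == attribute and d in values
    }


def refine_reverse(focus, attribute, sources, relationships):
    """focus : R attribute = sources
    Keeps the members of focus that are the DESTINATION of a matching relationship."""
    return {
        d for s, a, d in relationships
        if d in focus and a == attribute and s in sources
    }


viral_diseases = refine(diseases, "causative agent", viruses, relationships)
disease_causing_viruses = refine_reverse(
    viruses, "causative agent", diseases, relationships
)
print(sorted(viral_diseases))
print(sorted(disease_causing_viruses))
```

The forward refinement answers "which diseases are caused by a virus?", while the reversed one answers "which viruses cause a disease?", so the virus without any relationship is filtered out.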

There are more ways to refine a Snomed expression, but with these basic ones we can start 'playing' with Snomed.

Having Fun

With these operators and refinements in mind, we can start navigating Snomed without a deep knowledge of the underlying Snomed conceptual model, i.e. what attributes are valid in each hierarchy.

Finding what to look for

We defined an application that needed a list of cancer diagnoses (coded in ICD-10), but also the location where each cancer was found. Could we use Snomed to provide us with a (tentative) set of terms to fill this field?

Even if you don't know the Snomed conceptual model, you probably know examples of what you are looking for. I will use 'lung cancer' as an example.

Navigating Snomed

Searching for the term in the Snomed browser allows us to dig into the different terms that make up its meaning. 'Lung cancer' is a synonym of 'Primary malignant neoplasm of lung (disorder)', with code 93880001.

In the concept details we can examine the 'Expression' tab and then look at 'Expression from Stated Concept Definition'. That expression precisely defines the Snomed term from other existing Snomed terms. Here we want to know which attributes are valid for the kind of concept we are dealing with, in our case 'disorders'. By that definition, lung cancer (93880001) is a disease (64572001) found (363698007) at the lung structure (39607008) with a malignant neoplasm (86049000) morphology (116676008). We can generalize that expression by navigating the Snomed hierarchy. For example, we could ask for all the primary cancers that have a finding site in any body structure: <64572001 |Disease (disorder)| :{363698007 |Finding site| = <123037004 | Body structure (body structure) |, 116676008 |Associated morphology| = 86049000 |Malignant neoplasm, primary (morphologic abnormality)|} which results in a subset of ~600 terms.
An alternative is to look for all primary, secondary, or other cancers with a finding site in any body structure: <363346000 |Malignant neoplastic disease (disorder)| :363698007 |Finding site| = <123037004 | Body structure (body structure) | which contains ~3700 terms.

In addition to giving us this subset list, SNQuery allows us to simplify expressions in order to reduce their processing time. The simplified queries return the same terms (the same subset) but use more specific codes, which makes the expression more precise and clearer. As an example, the expression <64572001 |Disease (disorder)| :{363698007 |Finding site| = <123037004 | Body structure (body structure) |, 116676008 |Associated morphology| = 86049000 |Malignant neoplasm, primary (morphologic abnormality)|} can be simplified as <372087000 |Primary malignant neoplasm (disorder)|:{363698007 |Finding site| = <123037004 |Body structure (body structure)|, 116676008 |Associated morphology| = 86049000 |Malignant neoplasm, primary (morphologic abnormality)|}. The first expression needs more than 5 seconds to be processed, while the second one takes about 150 milliseconds.

Going in reverse

Once we have found a suitable expression we can reverse it to get the desired results. In our case, instead of looking for 'all primary, secondary, or other cancers with a finding site in any body structure' we will reverse it to express 'all body structures that are finding sites in primary, secondary, or other cancers'. This subset can be expressed as <91723000 |Anatomical structure (body structure)|: R 363698007 |Finding site (attribute)| =<363346000 |Malignant neoplastic disease (disorder)| and contains ~880 terms.

We could use this subset list as a first approach to populate our user interface and add Snomed codes into the mix. If we have an ICD code, we could potentially use the official Snomed mapping to validate the other fields (in this case, the diagnosis with its location).
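One possible way to feed such a list into a user interface is through a FHIR terminology server, since FHIR defines implicit Snomed CT value sets of the form http://snomed.info/sct?fhir_vs=ecl/[expression]. The Python sketch below only builds the request URL for a ValueSet/$expand call. The server base URL is a placeholder, the function name is hypothetical, and whether a given server accepts every ECL feature (such as the reverse flag) is an assumption that has to be checked against that server.

```python
from urllib.parse import urlencode

# Placeholder base URL; replace it with a real FHIR terminology server.
FHIR_BASE = "https://terminology.example.org/fhir"


def expand_url(ecl, count=20):
    """Build a ValueSet/$expand URL for an implicit Snomed CT ECL value set."""
    implicit_value_set = "http://snomed.info/sct?fhir_vs=ecl/" + ecl
    # urlencode percent-encodes the value set URL so it is safe as a query value.
    query = urlencode({"url": implicit_value_set, "count": count})
    return f"{FHIR_BASE}/ValueSet/$expand?{query}"


body_sites_ecl = (
    "<91723000 |Anatomical structure (body structure)|: "
    "R 363698007 |Finding site (attribute)| = "
    "<363346000 |Malignant neoplastic disease (disorder)|"
)

url = expand_url(body_sites_ecl)
print(url)
```

The count parameter limits how many concepts come back per request, which is convenient for a picker that only shows a handful of suggestions at a time.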