iKnow

by InterSystems Corporation


The iKnow Natural Language Processing technology was originally developed in Belgium and then acquired by InterSystems in 2010. In February 2020, InterSystems published the technology to open source, expanding the possible use cases for it beyond embedded use from the InterSystems IRIS Data Platform. iKnow is open to community contributions to enhance the engine, language models and interfaces for use in text exploration, information extraction and machine learning use cases.

What's new in this version

This release rolls up a large number of changes applied since the first full v1.0 release:

Extended support for semantic attributes
Many improvements to the language models, especially English, Japanese and Czech
Enhancements to the CI/CD procedures' speed and reliability
Enhancements to user and developer documentation
Various bugfixes to previously reported issues

⚠️ The output format for sentence attributes with property values has changed slightly. See the Compatibility Notes below for details.

Semantic Attributes

The v1.1 release significantly expands iKnow's ability to identify semantic attributes in natural language text, and in particular enhances support for measurements, time and certainty. iKnow now recognizes more markers in the various supported languages and has more accurate expansion rules to identify the affected span within each sentence. Check the wiki for more details on which attributes are supported in which language.

New since v1.0 is the introduction of a Certainty attribute, which has an attribute property expressing the level of certainty. A level of 9 means an expression of absolute certainty and a level of 1 means very low confidence. While you can specify (or override) an initial level of certainty with the attribute marker definition (e.g. in the User Dictionary), rules processing may modify the value, e.g. in the context of a Negation Attribute.

This release also introduces three new Generic attributes, which developers can use to tag use-case-specific attributes not covered by the built-in attribute types. Developers can add their own marker terms for these to leverage attribute expansion to flag syntactically "affected" portions of a sentence. A basic set of expansion rules is included for these generic attributes.
For example, we've helped customers in the healthcare industry add marker terms such as "mother", "brother", etc. so that mentions of "family history" can be identified in the text: "Patient mentioned mother suffered a stroke 10y ago, but denied experiencing chest pain himself"

CI/CD Pipeline

The Continuous Integration / Continuous Deployment pipeline for this repository is implemented through GitHub Actions, and now includes standard unit tests as well as reference tests against a gold standard to ensure the highest quality output.

Compatibility Notes

We made a change to the Sentence attribute structure emitted by the iknowpy module. The fixed set of properties used in v1.0 (value, unit, value2, unit2) has been replaced by a list of pairs, enabling a more flexible way of passing sentence attribute properties:

    struct Sent_Attribute:
           Attribute type "type_"
           size_t offset_start "offset_start_", offset_stop "offset_stop_"
           string marker "marker_"
           string value "value_", unit "unit_", value2 "value2_", unit2 "unit2_"
           Entity_Ref entity_ref
           Path entity_vector

was changed to:

    ctypedef vector[pair[string, string]] Sent_Attribute_Parameters
    struct Sent_Attribute:
           Attribute type "type_"
           size_t offset_start "offset_start_", offset_stop "offset_stop_"
           string marker "marker_"
           Sent_Attribute_Parameters parameters "parameters_"
           Entity_Ref entity_ref
           Path entity_vector

Existing code that reads the old fields can obtain the same values from the new structure as follows:

    sent_attribute['value'] = sent_attribute['parameters'][0][0]
    sent_attribute['unit'] = sent_attribute['parameters'][0][1]
    sent_attribute['value2'] = sent_attribute['parameters'][1][0]
    sent_attribute['unit2'] = sent_attribute['parameters'][1][1]
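
For code that still expects the v1.0 keys, a small adapter along these lines may ease the migration. This is only a sketch and not part of iknowpy: `legacy_view` is a hypothetical helper name, and it assumes that `parameters` holds up to two (value, unit) pairs in that order, as the mapping above suggests. Missing pairs are filled with empty strings.

```python
def legacy_view(sent_attribute):
    """Return a copy of a v1.1 sentence attribute dict with v1.0-style keys added.

    Hypothetical helper, not part of iknowpy.
    """
    # Assumption: each entry of 'parameters' is a (value, unit) pair.
    pairs = list(sent_attribute.get('parameters', []))
    # Pad to two pairs so attributes with fewer properties still work.
    pairs += [('', '')] * (2 - len(pairs))

    result = dict(sent_attribute)
    result['value'], result['unit'] = pairs[0]
    result['value2'], result['unit2'] = pairs[1]
    return result


# Hand-written sample data, not actual engine output.
sample = {'type': 'Measurement', 'marker': '146.5 pounds',
          'parameters': [('146.5', 'pounds')]}
legacy = legacy_view(sample)
print(legacy['value'], legacy['unit'])
```

Returning a copy leaves the original attribute untouched, so old and new code paths can read the same indexing results side by side.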

iKnow

iKnow is a library for Natural Language Processing that identifies entities (phrases) and their semantic context in natural language text in English, German, Dutch, French, Spanish, Portuguese, Swedish, Russian, Ukrainian, Czech and Japanese. It was originally developed by i.Know in Belgium and was acquired by InterSystems in 2010 to be embedded in its Caché and IRIS Data Platform products. InterSystems published the iKnow engine as open source in 2020.

This readme file has everything you need to get started, but make sure you click through to the wiki for more details on any of these subjects.

iKnow
Using iKnow
Understanding iKnow
Building the iKnow Engine
Contributing to iKnow

Using iKnow

From Python

The easiest way to see for yourself what iKnow does with text is by giving it a try! Thanks to our Python interface, that only takes two simple steps:

Use pip to install the iknowpy module as follows:
```
pip install iknowpy
```

From your Python prompt, instantiate the engine and start indexing:

    import iknowpy

    engine = iknowpy.iKnowEngine()

    # show supported languages
    print(engine.get_languages_set())

    # index some text
    text = 'This is a test of the Python interface to the iKnow engine.'
    engine.index(text, 'en')

    # print the raw results
    print(engine.m_index)

    # or make it a little nicer
    for s in engine.m_index['sentences']:
        for e in s['entities']:
            print('<'+e['type']+'>'+e['index']+'</'+e['type']+'>', end=' ')
        print('\n')
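
The nested result structure used above lends itself to small helper functions. The sketch below groups entity text by entity type. `entities_by_type` is a hypothetical helper, and the sample dictionary is hand-written to mimic the `sentences` / `entities` / `type` / `index` keys rather than being real engine output; the type labels in it are assumed.

```python
from collections import defaultdict


def entities_by_type(index_result):
    """Group the 'index' text of every entity by its 'type'."""
    grouped = defaultdict(list)
    for sentence in index_result['sentences']:
        for entity in sentence['entities']:
            grouped[entity['type']].append(entity['index'])
    return dict(grouped)


# Hand-written stand-in for engine.m_index; type labels are assumed.
sample_index = {
    'sentences': [
        {'entities': [
            {'type': 'Concept', 'index': 'belgian geuze'},
            {'type': 'Relation', 'index': 'is well-known across'},
            {'type': 'Concept', 'index': 'continent'},
        ]},
    ],
}

for entity_type, values in entities_by_type(sample_index).items():
    print(entity_type, values)
```

With the real engine, passing `engine.m_index` after a call to `engine.index(...)` should work the same way, provided the keys match.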

If you are looking for another programming language or interface, check out the other APIs. For more on the Python interface, move on to the Getting Started section in the wiki!

From C++

The main C++ API file is engine.h, defining the class iKnowEngine with the main entry point:

index(TextSource, language)

After indexing, all data is stored in iknowdata::Text_Source m_index, where “iknowdata” is the namespace used for all classes that contain output data. For more details, please refer to the API overview on the wiki.

From InterSystems IRIS

For many years, the iKnow engine has been available as an embedded service on the InterSystems IRIS Data Platform. The obvious advantage of packaging it with a database is that indexing results from many documents can be stored in a single repository, enabling corpus-wide analytics through practical APIs. See the iKnow documentation for IRIS or browse the InterSystems Developer Community’s articles on setting up an iKnow domain, browsing it and using iFind (iKnow-powered text search).

The InterSystems IRIS Community Edition is available from Docker Hub free of charge.

From Different Platforms

Since version 1.3, a C-interface is available, enabling communication with the iKnow engine in a JSON encoded request/response style:

const char* j_response;
iknow_json(R"({"method" : "index", "language" : "en", "text_source" : "Hello World"})", &j_response);

Most API functionality is available in a serialized JSON format.
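
Because requests are plain JSON strings, any language with a JSON library can prepare them. As an illustration only, the Python sketch below builds the same request shown in the C snippet. `build_index_request` is a hypothetical helper, and the code only constructs the payload; it does not call the engine.

```python
import json


def build_index_request(text, language):
    """Serialize an 'index' request using the field names from the C example."""
    return json.dumps({
        'method': 'index',
        'language': language,
        'text_source': text,
    })


request = build_index_request('Hello World', 'en')
print(request)
```

Using a JSON library rather than string concatenation keeps quotes and other special characters in the text properly escaped.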

Understanding iKnow

Entities

iKnow identifies phrase boundaries that define Entities, entirely based on the syntactic structure of the sentences, rather than relying on an upfront dictionary or pretrained model. This makes iKnow well-suited for initial exploration of a new corpus.
iKnow Entities are not Named Entities in the NER sense, but rather the word groups that need to be considered together, representing a concept or relationship as coined by the text author in its entirety. The following examples clearly show the importance of this phrase level to fully capture what the author meant:

| iKnow Entity | Meaning |
| --- | --- |
| Dopamine | small molecule |
| Dopamine receptor | drug target |
| Dopamine receptor antagonist | chemical drug |
| Dopamine receptor gene | gene, molecular sequence |
| Dopamine receptor gene mutation | physiological process |

iKnow will label every entity with a simple role that is either concept (usually corresponding to Noun Phrases in POS lingo) or relation (verbs, prepositions, …). Typical stop words that have little meaning of their own get categorized as PathRelevant (e.g. pronouns) or NonRelevant parts, depending on whether they play a role in the sentence structure or are just linguistic fodder.

In the following sample sentence, we’ve highlighted concepts, relations and PathRelevants separately.

Belgian geuze is well-known across the continent for its delicate balance.

CRCs

As of v1.4, the iKnow engine also produces Concept-Relation-Concept clusters (also known as CRCs).
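
To make the idea concrete, the following sketch derives Concept-Relation-Concept triples from an ordered list of entities with a naive sliding window. It only illustrates what a CRC is; it is not the engine's algorithm or its output format. `naive_crcs`, the type labels and the sample entities are all assumptions made for this example.

```python
def naive_crcs(entities):
    """Return (concept, relation, concept) triples from consecutive entities."""
    triples = []
    # Slide a window of three consecutive entities over the list.
    for left, middle, right in zip(entities, entities[1:], entities[2:]):
        kinds = (left['type'], middle['type'], right['type'])
        if kinds == ('Concept', 'Relation', 'Concept'):
            triples.append((left['index'], middle['index'], right['index']))
    return triples


# Hand-written entities, not actual engine output.
sample_entities = [
    {'type': 'Concept', 'index': 'patient'},
    {'type': 'Relation', 'index': 'suffered'},
    {'type': 'Concept', 'index': 'stroke'},
    {'type': 'Relation', 'index': 'after'},
    {'type': 'Concept', 'index': 'surgery'},
]

print(naive_crcs(sample_entities))
```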

Attributes

Beyond this simple phrase recognition, iKnow also captures the context of these entities through semantic attributes. Attributes label spans (of entities) within a sentence that share a semantic context. Most attributes start from a marker term and are then, through linguistic rules, expanded left and right as appropriate per the syntactic structure of the sentence. iKnow’s main contribution is in this fine-grained expansion, which has been shown to be more accurate than many ML-based techniques.

iKnow supports the following attribute types:

Negation: iKnow tags all entities participating in a negation, as opposed to an (implied) affirmative context.

After discussing his nausea, the [patient didn’t report suffering from chest pain, shortness of breath or tickling].

Sentiment: based on a user-supplied list of marker terms, iKnow will identify spans with either a positive or negative sentiment (through separate attributes). Overlapping negation attributes will reverse the sentiment in some language models.

[I liked the striped pajamas], but the [slippers didn’t really fit with it].

Measurements, Time, Frequency and Duration: all entities “participating” in an expression of something measurable or time-related will be tagged, enabling efficient recognition of facts in long stretches of natural language text.

Upon exam [two weeks ago] the [patient’s weight was 146.5 pounds].

Certainty: this attribute is a work in progress. See the corresponding wiki section for more details.

Some attributes are not available for all languages yet. See the wiki section for more details.
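
As a rough sketch of how an attribute's expanded span might be reassembled from indexing results, the helper below looks up the entities listed in an attribute's `entity_vector`. The field names follow the `Sent_Attribute` structure shown in the Compatibility Notes, but treating `entity_vector` as a list of positions in the sentence's `entities` list is an assumption, as are the `sent_attributes` key, the helper name `attribute_span` and the hand-written sample data.

```python
def attribute_span(sentence, attribute):
    """Return the text of the entities covered by one sentence attribute."""
    entities = sentence['entities']
    # Assumption: entity_vector holds positions into the 'entities' list.
    return [entities[i]['index'] for i in attribute['entity_vector']]


# Hand-written sample, not actual engine output.
sample_sentence = {
    'entities': [
        {'type': 'Concept', 'index': 'patient'},
        {'type': 'Relation', 'index': "didn't report"},
        {'type': 'Concept', 'index': 'chest pain'},
    ],
    'sent_attributes': [
        {'type': 'Negation', 'marker': "didn't", 'entity_vector': [0, 1, 2]},
    ],
}

for attr in sample_sentence['sent_attributes']:
    print(attr['type'], attribute_span(sample_sentence, attr))
```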

How it works

Some InterSystems-era resources on how iKnow works:

A recent introductory video
A not-so-recent playlist on our video channel
A fun animation of our unique bottom-up approach in Japanese and Russian (English version embedded here)

Building the iKnow Engine

The source code for the iKnow engine is written in C++ and includes .sln files for building with Microsoft Visual Studio 2019 Community Edition and Makefiles for building in Linux/Unix.

Please refer to this wiki page for more on the overall build process.

Contributing to iKnow

You are welcome to contribute to iKnow’s engine code and language models. Check out the Wiki for more details on how they work and the Issues and Projects sections for any particular work on the horizon.

Made with

Python


Version

1.1.0 (05 Jul, 2021)