Tags: Artificial Intelligence, AI, Entity Graph, Knowledge Graph, Machine Learning, NLP

Entity Intelligence is the missing link in AI

Automatic identification and resolution of entities within unstructured data sources is crucial to understanding and utilizing data for use in AI systems. Historically this has been difficult to do, and even harder to trust the results. Agolo’s hybrid, human-in-the-loop approach for discovering and compiling entity intelligence, ensures that its best-of-breed, entity graph technology delivers trustworthy, production grade outputs for mission-critical AI use cases.

Terry Busch - Former Technical Director, Machine-Assisted Analytic Rapid-Repository System (MARS), Defense Intelligence Agency (DIA)
Feb 20, 2024

Introduction

Why is data accuracy so paramount to defense and intelligence? One of the most sacred pillars of intelligence is that the US government must stand by, and take responsibility for, our assessments. Our errors can be costly. As we continue to move toward a more autonomous world where we have a real opportunity to improve the way we conduct intelligence, we face a continuing challenge: how do we derive content from unstructured data without taking risks in our assessments? Here, we cannot rely on good statistical probability alone. We have to use all of our capacity to ensure we're providing a comprehensive and thorough understanding of our data. I'm not saying we have to be 100% accurate; that's nearly impossible. But to be effective we must reach a point where we trust technology to be accurate. Even though we've made advances, we're not there yet.

Taming unstructured data

Unstructured data contains some of the most important, and often most problematic, sources for AI systems. New sources of data typically require extraordinary and costly data management practices. Until recently, wrangling and extracting valuable information at scale from unstructured data took expensive and laborious effort. Our historical data tends to lack provenance and metadata and is replete with inconsistencies.

In these scenarios, we typically rely on very busy domain subject matter experts (SMEs) to design domain-specific ontologies, schemas, and controlled vocabularies. Unfortunately, I've found these solutions to be time-consuming, static, and inflexible. This leads to a great deal of error when performing extraction routines. The F-scores, our best way to measure precision and recall, have failed to come close to the standard we need for intelligence analysis. None of these approaches were bad per se. In fact, they all got us closer, but in the end, they likely achieved little more than improving our ability to search data meaningfully.
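For reference, the F-score in question is the harmonic mean of precision and recall. A minimal sketch of how it is computed from raw extraction counts (the function and the example numbers are purely illustrative):

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Compute the F1 score, the harmonic mean of precision and recall, from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Example: an extractor that finds 80 correct entities, invents 20 spurious ones,
# and misses 40 real ones scores roughly 0.73, well short of what analysis demands.
print(f1_score(true_positives=80, false_positives=20, false_negatives=40))
```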

Natural language processing approaches tend to be difficult to implement as well. Most of the text-based information I've encountered contains different writing or speaking styles, typos, and incomplete, ambiguous, or poorly expressed ideas. Take, for example, all the ways people can refer to U.S. President Joe Biden: Joseph Biden, Joseph, Biden, Joe, Scranton Joe, he, him, the President, the Commander in Chief, Джо Байден, 조 바이든, 乔·拜登. Aliases, nicknames, code names, pronouns, and incompatible writing systems (e.g., Latin vs. Cyrillic vs. logographic scripts) all compound the challenge of autonomous information extraction. Previous generations of natural language processing tools were brittle, overly simplistic, and did a poor job of handling noisy data. Ultimately, our F-scores, when applied to early NLP solutions, didn't move the needle toward the answer. NLP may have improved our search technology, but it was not the holy grail. The great LLM revolution of 2023 is improving our ability to mine texts, but it requires huge investment and typically a mind-blowing amount of data to improve scoring.
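To make the alias problem concrete, here is a minimal sketch of canonicalizing varied surface forms against a curated alias table; the table, the canonical ID, and the matching logic are invented for illustration and are far simpler than real entity resolution, which also has to handle pronouns and never-before-seen aliases:

```python
import unicodedata

# Illustrative alias table: surface forms across scripts and styles,
# each mapped to a single canonical entity ID.
ALIASES = {
    "joseph biden": "entity:joe_biden",
    "joe biden": "entity:joe_biden",
    "scranton joe": "entity:joe_biden",
    "the commander in chief": "entity:joe_biden",
    "джо байден": "entity:joe_biden",
    "조 바이든": "entity:joe_biden",
    "乔·拜登": "entity:joe_biden",
}

def canonicalize(mention):
    """Normalize a raw mention and look it up; returns a canonical ID or None if unknown."""
    normalized = unicodedata.normalize("NFKC", mention).casefold().strip()
    return ALIASES.get(normalized)

for mention in ["Joe Biden", "Scranton Joe", "Джо Байден", "an unseen codename"]:
    print(mention, "->", canonicalize(mention))
```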

Beyond these approaches, we’ve tagged things, we’ve tried synthetic data, we’ve battled with the challenges of confirmation bias and model drift introduced by traditional methods. We’ve spent millions trying to coerce our unstructured data into structured data, resulting in countless costly false positives and negatives. I have the scars, and I’ve also had success when using more structured data and other AI-based technologies. Finding a tank on a satellite image works. The problem is, that same model will go right past the UFO.

All things considered, the time is right for a fresh approach.

Unsupervised vs supervised classification: why not both?

Human/machine interaction is critical to shape and improve results. In the past, there have been problems with trust and accuracy surrounding unsupervised classification. Unsupervised classification is really good at finding clusters in our unstructured text, but those clusters are often meaningless or inadvertent. At the same time, supervised classification has been historically slow, requiring a strong human hand, and has failed to fully leverage technological advances. Many times, supervised classification has over-rotated away from, and ultimately rejected, newer technologies.

When problems are unavoidable with either approach on its own, why not use both? A hybrid solution is an interesting and clearer path forward. It allows us to bring perspective from our organizing technologies (ontologies, tagging, etc.) and, simultaneously, lets the machine take a stab. The comparative results lend themselves well to long-proven statistical analysis tools, where we get to test and retest for accuracy and then improve our models. Hybrid classification models aren't new, but they aren't used as heavily on unstructured data as they should be.
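As a toy illustration of the hybrid idea (this assumes scikit-learn and is in no way Agolo's implementation), the sketch below lets an unsupervised clusterer propose groupings over a handful of documents, trains a supervised model on analyst-provided labels, and scores the result with a familiar metric so it can be tested and retested:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy corpus: the machine clusters it, while an analyst labels each example.
docs = [
    "missile battery spotted near the border",
    "armored column moving toward the coast",
    "quarterly earnings beat analyst expectations",
    "central bank raises interest rates again",
    "new air defense radar installed at the base",
    "stock market rallies on inflation data",
]
human_labels = [1, 1, 0, 0, 1, 0]  # 1 = military activity, 0 = financial news

features = TfidfVectorizer().fit_transform(docs)

# Unsupervised pass: let the machine take a stab at structure it finds on its own.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Supervised pass: train on the human labels, then compare both views with a known metric.
supervised = LogisticRegression().fit(features, human_labels)
predictions = supervised.predict(features)

print("clusters proposed by the machine:", list(clusters))
print("supervised F1 against human labels:", f1_score(human_labels, predictions))
```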

While an AI-powered automated system can perform many entity disambiguation and entity context extraction processes, the most powerful, advanced systems enable “human-in-the-loop” capabilities to augment the automated process. With it, we can maintain our existing approaches and continuously improve our model and the output.

Human experts, for instance, can merge entities into or split them from other entities, confirm entity aliases and more. Likewise, subject matter experts may also establish “authoritative entities.” This last step, with humans in the loop, is critical to ensuring accuracy. And with every instance of human curation, the system gets smarter, so it won’t make the same mistakes when it ingests subsequent data.
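A minimal sketch of what such curation operations might look like over a simple entity record; the class and method names are hypothetical and only illustrate merging, splitting, alias confirmation, and marking an entity authoritative:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: str
    aliases: set[str] = field(default_factory=set)
    confirmed_aliases: set[str] = field(default_factory=set)
    authoritative: bool = False            # set only by a human curator
    sources: list[str] = field(default_factory=list)  # provenance for traceability

class EntityStore:
    def __init__(self):
        self.entities = {}

    def merge(self, keep_id: str, remove_id: str) -> Entity:
        """Curator decides two records describe the same real-world entity."""
        keep, remove = self.entities[keep_id], self.entities.pop(remove_id)
        keep.aliases |= remove.aliases
        keep.confirmed_aliases |= remove.confirmed_aliases
        keep.sources += remove.sources
        return keep

    def split(self, source_id: str, new_id: str, aliases_to_move: set[str]) -> Entity:
        """Curator splits aliases that were wrongly attached into a new entity."""
        source = self.entities[source_id]
        source.aliases -= aliases_to_move
        new_entity = Entity(entity_id=new_id, aliases=set(aliases_to_move))
        self.entities[new_id] = new_entity
        return new_entity

    def confirm_alias(self, entity_id: str, alias: str) -> None:
        """Curator confirms an alias the system proposed."""
        self.entities[entity_id].confirmed_aliases.add(alias)

    def mark_authoritative(self, entity_id: str) -> None:
        """Curator promotes a well-evidenced record to an authoritative entity."""
        self.entities[entity_id].authoritative = True
```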

Entity Intelligence: the missing link

The thing we want from a multi-faceted or hybrid approach is a good statistical output of the result: the entity. In other words, we want something we can inspect and examine for ground truth. Knowledge graphs have been an answer for understanding entities and their relationships. These graphs rely on highly curated, salient, authoritative knowledge about entities. But what if we want to understand and gather comprehensive intelligence on entities in a precise manner in near real-time, as we discover it? For that we need a more flexible, mostly autonomous, human-in-the-loop-enabled approach.

One of the firms I love to work with on this is Agolo.

Agolo's entity intelligence solution introduces the dynamic and flexible approach of entity graphs, paired with human-in-the-loop curation. Entity graphs are domain-specific, highly contextualized models of an organization's most relevant entities and their relationships. In unstructured data, in particular, we need to understand relationships between entities in a given domain, but the entity graph must be traceable to the source and well documented. Entity graphs retain the source information while making fabulous discoveries, drawing from the human side (ontologies, domain expertise, etc.) and the machine side, where a plethora of data science correlation and AI tools await.
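A minimal sketch of that core idea, where every entity and relationship carries a pointer back to the document it came from; the entities and sources are invented, and the networkx library is used purely for illustration, not as Agolo's technology:

```python
import networkx as nx

graph = nx.MultiDiGraph()

# Nodes are entities; edges are relationships. Both keep provenance so every
# assertion in the graph is traceable back to a specific source document.
graph.add_node("person:j_smith", type="person", source="report_0142.pdf")
graph.add_node("org:acme_shipping", type="organization", source="report_0142.pdf")
graph.add_node("location:port_alpha", type="location", source="cable_7731.txt")

graph.add_edge("person:j_smith", "org:acme_shipping",
               relation="employed_by", source="report_0142.pdf")
graph.add_edge("org:acme_shipping", "location:port_alpha",
               relation="operates_at", source="cable_7731.txt")

# Trace each relationship back to the document that supports it.
for _, target, data in graph.out_edges("person:j_smith", data=True):
    print(f"j_smith -[{data['relation']}]-> {target} (from {data['source']})")
```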

Agolo also addresses ghost entities well. Ghost entities are great for anomalies, weak signals, and those infamous known-unknown discoveries. Say you're working in a military intelligence unit and a new shadow figure emerges in intelligence briefings. There's initially no record of them, so they are tracked as a new "ghost identity." After ingesting tens or hundreds of data points, though, a fuller picture comes together and the system can develop an informed view of that person and promote the "ghost identity" to an "authoritative identity." Equally important, the intelligence analyst can monitor these decisions and step in to make adjustments, for example by merging two identities that the system deemed distinct. This "human-in-the-loop" function is critical to making the data more accurate. In addition, such adjustments train the AI to make better decisions next time.
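To illustrate that lifecycle, here is a toy sketch in which a ghost identity accumulates supporting data points and is promoted once the evidence crosses a threshold, while the analyst can override at any time; the threshold and field names are invented for illustration:

```python
from dataclasses import dataclass, field

PROMOTION_THRESHOLD = 25  # illustrative: data points needed before automatic promotion

@dataclass
class TrackedIdentity:
    identity_id: str
    status: str = "ghost"                               # "ghost" or "authoritative"
    evidence: list[str] = field(default_factory=list)   # source documents seen so far

    def add_evidence(self, source: str) -> None:
        """Ingest a new data point and promote the ghost once evidence accumulates."""
        self.evidence.append(source)
        if self.status == "ghost" and len(self.evidence) >= PROMOTION_THRESHOLD:
            self.status = "authoritative"

    def analyst_override(self, status: str) -> None:
        """Human-in-the-loop: the analyst can promote or demote regardless of counts."""
        self.status = status

shadow_figure = TrackedIdentity("ghost:unknown_broker_17")
for i in range(30):
    shadow_figure.add_evidence(f"briefing_{i:03d}.txt")
print(shadow_figure.status)  # "authoritative" after enough corroborating data points
```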

Next, it is often hard, with any technique, to bind entities across languages. Add in complexities such as dialect, colloquialisms, cultural references, and jargon, and things get even harder. Using an entity graph approach helps to drill down on the relevant content in ways I haven't seen before.

Entity graphs can also support downstream use cases. First and foremost, Agolo's entity graphs have proven to be a good tool for event detection, especially in emergent areas of science and technology. Entity graphs are also great for fraud detection efforts. In the context of emerging issues, tampering, and malign influence, it has traditionally been critical to define relationships in advance when designing the system. With Agolo, you can leverage advanced technologies to discover critical relationships you didn't even know to look for. Agolo's ability to detect patterns, enigmas, and rapid change using a hybridized methodology gives it a great leg up in making connections between disparate sources faster and more accurately. Similarly, an entity graph can effectively monitor real-time global events and trends, especially across large, diverse repositories of data. As a long-time OSINT practitioner, I can say entity intelligence really helps us deal with the deluge of content, where we have to spend countless hours disambiguating, validating, and cleaning data to even begin our journey into sense-making.

Conclusion

To advance, AI systems require accurate inputs, many of which must come from our abundance of unstructured data. To do this, we must collect intelligence around the entity: intelligence beyond a simple, binary resolution, intelligence that is traceable and explainable and that captures the context required for deep semantic understanding and analysis of complex relationships as new information is discovered. Such a solution requires the best technology available paired with a human-in-the-loop, hybrid approach. After all, we accept human error at a far higher rate than we can or should afford machines. Current AI can make recommendations that approximate reasoning, which speeds up the process, but having a human in the loop is truly essential to guarantee trustworthy and responsible AI.

In the end, this relationship between human and machine only complements our existing business practices by creating efficiencies in areas where humans lose so much of their time. The true benefit, then, if we can learn to trust the inputs and outputs of machines, is giving humans back their most critical resource: time.

What's key here is that entity intelligence helps us get to ground truth accurately and quickly, without the need to pre-define our domains, and gives us a fresh approach to data discovery and knowledge.