
Better accuracy when finding risk

We are proud to announce that we at Safe Online have now integrated Azure OpenAI’s Large Language Models (LLMs) into DataMapper to provide advanced detection capabilities when finding sensitive data.

Why did we do this?

By combining these technologies, we focus on transforming textual context into numerical vectors (embeddings). This allows for a nuanced understanding of the context surrounding sensitive numerical data, such as document numbers, without the need for full-text generation. The aim was to sift effectively through extensive unstructured data and pinpoint potential risk numbers, meaning numbers that may contain sensitive information.

Technological Framework

The approach encompasses a multifaceted technological suite:

  • Optical Character Recognition (OCR): Converts visual documents into text using tools like Textract and Microsoft Azure AI’s OCR, with NLTK and spaCy used to process the extracted text.
  • Pattern Recognition (RegEx): Employs Regular Expressions to identify and validate specific numerical patterns that may represent sensitive information.
  • Context Analysis and Retrieval: Adopts a retrieval algorithm akin to Retrieval Augmented Generation (RAG) to discern the context surrounding risk numbers, enhancing the accuracy of identification.
  • Large Language Model Utilization: Employs Azure OpenAI’s LLMs for converting textual contexts into embeddings, focusing on contextual understanding rather than text generation.
  • Machine Learning Application: Implements straightforward machine learning models for binary decision-making processes, optimizing for efficiency and cost-effectiveness in verifying risk numbers.
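
To make the steps above more concrete, here is a minimal Python sketch of how such a pipeline could fit together. It is an illustration under stated assumptions, not DataMapper's actual implementation: the number pattern, the example phrases, and every function name are hypothetical. The `toy_embed` function is a simple word-count stand-in for the embeddings an LLM would return, and a nearest-centroid rule stands in for the binary machine learning model.

```python
import math
import re

# Hypothetical pattern: 6 digits, an optional dash, then 4 digits.
CANDIDATE = re.compile(r"\b\d{6}-?\d{4}\b")


def find_candidates(text, window=40):
    """Return (number, context) pairs: each regex match plus nearby characters."""
    pairs = []
    for m in CANDIDATE.finditer(text):
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        pairs.append((m.group(), before + " " + after))
    return pairs


def toy_embed(context):
    """Stand-in for an LLM embedding: L2-normalised word counts (assumption)."""
    counts = {}
    for word in re.findall(r"[a-z]+", context.lower()):
        counts[word] = counts.get(word, 0.0) + 1.0
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}


def centroid(vectors):
    """Average a list of sparse vectors."""
    total = {}
    for vec in vectors:
        for w, x in vec.items():
            total[w] = total.get(w, 0.0) + x / len(vectors)
    return total


def dot(a, b):
    """Dot product of two sparse vectors."""
    return sum(x * b.get(w, 0.0) for w, x in a.items())


def is_risk(context, risk_centroid, safe_centroid):
    """Binary decision: is the context closer to the risk examples?"""
    vec = toy_embed(context)
    return dot(vec, risk_centroid) > dot(vec, safe_centroid)


# Hypothetical labelled contexts, invented for this illustration.
RISK_CENTROID = centroid([
    toy_embed("personal id number of the patient"),
    toy_embed("social security number for employee"),
])
SAFE_CENTROID = centroid([
    toy_embed("invoice order reference shipped"),
    toy_embed("tracking order number for parcel"),
])

if __name__ == "__main__":
    doc = "The patient personal id number is 010190-1234 as registered."
    for number, context in find_candidates(doc):
        print(number, is_risk(context, RISK_CENTROID, SAFE_CENTROID))
```

The point of the design is that the regular expression only proposes candidates, while the surrounding words decide whether a number is treated as sensitive. Two numbers with the same format can therefore get different verdicts depending on context. In a real system, `toy_embed` would be replaced by a call to an embedding model and the centroid rule by a trained classifier.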

What this means for our customers

By combining OCR, LLMs, text embedding, and machine learning techniques, DataMapper has significantly enhanced its ability to identify and manage sensitive information within unstructured data. This feature is available to all our customers, across sectors and nationalities. Adopting these technologies has made the process more efficient, and we have done so in a cost-effective way that keeps DataMapper competitive and scalable.

Andy Bosyi

Lead Data Scientist at Safe Online ApS and Cofounder of MindCraft.ai
