
AIM III

Part 1: Our new processing engine

We recently created a new AI and processing engine: AIM III.

AIM III will be implemented in DataMapper to make it faster and more powerful than ever. In the future, you can also expect to see AIM III introduced for the rest of our solutions.  

Let’s consider why it was needed and what the new processing engine does for DataMapper specifically. In part 2 of this series, we’ll consider another aspect of AIM III: its new AI engine.

Why DataMapper needed a new processing engine

DataMapper is an automated data-discovery tool that finds, classifies and monitors personal and sensitive information across all company storage locations and emails, flagging data that poses a potential risk.

To scan all that data quickly from multiple data sources, we developed an engine with advanced data processing capabilities.  

DataMapper needed a processing engine that would allow companies to: 

  • Scan their email and storage locations for documents and images with the highest possible security measures.  
  • Get fast and consistent results. 

Additionally, we wanted to:  

  • Keep DataMapper cost-effective. 
  • Prepare for a multi-tenancy architecture, i.e., store data in a customer’s own tenant. 
  • Scale DataMapper with a global infrastructure.   
  • Create an architecture we could carry over to the rest of our product portfolio in the future; one that could handle a virtually unlimited number of simultaneous user scans, all over the world.


How we built it: Technical details

AIM III was conceived by Razvan Ursachi, Andy Bosyi and the rest of the Safe Online team, with advice from the Danish Alexandra Institute.

It is an event-driven architecture that makes it easy to analyze the behavior of processing flows over time and auto-scale when necessary.
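To make the idea concrete, here is a minimal sketch of the publish/subscribe pattern at the heart of an event-driven design. The class and event names are illustrative, not our production code:

```python
from collections import defaultdict
from typing import Callable

class EventDispatcher:
    """Minimal publish/subscribe dispatcher: handlers register for an
    event type and run whenever that event is published."""

    def __init__(self):
        self._handlers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._handlers[event_type]:
            handler(payload)

# Example: a scan-completed event triggers a downstream step.
dispatcher = EventDispatcher()
results = []
dispatcher.subscribe("scan.completed", lambda e: results.append(e["file"]))
dispatcher.publish("scan.completed", {"file": "contract.pdf"})
```

Because producers and consumers only share event types, new processing steps can subscribe without touching existing code, which is what makes the flows easy to observe and scale independently.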

We used Apache Airflow for data orchestration. Apache Airflow is an open-source tool to author, schedule, and monitor workflows, and one of the most robust platforms data engineers use for orchestrating pipelines. It lets us easily visualize our data pipelines’ dependencies, progress, logs, code, and success status, and trigger tasks.
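An Airflow pipeline is declared as a DAG (directed acyclic graph) of tasks. The sketch below shows the shape of such a definition; the DAG id, task names, and stub callables are illustrative placeholders, not our actual scan pipeline:

```python
# Sketch of an Airflow DAG: each scan flows through
# discover -> classify -> report, and Airflow tracks dependencies,
# retries, and per-task logs for us.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="datamapper_scan",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered per scan event, not on a timer
    catchup=False,
) as dag:
    discover = PythonOperator(task_id="discover_files", python_callable=lambda: None)
    classify = PythonOperator(task_id="classify_files", python_callable=lambda: None)
    report = PythonOperator(task_id="report_findings", python_callable=lambda: None)

    # Declare the pipeline's dependency chain.
    discover >> classify >> report
```

The `>>` operators are what Airflow renders as the dependency graph in its UI, giving us the visualization of progress and task status described above.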

We changed the Airflow executor from Local to Celery to add scalability and multiprocessing. Celery is an open-source asynchronous task queue (or job queue) based on distributed message passing, with support for real-time operation.
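With the Celery executor, work is handed to worker processes that can be scaled out across machines. A minimal Celery task looks like the sketch below; the broker URL, app name, and task body are assumptions for illustration:

```python
# Illustrative Celery setup: a worker pool consumes tasks from a
# message broker, so adding workers adds throughput.
from celery import Celery

app = Celery("datamapper", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3)
def classify_file(self, file_id: str) -> str:
    # Placeholder for the real classification work; max_retries lets
    # Celery re-queue the task if a worker fails mid-job.
    return f"classified:{file_id}"
```

Calling `classify_file.delay("doc-123")` would enqueue the job on the broker, where any available worker picks it up, which is what decouples scan volume from any single machine’s capacity.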

When high volumes of data are processed, the system scales by automatically allocating more resources.

We combined these elements with an event dispatcher triggered by Azure Event Hubs (AEH). AEH was chosen because it is simple, trusted, and scalable, and lets us stream millions of events per second from any source to build a dynamic data pipeline and respond immediately.
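Publishing an event to Event Hubs from Python is a few lines with the `azure-eventhub` SDK. In this sketch the connection string, hub name, and event payload are placeholders, not our production configuration:

```python
# Sketch of publishing a processing event to Azure Event Hubs.
import json

from azure.eventhub import EventData, EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT_HUBS_CONNECTION_STRING>",
    eventhub_name="scan-events",
)

with producer:
    # Events are sent in batches for throughput; the dispatcher on the
    # consumer side reacts to each one downstream.
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"event": "scan.requested", "tenant": "acme"})))
    producer.send_batch(batch)
```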

All AI processes auto-scale, powered by a Kubernetes environment. Kubernetes is an open-source container-orchestration system for automating application deployment, scaling, and management; it was originally designed by Google.

As a database, we use Azure Cosmos DB, a fully managed NoSQL database for modern app development. It delivers single-digit-millisecond response times, its automatic and instant scalability guarantees speed at any scale, and it provides enterprise-grade security.
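Writing a document to Cosmos DB with the `azure-cosmos` SDK looks like the sketch below. The endpoint, key, database, container, and field names are placeholders chosen for illustration:

```python
# Sketch of storing a scan finding as a JSON document in Cosmos DB.
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://<account>.documents.azure.com:443/", credential="<KEY>"
)
container = (
    client.get_database_client("datamapper").get_container_client("findings")
)

# upsert_item creates the document or replaces it if the id exists.
container.upsert_item({
    "id": "finding-001",
    "tenantId": "acme",        # partition key keeps a tenant's data together
    "file": "hr/contract.pdf",
    "risk": "high",
})
```

Partitioning by a tenant identifier is one natural fit for the multi-tenancy goal mentioned earlier, since each customer’s documents then live in their own logical partition.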

We then developed a Service Integration Module (SIM). The SIM adapts DataMapper and our other services to third-party connectors such as Outlook, SharePoint, OneDrive, Amazon S3, Google Drive, Salesforce, HubSpot, etc., in order to fetch file structures and files.
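The value of such a module is that every provider sits behind one small interface. Here is a hypothetical sketch of that abstraction (the class names and the in-memory stand-in are ours for illustration, not the SIM’s actual API):

```python
# A shared connector interface: the engine lists and fetches files the
# same way regardless of which storage provider is behind it.
from abc import ABC, abstractmethod

class StorageConnector(ABC):
    @abstractmethod
    def list_files(self, folder: str) -> list[str]:
        """Return file paths under the given folder."""

    @abstractmethod
    def fetch_file(self, path: str) -> bytes:
        """Download a single file's contents."""

class InMemoryConnector(StorageConnector):
    """Stand-in used here for illustration; a real OneDrive or S3
    connector would call the provider's API instead."""

    def __init__(self, files: dict[str, bytes]):
        self._files = files

    def list_files(self, folder: str) -> list[str]:
        return [p for p in self._files if p.startswith(folder)]

    def fetch_file(self, path: str) -> bytes:
        return self._files[path]

connector = InMemoryConnector({"hr/contract.pdf": b"%PDF..."})
paths = connector.list_files("hr/")
```

Adding support for a new storage provider then means writing one new connector class, with no changes to the scanning engine itself.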

We have deployed our own AI within Kubernetes, based on our own dataset, Archii’s AI, spaCy, and Azure Cognitive Services.

We can now easily add additional AI services, like Microsoft Information Protection, Google DLP, StoredIQ by IBM and others, to ensure scalability and best-of-breed results.

The new architecture allows us to scale and automate all processes during onboarding for new customers in PrivacyHub/DataMapper.

Benefits for our users

The new AIM III processing engine allows the following improvements in DataMapper:

  • Many users can simultaneously start scans and get results quickly.
  • We can now manage load/bandwidth and priorities between customers.
  • The system manages itself automatically, without human intervention.
  • Configurations like resource allocation and security can more easily be handled by the user.

After a user scans their storage locations with DataMapper, they can now view more details about each sensitive file found, including:

  • Metadata: file location, file type, title, etc.
  • Risk findings: high-risk numbers, document category, high-risk keywords, risk names.
  • Highlighted locations of any risk data within each document.
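Highlighting risk data within a document comes down to recording the character offsets of matches so the UI can mark them. A simplified sketch, using one illustrative pattern (a CPR-like number) rather than DataMapper’s real detectors:

```python
# Locate risk data inside a document: record (label, start, end) spans
# for each pattern match so the interface can highlight them in place.
import re

RISK_PATTERNS = {"cpr_number": re.compile(r"\b\d{6}-\d{4}\b")}

def find_risk_spans(text: str) -> list[tuple[str, int, int]]:
    """Return (label, start, end) for every risk match in the text."""
    spans = []
    for label, pattern in RISK_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((label, m.start(), m.end()))
    return sorted(spans, key=lambda s: s[1])

spans = find_risk_spans("Employee CPR: 010190-1234, start date 2021.")
```

In practice such regex detectors are combined with the AI classification described above, since patterns alone cannot judge a document’s overall risk category.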

See for yourself how fast it now is to gather all your company’s sensitive data from multiple storage locations, organize it, then monitor it from one dashboard.

Learn more and get a trial version of DataMapper for free →

Sebastian Allerelli

Governance, risk, and compliance specialist