AIM III - Part 1: Our new processing engine
We recently created a new AI and processing engine: AIM III. AIM III will be implemented in DataMapper to make it faster and more powerful than ever. In the future, you can also expect to see AIM III introduced for the rest of our solutions.
Let’s consider why it was needed and what the new processing engine does for DataMapper specifically. In part 2 of this series, we’ll consider another aspect of AIM III: its new AI engine.
Why DataMapper needed a new processing engine
DataMapper is an automated data-discovery tool that finds, classifies and monitors personal and sensitive information across all company storage locations and emails, flagging data that poses a potential risk.
To scan all that data quickly from multiple data sources, we developed an engine with advanced data processing capabilities.
DataMapper needed a processing engine that would allow companies to:
- Scan their email and storage locations for documents and images with the highest possible security measures.
- Get fast and consistent results.
Additionally, we wanted to:
- Keep DataMapper cost-effective.
- Prepare for a multi-tenancy architecture, i.e., one that stores data in the customer’s own tenant.
- Scale DataMapper with a global infrastructure.
- Create an architecture we could carry over to the rest of our product portfolio in the future: one that could handle a virtually unlimited number of individual user scans at the same time, all over the world.
How we built it: Technical details
AIM III was conceived by Razvan Ursachi, Andy Bosyi and the rest of the Safe Online team, with advice from the Danish Alexandra Institute.
It is an event-driven architecture that makes it easy to analyze the behavior of processing flows over time and to auto-scale when necessary.
We used Apache Airflow for data orchestration. Apache Airflow is an open-source tool for authoring, scheduling, and monitoring workflows, and it is one of the most robust platforms data engineers use to orchestrate workflows or pipelines. It lets us easily visualize our data pipelines’ dependencies, progress, logs, and code, trigger tasks, and check their success status.
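To illustrate what orchestrating a pipeline as a graph of dependent tasks means, here is a minimal sketch using only the Python standard library. It is not Airflow code, and the task names (`fetch_files`, `extract_text`, `classify`, `store_results`) are hypothetical; they do not describe the actual DataMapper pipeline.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "fetch_files": set(),
    "extract_text": {"fetch_files"},
    "classify": {"extract_text"},
    "store_results": {"classify"},
}


def run_pipeline(pipeline, actions):
    """Run every task after all of its dependencies and return the run order."""
    order = []
    for task in TopologicalSorter(pipeline).static_order():
        actions[task]()  # a real orchestrator would also schedule, retry, and log
        order.append(task)
    return order


log = []
actions = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
order = run_pipeline(PIPELINE, actions)
print(order)
```

An orchestrator such as Airflow adds scheduling, retries, logging, and a web view on top of this basic idea of running tasks in dependency order.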
We changed the Airflow executor from the LocalExecutor to the CeleryExecutor to add scalability, multiprocessing, and related capabilities. Celery is an open-source asynchronous task queue (or job queue) based on distributed message passing, with a focus on real-time operation.
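The task-queue pattern can be sketched with the standard library alone. This is an illustration of the idea, not Celery code: in Celery the queue lives in a message broker and the workers are separate processes, often on separate machines, whereas here threads stand in for workers. The job function `count_words` is a hypothetical placeholder.

```python
import queue
import threading


def worker(tasks, results):
    """Pull jobs off the shared queue until a None sentinel arrives."""
    while True:
        job = tasks.get()
        if job is None:
            break
        func, arg = job
        results.put((arg, func(arg)))


def run_jobs(func, args, num_workers=3):
    """Distribute func(arg) calls over num_workers worker threads."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for arg in args:
        tasks.put((func, arg))
    for _ in threads:
        tasks.put(None)  # one stop sentinel per worker
    for t in threads:
        t.join()
    out = {}
    while not results.empty():  # safe: all producers have finished
        arg, value = results.get()
        out[arg] = value
    return out


def count_words(text):
    """Hypothetical job: count the words in a piece of text."""
    return len(text.split())


word_counts = run_jobs(count_words, ["a b c", "hello world", ""])
print(word_counts)
```

Because producers only enqueue work and workers only pull it, adding capacity is a matter of starting more workers, which is what makes the pattern scale.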
When data volumes are high, the system scales automatically by allocating more resources.
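A scaling decision of this kind can be as simple as deriving a worker count from the current backlog. The sketch below is only an illustration: the function name and every number in it (`tasks_per_worker`, `min_workers`, `max_workers`) are assumptions, not the values or the rule AIM III uses.

```python
import math


def desired_workers(queued_tasks, tasks_per_worker=50, min_workers=1, max_workers=20):
    """Return how many workers to run for the current backlog.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    needed = math.ceil(queued_tasks / tasks_per_worker)
    # Clamp so we never scale to zero or beyond the allowed budget.
    return max(min_workers, min(max_workers, needed))


for backlog in (0, 120, 10_000):
    print(backlog, "->", desired_workers(backlog))
```

The lower bound keeps the system responsive when idle, and the upper bound caps cost during spikes.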
We combined these elements with an event dispatcher triggered by Azure Event Hubs (AEH). AEH was chosen because it is simple, trusted, and scalable, and it lets us stream millions of events per second from any source to build a dynamic data pipeline and respond immediately.
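The role of an event dispatcher can be shown with a small in-process sketch. It does not use the Azure Event Hubs SDK; events are plain dictionaries, and the event type `scan_requested` and the `customer` field are hypothetical names chosen for the example.

```python
from collections import defaultdict


class EventDispatcher:
    """Route incoming events to the handlers registered for their type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        """Register handler to be called for every event of event_type."""
        self._handlers[event_type].append(handler)

    def dispatch(self, event):
        """Call every handler for event['type'] and return how many ran."""
        handlers = self._handlers.get(event["type"], [])
        for handler in handlers:
            handler(event)
        return len(handlers)


started = []
dispatcher = EventDispatcher()
dispatcher.subscribe("scan_requested", lambda e: started.append(e["customer"]))

handled = dispatcher.dispatch({"type": "scan_requested", "customer": "acme"})
ignored = dispatcher.dispatch({"type": "unknown_event"})
print(handled, ignored, started)
```

In an event-driven system, the component that emits an event does not need to know which processing step reacts to it, so new steps can be added by subscribing new handlers.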
All AI processes will auto-scale, powered by a Kubernetes environment. Kubernetes is an open-source container-orchestration system for automating computer application deployment, scaling, and management. Kubernetes was originally designed by Google.
As a database, we use Azure Cosmos DB, a fully managed NoSQL database for modern app development. It delivers single-digit millisecond response times, automatic and instant scalability that maintains speed at any scale, and enterprise-grade security.
We then developed a Service Integration Module (SIM). The SIM was needed to adapt DataMapper and our other services to third-party connectors such as Outlook, SharePoint, OneDrive, Amazon S3, Google Drive, Salesforce, HubSpot, etc., in order to fetch file structures and files.
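One common way to structure such an integration layer is a shared connector interface, so that the rest of the system can list and fetch files without knowing which service they come from. The sketch below assumes that approach; the class and method names (`Connector`, `list_files`, `fetch_file`, `InMemoryConnector`) are hypothetical and are not the SIM's actual API.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical common interface that every storage connector implements."""

    @abstractmethod
    def list_files(self):
        """Return the paths of all files the connector can see."""

    @abstractmethod
    def fetch_file(self, path):
        """Return the raw bytes of one file."""


class InMemoryConnector(Connector):
    """Stand-in for a real connector; serves files from a dict."""

    def __init__(self, files):
        self._files = dict(files)

    def list_files(self):
        return sorted(self._files)

    def fetch_file(self, path):
        return self._files[path]


def fetch_all(connectors):
    """Fetch every file from every connector, keyed by (source name, path)."""
    return {
        (name, path): conn.fetch_file(path)
        for name, conn in connectors.items()
        for path in conn.list_files()
    }


sources = {
    "drive": InMemoryConnector({"/hr/contract.txt": b"contract"}),
    "mail": InMemoryConnector({"/inbox/1.eml": b"hello"}),
}
files = fetch_all(sources)
print(sorted(files))
```

A real connector would wrap the third-party service's own API behind the same two methods, which keeps service-specific details out of the scanning code.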
We have deployed our own AI within Kubernetes, based on our own dataset, Archii’s AI, spaCy and Azure Cognitive Services. We can now easily add additional AI services such as Microsoft Information Protection, Google DLP, StoredIQ by IBM and others to ensure scalability and a best-of-breed setup. The new architecture allows us to scale and automate all processes during onboarding for new customers in PrivacyHub/DataMapper.
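Adding further analysis services is easiest when each one is treated as an interchangeable detector whose findings are merged. The following sketch assumes that design. The two detectors are deliberately simple stand-ins (a regular expression and a keyword list), not the AI models named above, and all function names are hypothetical.

```python
import re


def find_emails(text):
    """Hypothetical detector: report email-like strings."""
    return [("email", m) for m in re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)]


def find_keywords(text, keywords=("passport", "salary")):
    """Hypothetical detector: report risk keywords, case-insensitively."""
    lowered = text.lower()
    return [("keyword", k) for k in keywords if k in lowered]


# Adding a new service means appending one more callable to this list.
DETECTORS = [find_emails, find_keywords]


def analyze(text, detectors=DETECTORS):
    """Run every detector on the text and merge their findings."""
    findings = []
    for detect in detectors:
        findings.extend(detect(text))
    return findings


findings = analyze("Salary review for jane.doe@example.com")
print(findings)
```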
Benefits for our users
The new AIM III processing engine allows the following improvements in DataMapper:
- Many users can simultaneously start scans and get results quickly.
- We can now manage load/bandwidth and priorities between customers.
- The system automatically manages itself without human interaction.
- Configurations like resource allocation and security can more easily be handled by the user.
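Managing priorities between customers can be pictured as a priority queue of pending scans. This is a minimal sketch under stated assumptions: the class `ScanScheduler`, the customer names, and the rule that a lower number means higher priority are all hypothetical, and the real scheduling logic may differ.

```python
import heapq
import itertools


class ScanScheduler:
    """Serve queued scans by priority; a lower number is more urgent."""

    def __init__(self):
        self._heap = []
        # Tie-breaker that keeps first-in, first-out order within one priority.
        self._counter = itertools.count()

    def submit(self, customer, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), customer))

    def next_scan(self):
        """Return the next customer to scan, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ScanScheduler()
scheduler.submit("customer_a", priority=2)
scheduler.submit("customer_b", priority=1)
scheduler.submit("customer_c", priority=2)
served = [scheduler.next_scan() for _ in range(4)]
print(served)
```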
After a user scans their storage locations with DataMapper, they can now view more details about each sensitive file found, including:
- Metadata: file location, file type, title, etc.
- Risk findings: high-risk numbers, document category, high-risk keywords, risk names
- Highlighted locations of any risk data within each document
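The details listed above could be represented per file roughly as follows. This is an illustrative data model only: the class names, field names, and the `[[...]]` highlight markers are assumptions, not DataMapper's actual schema or display format.

```python
from dataclasses import dataclass, field


@dataclass
class RiskFinding:
    kind: str   # hypothetical label, e.g. "keyword" or "number"
    value: str  # the matched text
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


@dataclass
class FileReport:
    location: str
    file_type: str
    title: str
    category: str
    findings: list = field(default_factory=list)

    def highlight(self, text):
        """Wrap each finding in [[...]]; assumes findings do not overlap."""
        out, cursor = [], 0
        for f in sorted(self.findings, key=lambda f: f.start):
            out.append(text[cursor:f.start])
            out.append("[[" + text[f.start:f.end] + "]]")
            cursor = f.end
        out.append(text[cursor:])
        return "".join(out)


text = "Passport no. 12345678"
report = FileReport(
    location="/hr/ids.txt",
    file_type="txt",
    title="ids",
    category="identity document",
    findings=[
        RiskFinding("keyword", "Passport", 0, 8),
        RiskFinding("number", "12345678", 13, 21),
    ],
)
marked = report.highlight(text)
print(marked)
```

Storing character offsets alongside each finding is what makes it possible to highlight the exact location of risk data inside a document.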
See for yourself how fast it now is to gather all your company’s sensitive data from multiple storage locations, organize it, then monitor it from one dashboard.