The short answer: DataMapper utilises a combination of technologies such as Text Extraction, OCR, Language Detection, Indexing Keywords, ML and LLM to scan files for sensitive content efficiently. It filters out irrelevant data and only presents files that contain actual sensitive information, ensuring GDPR compliance and data security without the guesswork.
Why finding sensitive expressions is so complex
Identifying sensitive information across a company’s data landscape is never straightforward, whether you rely on manual searches or an automated tool. Here are a few of the reasons why:
- The volume of data: Manually reviewing millions of files, emails, attachments, and documents is slow, inconsistent, and nearly impossible to scale. The sheer volume alone makes traditional methods unfit for the task.
- A growing and evolving list of sensitive expressions: Privacy regulations like GDPR are comprehensive and continuously updated. New risk terms emerge regularly, and each organisation may have its own definition of what’s considered sensitive or business-critical.
- Sensitive data can appear anywhere: It’s not just buried in text documents. Sensitive expressions can be hidden in screenshots, scanned contracts, handwritten notes, or photos of ID cards. Without OCR and automation, every image would have to be reviewed manually, which is both time-consuming and error-prone.
- Context is everything: A word or number becomes sensitive only when it relates to a person. For example, “COVID” or “Muslim” on their own aren’t necessarily sensitive, but in a sentence like “She was dismissed after revealing she had COVID”, the context makes it sensitive under GDPR.
- Language and format complexity: Sensitive information can appear in multiple languages and follow different regional formats. A Danish CPR number looks different from a US social security number, and even the same word can carry different meanings depending on the language.
- Technology limitations: Scanning complex documents like Excel sheets with multiple rows and columns can be particularly challenging. OCR technology typically reads text line by line in normal reading order, but the cells of a spreadsheet are often arranged in ways that don’t follow that order. Consequently, additional logic and processing are necessary to accurately interpret and extract sensitive information from these structured documents.
To address these challenges, we have developed DataMapper.
How does DataMapper search for sensitive data?
Unlike other search methods that rely on basic keyword matching or rule-based scanning, DataMapper follows a sophisticated, AI-powered process to extract, analyse, and validate sensitive data across millions of documents—not in days or weeks, but in hours or even minutes. Many other data discovery solutions can detect certain types of information using pattern recognition or metadata filtering. However, they often fall short when it comes to image-based content, multilingual data, and understanding context. This leads to missed risks or overwhelming false positives, giving teams more noise than value.
DataMapper takes a different approach—using a combination of advanced technologies to deliver results that are not only fast, but highly accurate. It surfaces only the truly sensitive data buried within millions of documents, cutting through the noise and reducing false positives.
- Text Extraction
- Optical Character Recognition (OCR)
- Language Detection
- Indexing Keywords
- Machine Learning (ML) and Regular Expressions (RegEx)
- Large Language Model (LLM)
Did you know that GDPR violations can result in fines of up to 20 million euros or 4% of the company's global annual turnover, whichever is higher?
- European Commission
1. Text Extraction
The first step in identifying sensitive data is to get to the text, no matter where it lives. DataMapper starts by extracting all readable text from every scanned file, whether it’s a standard text file or an image-based format. If the document already contains selectable text (like an email or Word file), DataMapper extracts it directly. But if the document is image-based (for example, a scanned contract, a photo of an ID card, or a screenshot of an email), DataMapper automatically moves to the next step: OCR (Optical Character Recognition).
This seamless handoff ensures that no sensitive data is left behind simply because it’s hidden inside an image.
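The handoff between direct extraction and OCR can be pictured as a simple routing decision. The sketch below is only an illustration, not DataMapper's actual logic: the function name choose_route, the extension lists, and the has_text_layer flag are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical extension groups; a real scanner would inspect file contents too.
TEXT_SUFFIXES = {".txt", ".docx", ".eml", ".msg", ".html"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def choose_route(filename: str, has_text_layer: bool = True) -> str:
    """Return 'direct', 'ocr', or 'skip' for a file name."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "ocr"
    if suffix == ".pdf":
        # A PDF may hold selectable text or only a scanned page image.
        return "direct" if has_text_layer else "ocr"
    if suffix in TEXT_SUFFIXES:
        return "direct"
    return "skip"


if __name__ == "__main__":
    for name in ["letter.docx", "id_card.JPG", "notes.bin"]:
        print(name, "->", choose_route(name))
```

The PDF branch is the interesting case: the same file type may or may not need OCR, so the decision cannot rest on the extension alone.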
2. Optical Character Recognition (OCR)
OCR technology converts all image-based files into searchable text so they can be analysed like any text-based document. Without this step, sensitive data in images would remain invisible to a traditional search.
3. Language Detection
Once the text is extracted, DataMapper identifies the language—a critical step that determines how the content is processed moving forward. This is where the system decides which machine learning and AI models to apply, based on the language detected. Think of it as placing each file on the right processing conveyor belt:
- Words and numbers mean different things in different languages – in the US, “SSN” refers to a Social Security Number, while the Danish equivalent is the “CPR” number.
- Some sensitive data formats are language- or country-specific – A national ID in one country might be completely irrelevant elsewhere.
By detecting the language up front, DataMapper ensures that each file is routed through the correct language-aware models, making subsequent keyword indexing, pattern recognition, and contextual filtering far more accurate. Skipping language detection would lead to a great number of false positives.
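To make the conveyor-belt idea concrete, here is a toy sketch of language detection followed by routing. It counts common stopwords, which is a deliberately simple stand-in for a real language detector; the stopword lists, the pattern names us_ssn and dk_cpr, and the functions detect_language and route are all hypothetical.

```python
import re

# Tiny illustrative stopword lists; real detectors use far richer statistics.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "was", "with"},
    "da": {"og", "er", "det", "en", "af", "til", "med", "ikke"},
}

# Hypothetical routing table: which identifier patterns to apply per language.
PATTERNS_BY_LANGUAGE = {
    "en": ["us_ssn"],
    "da": ["dk_cpr"],
}


def detect_language(text: str) -> str:
    """Pick the language whose stopwords occur most often, or 'unknown'."""
    words = re.findall(r"[a-zæøå]+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def route(text: str) -> list[str]:
    """Return the pattern names that should be applied to this text."""
    return PATTERNS_BY_LANGUAGE.get(detect_language(text), [])
```

With this routing, a Danish document is never checked against the US pattern, which is one way language detection can cut down on false positives.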
4. Indexing Keywords
Once the text has been extracted and the language identified, DataMapper creates a complete index of every word and number across all scanned files. This ensures that no term slips through the cracks—every piece of data becomes searchable and measurable against a carefully built taxonomy of sensitive terms. This taxonomy is designed to align with privacy regulations like the GDPR and has been developed in collaboration with legal and compliance experts.
The taxonomy includes 3 categories of sensitive information:
- Personally Identifiable Information (PII) – e.g. name, date of birth, social security number
- Sensitive Personal Data – e.g. health information, trade union membership, sexual orientation
- Business-Critical Terms – e.g. contracts, budgets, intellectual property documents
The taxonomy is used as a predefined vocabulary of risk indicators and is continuously updated as privacy standards evolve.
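As a rough illustration of indexing plus taxonomy lookup, the sketch below builds an inverted index (each word or number mapped to the files containing it) and checks it against a tiny vocabulary. The TAXONOMY entries and the function names build_index and taxonomy_hits are invented for the example and are far smaller than any real taxonomy.

```python
import re
from collections import defaultdict

# Hypothetical, heavily shortened taxonomy: term -> category.
TAXONOMY = {
    "diagnosis": "Sensitive Personal Data",
    "union": "Sensitive Personal Data",
    "passport": "PII",
    "budget": "Business-Critical Terms",
}


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map every lowercased word or number to the files that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(name)
    return index


def taxonomy_hits(index: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return {file: {category, ...}} for files containing taxonomy terms."""
    hits = defaultdict(set)
    for term, category in TAXONOMY.items():
        for name in index.get(term, ()):
            hits[name].add(category)
    return dict(hits)
```

Because the index is built once, checking a new or updated taxonomy term is a dictionary lookup, not a rescan of every file.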
But identifying keywords alone isn’t enough—we take it a step further by applying machine learning to validate patterns and reduce false positives.
5. Machine Learning (ML) and Regular Expressions (RegEx)
Many types of sensitive data follow recognisable patterns. For example:
- A credit card number is typically 13 to 19 digits long (16 is the most common length) and ends in a check digit.
- A social security number (SSN) might follow a XXX-XX-XXXX format.
- An IBAN (international bank account number) has a country-specific structure.
DataMapper uses RegEx to detect these patterns—but pattern matching alone isn’t enough. This is where machine learning (ML) steps in. ML models help DataMapper understand context and validate what the patterns actually mean in the surrounding text. They differentiate between a real social security number and, say, a phone number that just looks similar.
For example: If “555-55-5555” appears in a document, a simple pattern search might flag it as an SSN. But if it’s actually just a mistyped phone number, the ML model picks up on that and avoids a false positive. In short: ML adds the intelligence that rule-based systems lack—ensuring structured data isn’t just detected, but correctly understood.
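The pattern-matching half of this step can be sketched with Python's re module. Note what the sketch does and does not show: the Luhn checksum used below is a rule-based validation standing in for the validation idea, not the ML model described above, and the names SSN_RE, CARD_RE, luhn_valid and find_candidates are assumptions made for illustration.

```python
import re

# XXX-XX-XXXX shape only; the pattern cannot tell a real SSN from a lookalike.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Check the Luhn checksum used by payment card numbers."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs for pattern hits."""
    results = [("ssn_pattern", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        if luhn_valid(m.group()):  # drop digit runs that fail the checksum
            results.append(("card_number", m.group()))
    return results
```

In this sketch “555-55-5555” is still reported as an SSN-shaped match, because nothing in the pattern distinguishes it from a mistyped phone number. Deciding that requires the surrounding text, which is the part the article attributes to ML.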
6. Large Language Model (LLM)
Even after machine learning and pattern recognition, context still matters. A phone number in a contact list? Probably not sensitive. But that same phone number in a medical report or HR file? That’s a different story. This is where Large Language Models (LLMs) come into play.
LLMs combined with data vectorisation allow DataMapper to go a step further than pattern recognition. They analyse the surrounding language and context to assess whether something is actually sensitive. To do this, DataMapper vectorises relevant text snippets and transforms them into a format that the LLM can understand. The model is trained to determine whether something is sensitive or not.
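To give a feel for what “vectorising a text snippet” means, here is a toy sketch. It cuts a window of words around a flagged term and turns it into a word-count vector that can be compared with cosine similarity. This is an assumption-laden stand-in: production systems generally use learned embeddings, and nothing here reproduces the judgement an LLM makes. The function names context_window, vectorise and cosine are hypothetical.

```python
import math
import re
from collections import Counter


def context_window(text: str, term: str, width: int = 5) -> list[str]:
    """Return up to `width` words on each side of the first hit of `term`."""
    words = re.findall(r"\w+", text.lower())
    if term.lower() not in words:
        return []
    i = words.index(term.lower())
    return words[max(0, i - width): i + width + 1]


def vectorise(words: list[str]) -> Counter:
    """Toy bag-of-words vector: word -> count."""
    return Counter(words)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Restricting the analysis to a window around each hit means only the text near a flagged term has to be passed on for the more expensive context check.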
Real-world example:
“Ben Islam from finance, was a great co-worker. In his spare time he was a member of the lonely hearts dart club and was a great cook. I would recommend him for a promotion. By the way, his painting ‘bypass operation with COVID’ was amazing.”
In this case, the LLM analysis does not detect any sensitive content. Why? Because the bypass operation is mentioned in the context of art, not as a real medical event. Now compare that to this version:
“Ben Islam from finance… By the way, his recent bypass operation was done when he had COVID. It went amazing.”
Here, the LLM correctly detects sensitive content in the form of health information.
This is how LLMs reduce false positives and sharpen accuracy—helping companies avoid sifting through thousands of irrelevant documents. In minutes, you get a refined, accurate set of results that truly require attention—nothing more, nothing less.
The benefits of this search method
DataMapper’s search method has been developed carefully over years of studying sensitive expressions. It is a smarter, faster, and more compliant way of handling the complex task of identifying sensitive data. By combining multiple technologies into one streamlined process, it offers several tangible benefits:
Saves time and resources: Manual searching can take weeks or even months. DataMapper processes millions of files in hours or minutes, dramatically reducing the time spent on data discovery.
Reduces compliance risk: By ensuring that all types of sensitive information—PII, special category data, and business-critical terms—are properly identified, the system helps organisations stay compliant with GDPR, HIPAA, and other data protection laws.
Enhances accuracy: With the help of OCR, ML, and LLMs, DataMapper avoids common false positives and understands the context behind words and numbers. You get fewer false alarms—and more real insights.
Finds what traditional searches miss: Sensitive expressions hidden in images, scanned documents, or multilingual formats are often missed by standard search tools. DataMapper’s layered method ensures nothing slips through.
Adapts to your data: The built-in taxonomy is developed with legal experts and continuously updated. You can also customise it to match your specific needs—ensuring relevance for your industry, data types, and internal policies.
Scalable and automated: Whether you have ten thousand files or ten million, the system scales effortlessly and works across email, cloud storage, file servers, and more—with no manual setup required.
FAQ on this topic
1. Can’t I just use a normal search function to find sensitive data?
A normal search doesn’t look inside images or scanned PDFs, and it can’t detect context. It also returns many false positives, making compliance harder.
2. How does DataMapper know what data is sensitive?
It uses predefined taxonomies, machine learning models, and LLMs to understand context and validate results.
3. Will DataMapper’s search slow down my system?
No, DataMapper runs in the background without disrupting workflows.
4. Does DataMapper search in cloud storage?
Yes, it scans data across cloud services, emails, local storage, and more.
5. How does this method of searching help with GDPR compliance?
It ensures that all sensitive personal data is identified and managed correctly, reducing compliance risks.
Sebastian Allerelli
Founder & COO at Safe Online
Sebastian is the co-founder and COO of Safe Online, where he focuses on automating processes and developing innovative solutions within data protection and compliance. With a background from Copenhagen Business Academy and experience within identity and access management, he has a keen understanding of GDPR and data security. As a writer on Safe Online's Knowledge Hub, Sebastian shares his expertise through practical advice and in-depth analysis that help companies navigate the complex GDPR landscape. His posts combine technical insight with business understanding and provide concrete solutions for effective compliance.