Automated Tagging – Classification of content using AI in SharePoint

The constantly evolving data landscape presents companies with the challenge of preparing data so that anyone who needs it can find it at any time with minimal effort. Where structure is missing and data management is poor, the volume of data to be managed grows faster than necessary.

By using artificial intelligence, we are able to automate work processes in the areas of information and document management. Our solution for automatic keywording in SharePoint uses the following techniques for this:

  • OCR (Optical Character Recognition)
  • NLP (Natural Language Processing)
  • Key phrase detection
  • Automatic text summary
  • Pattern recognition

In this blog post, we will show you the possibilities and functions of these techniques.

The Prerequisite for Automated Tagging: High-Quality Metadata

Imagine that 10 terabytes of data are being migrated in your company and then have to be keyworded manually. And then, of course, every time a new document is created, the corresponding metadata must be defined and entered manually. Who should do this? The author, who is already very busy and only enters the data half-heartedly? Or an IT employee with the necessary technical know-how but lacking specific specialist knowledge? Both options are just as expensive as they are inefficient and are a very good example of what AI solutions can be used for.

Unified data and improved data analysis are key benefits of automated tagging. The process automatically analyzes the documents, extracts their metadata and then classifies the documents according to the defined taxonomy.

The data can come from various sources, such as your file share, a company or data drive, or your Microsoft 365 environment. Digital documents, scanned PDF documents or e-mails can all be used for data analysis. This is where the techniques mentioned at the beginning come in. You can also read our article on this: Digital document management concept – from a confusing drive to modern digital document management.

Optical Character Recognition (OCR)

To make the content of scanned documents or images machine-readable and analyzable, OCR converts the document into characters. First, a layout analysis examines the page structure and separates images from text. Next, blocks of text are broken down into lines, then into individual words and finally into individual characters. The system recognizes these characters and puts them back into their original order, with the difference that it can now read and process them. Once the text has been indexed in this way, the contents of the document can be found using a full-text search or tagged automatically.
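To illustrate the last step, here is a minimal sketch of how recognized text could be indexed for full-text search. It assumes the OCR step has already produced plain text per file; the file names, the sample texts and the helper names `build_index` and `search` are hypothetical and are not part of any SharePoint API.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Hypothetical OCR output: file name -> recognized text.
ocr_output = {
    "scan_001.pdf": "Invoice 4711 for Contoso Ltd, due 30 June",
    "scan_002.pdf": "Delivery note for Contoso Ltd",
}
index = build_index(ocr_output)
print(search(index, "contoso invoice"))  # -> {'scan_001.pdf'}
```

A production search engine adds ranking, stemming and language handling on top of this, but the basic idea of looking up words in an index instead of rescanning every document is the same.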

Natural Language Processing (NLP)

NLP technologies enable machines to understand and interpret natural human language using algorithms and rules. By combining methods from linguistics with modern IT systems and AI, content can be analyzed and information extracted for further processing. Over time, the system independently learns more and more patterns so that it can handle individual questions and problems in a targeted manner. To do this work, the system must understand not only individual words and sentences but also complex relationships and facts across a text.

Key phrase detection

Key phrase detection automatically identifies the keywords that best describe the topic or content of a document. Reading out and extracting the relevant properties (metadata) is fully automated, which means that manual entry of metadata is no longer necessary. The collected metadata can then be output or queried anywhere within your system, such as in the full-text search of your Digital Workplace.
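As a rough illustration of the idea, the following sketch picks the most frequent meaningful words of a text as keywords. It is a simple frequency count under stated assumptions, not the method used by our solution or by any Microsoft service; the stopword list and the function name `extract_keywords` are hypothetical, and the sketch returns single words rather than multi-word phrases.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "are",
    "on", "with", "by", "this", "that", "be", "as", "at", "it",
}


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent words that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


sample = (
    "The invoice lists the invoice number and the payment terms. "
    "Payment is due within 30 days of the invoice date."
)
print(extract_keywords(sample, top_n=2))  # -> ['invoice', 'payment']
```

The resulting keywords could then be written to a metadata column, which is what makes them searchable later.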

Automatic Text Summary

With a text summary, texts can be automatically shortened to a specified number of words without changing the core message. Because manually creating different versions of a text for different purposes is usually very time-consuming, there are many useful applications for summarizing long texts in everyday work. Machine learning ensures that the sentences with the highest information density and the most important meaning are kept unchanged. In day-to-day business, for example, you can see at a glance what is important and reduce the flood of information to the essentials. Or you can make your work easier with reports, reviews or comments that directly display the key statements, advantages and disadvantages. Perhaps for your next meeting? Simply reduce the content of your text or presentation to its key statements automatically.
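The principle of keeping the most informative sentences unchanged can be sketched with a simple extractive approach. The example below scores each sentence by the average frequency of its words and keeps the best ones within a word budget. This is a simplified stand-in for the machine learning models mentioned above; the scoring rule, the stopword list and the names `split_sentences` and `summarize` are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "for", "on", "it", "was"}


def split_sentences(text):
    """Split on whitespace that follows ., ! or ? (a deliberately naive rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercase words of the text without stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_words=30):
    """Keep the highest-scoring sentences, unchanged, within a word budget."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    # Restore the original order so the summary still reads naturally.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling `summarize(report_text, max_words=50)` on a long report would return at most 50 words, made up only of sentences that appear verbatim in the original.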

Pattern recognition

Pattern recognition, in the sense of classifying objects (also called machine vision), is a part of machine learning and is used to identify recurring patterns in documents. Very complex technology is required for a machine to perceive information as visual content. An algorithm classifies the individual documents and independently assigns metadata or tags to each object. For this purpose, the document is divided into individual segments, each of which is assigned a feature. If these features are not meaningful enough, you can support the API by providing the correct assignment yourself.


Practical examples of extraction types

There are several ways to extract metadata. We present five of them here:

PDF Forms: Documents often contain form fields such as customer name, invoice number, date or product ID. Their content can be extracted and mapped to a SharePoint column.

Zone Extraction: Identical or similar documents, such as invoices or instructions, often share the same layout. These documents can be tagged by extracting text from specific areas of the PDF pages.

Document Metadata: Both standard and custom PDF metadata can be extracted and assigned to SharePoint columns. This can also include XMP metadata.

Entity Extraction: By using NLP services it is possible to extract values for information objects such as location, contact person or company.

Text rules: Documents can also be tagged by comparing their content with terms from the SharePoint Term Store. If the content of a document matches a term in the term store, the corresponding term is automatically added to the document as a tag.
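Of the extraction types above, zone extraction is perhaps the easiest to illustrate in a few lines. The sketch below assumes that a PDF or OCR library has already delivered each word together with its position on the page. The `Word` and `Zone` classes, the coordinates and the column names are hypothetical and only show the principle of reading text from a fixed area.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word on the page
    y: float  # top edge of the word on the page


@dataclass
class Zone:
    column: str  # hypothetical target SharePoint column name
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word):
        """True if the word's anchor point lies inside the rectangle."""
        return (self.left <= word.x <= self.right
                and self.top <= word.y <= self.bottom)


def extract_zones(words, zones):
    """Return {column: text found inside that zone}, in reading order."""
    result = {}
    for zone in zones:
        inside = sorted((w for w in words if zone.contains(w)),
                        key=lambda w: (w.y, w.x))
        result[zone.column] = " ".join(w.text for w in inside)
    return result


# Hypothetical words from the first page of an invoice.
page_words = [
    Word("Invoice", 50, 40), Word("No.", 110, 40), Word("4711", 150, 40),
    Word("Contoso", 50, 120), Word("Ltd", 120, 120),
]
invoice_zones = [
    Zone("InvoiceNumber", left=140, top=30, right=250, bottom=50),
    Zone("Customer", left=40, top=110, right=250, bottom=130),
]
print(extract_zones(page_words, invoice_zones))
# -> {'InvoiceNumber': '4711', 'Customer': 'Contoso Ltd'}
```

Because the zones are defined once per layout, the same definition can be reused for every document that follows that layout.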

Our AI solutions compare your texts with the contents of your term store. In addition, they automatically add terms that appear frequently in your texts. This means that manual maintenance of the term store is no longer necessary.
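Both steps can be sketched in a simplified form: matching a text against a list of known terms, and suggesting frequent words that are not yet in the list. The term list here is a plain Python list standing in for the term store, and the names `match_terms` and `suggest_terms` as well as the thresholds are assumptions for illustration, not SharePoint functions.

```python
import re
from collections import Counter


def match_terms(text, terms):
    """Return the terms that occur in the text as whole words or phrases."""
    found = []
    for term in terms:
        # Lookarounds prevent matches inside longer words; case is ignored.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


def suggest_terms(text, terms, min_count=3):
    """Suggest frequent words (4+ letters) that are not yet known terms."""
    known = {term.lower() for term in terms}
    counts = Counter(re.findall(r"[a-z]{4,}", text.lower()))
    return [word for word, n in counts.most_common()
            if n >= min_count and word not in known]


term_store = ["Invoice", "Contract"]  # stand-in for the real term store
document = (
    "The invoice covers the warranty. Warranty claims need the invoice. "
    "The warranty ends in May."
)
print(match_terms(document, term_store))    # -> ['Invoice']
print(suggest_terms(document, term_store))  # -> ['warranty']
```

The suggestion step is deliberately crude: it has no stopword filtering beyond the length limit, so in practice the candidates would need further filtering before they are added as terms.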


Automated tagging makes managing information and documents much easier. Data is found faster because the content is logically structured. Our AI solutions support you both in the migration of your content and in your daily work.

Since October 1st, 2020, the new Syntex AI in SharePoint has made it possible to analyze the content of all company documents so that other applications can process them. The user’s documents are analyzed, and the extracted information is stored as metadata, which is then used to classify the documents by content. SharePoint can recognize form entries, addresses, calendar dates and personal names in this way. As a result, business correspondence can be assigned automatically, for example, and invoices and delivery notes can be captured automatically.

If you need help with your Microsoft 365 or SharePoint project, find out more about our Microsoft Office 365 and Microsoft consulting firm.