Auto Classify Documents in SharePoint using Azure Machine Learning Studio Part 1

I was trying to find a way to get the text representation of documents stored in OneDrive or SharePoint Online, so that I could run some text analysis techniques on them using Azure Machine Learning Studio. I'm fairly sure (though this is really just a guess) that the SharePoint search index stores a text representation of each document, but as far as I can tell that version of the document is not exposed to us.

Hmm #CognitiveServices can now tell you if text has profanity or derogatory terms. https://t.co/2dumIM2dkR

Waiting on @onedrive #ContentServices to return docx as txt then it'd be two @MicrosoftFlow actions.

Make a great O365 scanner that would flag docs to follow up

— John Liu 劉 (@johnnliu) March 28, 2018

I also happen to know (this one is not a guess) from the good old days that SharePoint search uses IFilters to extract file content as text and then stores that text in the index. I decided to try a different route.

So I figured, how about doing this document-to-text conversion myself? I found Tika handy: it is Apache's open-source text extraction library, and it has been ported to .NET as a NuGet package.

I've created a simple Azure Function using Visual Studio. It has been a while since I used the full-fledged Visual Studio, as I've mostly been using Visual Studio Code lately. As you can see, the Azure Function is pretty straightforward: just 4 lines of code to convert a docx file to its text representation, so we can apply any text analysis technique to our SharePoint documents.

using System.Linq;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Azure.WebJobs.Host;
using TikaOnDotNet.TextExtraction;

namespace doc2text
{
    public static class Convert
    {
        [FunctionName("Convert")]
        public static async Task<HttpResponseMessage> Run(
            [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = null)] HttpRequestMessage req,
            TraceWriter log)
        {
            log.Info("C# HTTP trigger function processed a request.");
            // Read the raw bytes of the uploaded document from the request body.
            byte[] data = await req.Content.ReadAsByteArrayAsync();
            // Tika detects the file format and extracts the plain text.
            var textExtractor = new TextExtractor();
            return req.CreateResponse(HttpStatusCode.OK, textExtractor.Extract(data).Text);
        }
    }
}


Now let's hook this up to a simple Flow that is triggered when a new file is uploaded to a specific SharePoint library.
Once started, the flow calls the Azure Function, which extracts the text representation of the Office document. The flow then sends that text to a web service that runs the text analysis and returns the document classification value.

File Content

Method: Post
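If you want to try the function outside of Flow, the same call is easy to reproduce by hand. Below is a minimal sketch of the kind of request the HTTP action makes, assuming the raw file bytes are sent as the POST body (which is what the function above reads). The URL is a placeholder, and `BuildConvertRequest` is a helper name made up for this illustration; neither comes from the original flow.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

// Assumption: placeholder URL, replace with the address of your own deployed function.
var functionUrl = new Uri("https://example.invalid/api/Convert");

// Stand-in bytes; in practice this would be the content of a real .docx file,
// for example File.ReadAllBytes("article.docx").
byte[] fileBytes = { 0x50, 0x4B, 0x03, 0x04 };

using var request = BuildConvertRequest(functionUrl, fileBytes);
Console.WriteLine($"{request.Method} {request.RequestUri} ({fileBytes.Length} bytes)");

// To actually send it (needs a reachable function):
// using var client = new HttpClient();
// var response = await client.SendAsync(request);
// Console.WriteLine(await response.Content.ReadAsStringAsync());

// Hypothetical helper: builds a POST whose body is the raw file content,
// mirroring what the function reads with req.Content.ReadAsByteArrayAsync().
static HttpRequestMessage BuildConvertRequest(Uri url, byte[] content)
{
    var body = new ByteArrayContent(content);
    body.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
    return new HttpRequestMessage(HttpMethod.Post, url) { Content = body };
}
```

Keeping the request construction in its own helper means you can inspect the method, headers and body without any network access, and only uncomment the send once the function is deployed.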

Then, within the flow itself, we can update the SharePoint document's properties and set its classification according to the text analysis result.

File properties

Hint: For now, we will treat the web service call in this flow (the action named HTTP2) as a black box. To give you a sneak peek, it is based on a multiclass neural network classification algorithm built using Azure Machine Learning Studio, and we will discuss this particular building block in more detail in part 2 of this series.

Now let's upload a new Word document containing the text of a business article and see the updated category value.

Classifications – Business

Here we go: our smart document categorization flow is able to classify the document as a business document.

In the next part of this blog series, we will discuss the Azure Machine Learning Studio experiment in more detail.

About the Author:

Amr Fouad is a Technology Evangelist, SharePointer, @OfficeDev MVP, Speaker, Anime addict, speed ‘nd sugar junkie and a huge believer!

Reference: Fouad, A. (2018). Auto Classify Documents in SharePoint using Azure Machine Learning Studio Part 1. Available at: http://www.sharepointtweaks.com/2018/04/auto-classify-Office365-content-using-azure-machine-learning-studio-part1.html
