Introduction to Open XML SDK 2.5

This post will describe the basic concept and usage of Open XML SDK 2.5 for Microsoft Office. We will talk about why Open XML SDK exists and when to use it.

Background

In the 90s, it was very difficult to manipulate or modify the content of an Office document. Complex coding and structuring were required to read or update document contents, and the only way to modify an existing document, or to create a new one, was to do it byte by byte. Later on, COM add-ins were introduced for .NET that let you automate the Office applications. Using add-ins and OWC (Office Web Components) you can interact with the MS Office user interface, and that's still the best solution if you need the complete functionality of Word, Excel, and PowerPoint. But if you just want to modify the document content, is it really the best solution?

Loading an Office application and automating it takes time, and it is really quite slow to open a bunch of documents and make a change to each of them. What if you could modify the content, or retrieve information, programmatically without even owning a copy of Microsoft Office? With the Open XML file formats, that's completely possible.

Open XML SDK 2.5

Open XML is an open standard based on the well-known ZIP and XML file formats. Open XML stores all files in a ZIP archive for packaging and compression. The Open XML SDK is a collection of strongly-typed classes built on top of the System.IO.Packaging API. An MS Office document keeps all its information in the form of XML, and the Open XML SDK lets you manipulate that XML through an object model constructed for the document parts. These manipulations can be done without installing any Office products.

The beauty of the Open XML SDK is that it supports markup compatibility in a way that makes it easy to open a document created in the latest version in an older version. In addition, you don’t work with XML directly but with strongly typed classes.

A limitation of the Open XML SDK is that it only supports Word, Excel, and PowerPoint document formats, i.e. DOCX, XLSX, and PPTX. Any document created in Office 2007 or later can be manipulated with the Open XML SDK, but for documents created in older versions of Office other add-ins have to be used. Let's see what else is not supported:

  • It doesn’t convert Open XML formats to or from other formats like PDF, HTML, or XPS. If you want to do that, you need to use the respective Office application.
  • It doesn’t guarantee that the document stays valid when you change its content, so it is possible to build an invalid document.
  • It doesn’t provide application behaviors such as layout functionality, data refreshing, or recalculation.

As we discussed above, the Open XML format is built from multiple document parts, and these parts are kept in a zipped folder structure. If you want to see the document content, just rename “template.docx” to “template.docx.zip”, and you will be able to see the complete content hierarchy. If you don't know how Word (or Excel or PowerPoint) stores text in its files, the Open XML SDK Productivity Tool makes it very easy to examine the zipped package contents.

 

The above image shows the Open XML SDK Productivity Tool in action: it displays the contents of a DOCX file in a relational, tree-like structure, while general ZIP extraction software (left) shows the bare contents of the file.

A basic Open XML file consists of:

_rels: A directory containing all relationship information among document parts.
docProps: A directory containing core document properties.
[Content_Types].xml: A file used to determine the content type of each part and to keep information about markup language of the document.

Apart from the above, a directory specific to the document type also exists. It contains the core content of the document and is named word, xl, and ppt for DOCX, XLSX, and PPTX respectively.

If you notice, the document parts are stored as XML markup, so their contents can be queried with XPath. Each part has a part name that consists of a sequence of segments, i.e. a path name such as "/word/theme/theme1.xml". In this post, I am not going to explain the Open XML markup languages that are used to manipulate the document.
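
If you would rather look inside the package from code than rename the file, the following is a minimal sketch that lists the entries of the ZIP archive with System.IO.Compression (part of the .NET Framework since 4.5, not part of the Open XML SDK). The class and method names are my own and only for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class PackageInspector
{
    // Returns the full name of every entry in an Open XML package,
    // which is an ordinary ZIP archive.
    public static IList<string> ListPartNames(Stream packageStream)
    {
        // leaveOpen: true, so the caller keeps ownership of the stream
        using (var archive = new ZipArchive(packageStream, ZipArchiveMode.Read, true))
        {
            return archive.Entries.Select(e => e.FullName).ToList();
        }
    }
}
```

To try it, open the file with File.OpenRead("template.docx") and pass the stream to PackageInspector.ListPartNames. For a Word document, the list should include [Content_Types].xml and entries under _rels, docProps, and word.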

Open XML SDK Example

Let's take an example of how we can merge content using the Open XML SDK. Let’s say we have a template which contains some merge fields, and we would like to replace those merge fields with actual data. Make sure each merge field name matches the corresponding name in your data source. The data source could be a text file, CSV, Excel, XML, or a class.
In this example I have 6 merge fields in the document: Date, Recipient, Title, Comment, Reviews, and Author. These fields should be replaced with the actual data provided by the data source. The template document looks like the one below.

Open XML stores merge field information as shown in the image below. I assume that you have an idea of the relationship among Paragraph, Run, and Text. Merge fields in Open XML are represented as SIMPLEFIELD elements, which can contain child RUN elements. The field name is represented as a child TEXT element inside the RUN element. A RUN element can also have a RUNPROPERTIES element with additional layout information about the displayed text. We don't want to lose that information, because we'd like our data to keep the same layout as the merge field has in the template.
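
To make that structure concrete, here is a small sketch that parses a hand-written paragraph with LINQ to XML. The sample markup is an assumption for illustration and is not taken from the template used in this post. In the raw XML, the SDK's SimpleField, Run, RunProperties, and Text classes appear as the w:fldSimple, w:r, w:rPr, and w:t elements.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical demo class; the markup below is hand-written sample data.
public static class SimpleFieldMarkupDemo
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // A paragraph holding one simple field with a bold run.
    public const string SampleMarkup =
        "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
        "<w:fldSimple w:instr=\" MERGEFIELD Title \">" +
        "<w:r><w:rPr><w:b/></w:rPr><w:t>«Title»</w:t></w:r>" +
        "</w:fldSimple></w:p>";

    // Returns the raw field instruction of the first simple field.
    public static string ReadInstruction(string paragraphXml)
    {
        XElement field = XElement.Parse(paragraphXml).Descendants(W + "fldSimple").First();
        return (string)field.Attribute(W + "instr");
    }

    // Returns the placeholder text displayed by the runs inside the field.
    public static string ReadDisplayedText(string paragraphXml)
    {
        XElement field = XElement.Parse(paragraphXml).Descendants(W + "fldSimple").First();
        return string.Concat(field.Descendants(W + "t").Select(t => t.Value));
    }
}
```

The instruction attribute carries the field code, while the text inside the run is only the placeholder that Word displays. That is why the merge replaces the text but leaves the run properties alone.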

  • First of all, create a console application in VS2015 or VS2013. Add 2 references:
    • WindowsBase assembly
    • DocumentFormat.OpenXml.dll
  • Create a directory named “Template” in the project, place your template in it, and include the file in the project.
  • Create your data source. I have created an entity model called “model.edmx” to get the data source from a database. The EDMX contains only one table, called DocumentContent, which has the following fields
  • Create a class to connect to the database and write a function which returns the data source for the merge fields. My class is named “DatabaseConnector.cs”. The code returns the first DocumentContent record.
    public class DatabaseConnector
    {
        // Requires "using System.Linq;" for FirstOrDefault()
        public Model.DocumentContent GetDocumentContent()
        {
            using (var context = new Model.OpenXmlDemoEntities())
            {
                return context.DocumentContents.FirstOrDefault();
            }
        }
    }
  • Create a class to manipulate the template document. My class is named “MergeDocument.cs”. The code creates a property and a constant string in the class. Open XML requires the markup language namespace to understand the document part.
    Template: stores the template information.
    WordproMLNamespace: stores the template markup language namespace.
      public FileInfo Template { get; private set; }
      public const string WordproMLNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
  • The constructor validates the value for the Template property. If the caller doesn't provide valid input to process the document, the code throws an exception.
       public MergeDocument(string templatePath)
       {
           if (string.IsNullOrEmpty(templatePath))
               throw new ArgumentNullException("templatePath", "The template file path can't be empty.");
           FileInfo template = new FileInfo(templatePath);
           if (!template.Exists)
               throw new FileNotFoundException("The template file does not exist at the specified path.", templatePath);
           // Only DOCX is accepted: the Open XML SDK cannot read the older binary .doc format
           if (!template.Extension.Equals(".docx", StringComparison.OrdinalIgnoreCase))
               throw new ArgumentException("The template must be a .docx file.", "templatePath");
           Template = template;
       }
  • Create a function MergeTemplate in this class which actually merges the fields in the provided template. This function takes 2 parameters as input: a data source and a target file name. The code first validates the input parameter values, then opens the template file and loads it into a memory stream. This memory stream is then modified through a WordprocessingDocument object.

    The code also gets the data source object's properties so it can match property names with merge field names. It extracts all properties of the data source object through reflection. I am assuming that you already know about the System.Reflection namespace.

    Then it reads all the merge fields from the document as SimpleField objects. It loops through each field, extracts the merge field name, and tries to find a match among the data source properties. If a match is found, it replaces the merge field text with the data.

    Finally, it saves the updated document into the memory stream and writes the stream content to the target file, here “MergedDocument.docx”. Because only a file name is passed, the file is created in the current working directory, which is normally the executable assembly folder. You can specify a full path if you want to save it in a particular directory.

    public void MergeTemplate(Model.DocumentContent datasource, string targetFileName)
    {
        if (datasource == null)
            throw new ArgumentNullException("datasource", "The data source cannot be null.");
        if (string.IsNullOrEmpty(targetFileName))
            throw new ArgumentNullException("targetFileName", "The target file name cannot be null or empty.");

        // Open template
        byte[] sourceBytes = File.ReadAllBytes(Template.FullName);
        using (MemoryStream memoryStream = new MemoryStream())
        {
            // Load into memory, so the template file itself is never modified
            memoryStream.Write(sourceBytes, 0, sourceBytes.Length);

            using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(memoryStream, true))
            {
                // Get DocumentContent properties
                Type myType = datasource.GetType();
                IList<PropertyInfo> props = new List<PropertyInfo>(myType.GetProperties());

                // Get all merge fields from the document
                foreach (SimpleField field in wordDocument.MainDocumentPart.RootElement.Descendants<SimpleField>())
                {
                    // Simple extraction that only works without switches; replace this with the code shown later
                    string fieldName = field.GetAttribute("instr", WordproMLNamespace).Value.Replace("MERGEFIELD", string.Empty).Trim();

                    // Match the field name with a property name and replace the text with data
                    var property = props.FirstOrDefault(p => p.Name == fieldName);
                    if (property != null)
                    {
                        // Convert.ToString returns an empty string for a null value instead of throwing
                        var value = Convert.ToString(property.GetValue(datasource));

                        field.Descendants<Run>().ToList().ForEach(
                            run => run.Descendants<Text>().Where(d => d.Text.Contains("«" + fieldName + "»")).ToList().ForEach
                            (s => { s.Text = s.Text.Replace("«" + fieldName + "»", value); })
                        );
                    }
                }

                // Save the document part
                wordDocument.MainDocumentPart.Document.Save();
            }

            // Write the file only after the document is closed, so the package is fully flushed to the stream
            using (FileStream fileStream = new FileStream(targetFileName, FileMode.Create))
            {
                memoryStream.WriteTo(fileStream);
            }
        }
    }
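
    The reflection step can also be looked at in isolation. The sketch below is not the code used above: it builds a name-to-text dictionary from the public properties of any object, which is one way to do the lookup once instead of searching the property list for every field. SampleContent is a hypothetical stand-in for the DocumentContent entity, and PropertyLookup is a name I made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for the DocumentContent entity.
public class SampleContent
{
    public string Title { get; set; }
    public string Author { get; set; }
    public DateTime? Date { get; set; }
}

// Hypothetical helper, shown only to illustrate the reflection step.
public static class PropertyLookup
{
    // Maps each public instance property name to its value as text.
    // Null values become empty strings.
    public static IDictionary<string, string> ToFieldValues(object source)
    {
        if (source == null) throw new ArgumentNullException("source");

        return source.GetType()
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            // skip write-only properties and indexers
            .Where(p => p.CanRead && p.GetIndexParameters().Length == 0)
            .ToDictionary(
                p => p.Name,
                p => Convert.ToString(p.GetValue(source, null)) ?? string.Empty);
    }
}
```

    With such a dictionary, the merge loop could call TryGetValue(fieldName, out value) for each field. Note that dates and numbers are formatted with the current culture here, so apply your own format if the document needs a specific one.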
  • There are multiple ways to get the merge fields from a document. For example, you can get the document Body and use a LINQ query to select every element that carries an instr attribute:

    Body newBody = wordDocument.MainDocumentPart.Document.Body;

    IList<OpenXmlElement> mailMergeFields =
                         (from el in newBody.Descendants()
                          where el.GetAttributes().Any(a => a.LocalName == "instr")
                          select el).ToList();

    If you look at the code in the class above, I am extracting the field name from the field's instr attribute. That code only works if the field code contains nothing but the field name, without switches.

    If you set other switches on the merge field, the above code will not work as intended. For example, I would like to display my merge field value in FirstCap format and insert some text before the merge field. So the field code will be as follows:

    So a better way to get the merge field name is to take the substring after MERGEFIELD and ignore the rest of the switch settings. It is possible that a field doesn’t have any switches; in that case IndexOf(' ') will be -1.

    var fieldName = "";
    var instruction = field.GetAttribute("instr", WordproMLNamespace).Value;
    // take everything after the MERGEFIELD keyword (Word often pads the instruction with spaces)
    var keywordIndex = instruction.IndexOf("MERGEFIELD");
    var substring = keywordIndex < 0 ? "" : instruction.Substring(keywordIndex + "MERGEFIELD".Length).Trim();
    // get string before first occurrence of space
    if (substring.IndexOf(' ') > -1)
    {
        fieldName = substring.Substring(0, substring.IndexOf(' '));
    }
    else
        fieldName = substring;
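
    If you want to test the extraction on its own, one option is to put it in a small helper that works on plain strings. This is a sketch with names of my own choosing (MergeFieldParser, ExtractFieldName). Beyond what is described above, it also accepts a field name in double quotes, which Word uses when the name contains spaces. The instruction strings in the comment are made-up examples.

```csharp
using System;

// Hypothetical helper for parsing MERGEFIELD instructions.
public static class MergeFieldParser
{
    private const string Keyword = "MERGEFIELD";

    // Returns the field name from a field instruction such as
    //   MERGEFIELD Title \* FirstCap \b "Dear "
    // or an empty string when the instruction is not a MERGEFIELD.
    public static string ExtractFieldName(string instruction)
    {
        if (string.IsNullOrWhiteSpace(instruction)) return string.Empty;

        int keywordIndex = instruction.IndexOf(Keyword, StringComparison.OrdinalIgnoreCase);
        if (keywordIndex < 0) return string.Empty;

        // Everything after the keyword, without surrounding whitespace
        string rest = instruction.Substring(keywordIndex + Keyword.Length).Trim();
        if (rest.Length == 0) return string.Empty;

        // A quoted name runs up to the closing quote
        if (rest[0] == '"')
        {
            int closingQuote = rest.IndexOf('"', 1);
            return closingQuote > 0 ? rest.Substring(1, closingQuote - 1) : rest.Substring(1);
        }

        // An unquoted name ends at the first space; switches follow it
        int spaceIndex = rest.IndexOf(' ');
        return spaceIndex > -1 ? rest.Substring(0, spaceIndex) : rest;
    }
}
```

    In the merge loop you would then pass the instr attribute value to MergeFieldParser.ExtractFieldName and skip the field when the result is empty.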

     

  • Now, call the above class from the main class. For this, I declare 3 properties. Load the template into a FileInfo object and pass its path to the MergeDocument class, then call the merge template method. The output of the template is shown below.

 

   class Program
   {
       public static FileInfo Data { get; set; }
       public static FileInfo Template { get; set; }
       public static DirectoryInfo Target { get; set; }

       static void Main(string[] args)
       {
           FromDatabaseData();
       }

       private static void FromDatabaseData()
       {
           try
           {
               // Test data files: template and data
               var directory = System.IO.Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location) + @"\Template";
               Template = new FileInfo(directory + @"\DemoTemplate.docx");
               Target = new DirectoryInfo(directory);

               // Load data from the database
               DatabaseConnector dbConnector = new DatabaseConnector();
               Model.DocumentContent documentContent = dbConnector.GetDocumentContent();

               // Merge document
               MergeDocument mdoc = new MergeDocument(Template.FullName);
               mdoc.MergeTemplate(documentContent, "MergedDocument.docx");

               Console.WriteLine("File has been created in the executable assembly folder. Press enter to exit....");
               Console.ReadLine();
           }
           catch (Exception ex)
           {
               Console.WriteLine(ex);
               Environment.Exit(1);
           }
       }
   }




I look forward to hearing from you about how to improve the quality of this post.
