Gen AI-Driven HTML To DITA Converter For a SaaS Provider

How a US SaaS provider transformed their services with an MVP of the AI-powered converter, achieving an impressive 85% accuracy rate in just six weeks and simplifying traditionally time-consuming CCMS adoption.

Industry: Technology

Services: Software Product Development, SaaS Development, MVP Development, AI Development

Business challenge

A company’s data is a double-edged sword. When well organized, it brings game-changing insights. But if it isn’t consistent and unified, organizations cannot capture the full value of their data resources.

SaaS software for content management is one of the ways to keep data in order. Usually, such tools operate with the DITA format, which is the gold standard for handling documentation at the enterprise scale. The problem is that adopting software with specific data format requirements can be a tricky call for companies with heterogeneous documentation in DOCX, HTML, PDF, etc.

To take advantage of a solid component content management system (CCMS), they first need to convert their data to DITA format. As of 2023, SaaS CCMSs on the market didn’t provide built-in converters, forcing organizations to pay extra for custom solutions for converting documentation into the required format.

Our client – a US company building a SaaS CCMS – saw this gap as an opportunity to outshine competitors and win new customers. They decided to capitalize on artificial intelligence for conversion operations and offer their clients a pre-CCMS AI-powered converter to DITA format to facilitate the adoption of their CCMS.

They reached out to *instinctools to develop an MVP of the future tool with a gen AI linchpin. How did our data scientists and software engineers pull it off?

Solution

For starters, our dedicated team aligned with the client regarding their fundamental ambitions and limitations, which were:

  • Covering as many input formats as possible
  • Rolling out an MVP in seven weeks max to secure a leading spot in the CCMS market

Given the tight timeline, it was impossible to check both boxes. We suggested starting with one input format – HTML – for the sake of development speed, then adding others at the post-MVP stage.

  1. Betting on an LLM-driven approach

As the client wanted to go beyond creating a garden-variety tool and aimed to provide their customers with intelligent software at a budget-friendly cost, we proposed covering HTML to DITA conversion with a large language model (LLM). This path provided:

  • Seamless scalability. We can start with HTML and train the model to work with other input formats later.

  • Unlimited evolution potential. An LLM coupled with natural language processing (NLP) technology can go beyond simple document conversion. Gen AI-driven software enables end users to shorten or summarize a text, change its style, etc., while converting the file from one format to another (see the illustrative prompt below).
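
For instance, a single request could bundle conversion with those NLP extras. The prompt below is a hypothetical illustration of the idea, not the product’s actual wording:

```python
# A hypothetical prompt (for illustration only) showing how one LLM call can
# combine format conversion with NLP extras such as summarizing or restyling.
html_source = "<html><body><h1>Setup</h1><p>Long installation guide...</p></body></html>"

prompt = (
    "Convert the following HTML document into a valid DITA topic. "
    "While converting, summarize each section to at most three sentences "
    "and keep a neutral, formal tone.\n\n"
    f"{html_source}"
)
```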

Strict deadlines also had their sway on our choices. We bet on pre-trained language models instead of training one from scratch, fast-tracking the project and allowing the client to save budget for the product’s further growth.

  2. Testing various LLMs to identify the most beneficial option

Our team reviewed language models of different sizes, from huge ones with almost two trillion parameters to smaller models with 7–70 billion parameters:

  • GPT-4 8x220B
  • Llama 2 13B, 70B
  • CodeLlama 34B
  • Yi 34B
  • Mistral 7B
  • Mixtral 2x7B, 8x7B
  • Dolphin 2.5 Mixtral 8x7B

The tests indicated that in the client’s case, opting for a smaller LLM would be more favorable for two main reasons:

  • They are easier and faster to train. Even though the model is pre-trained, the dedicated team still has to devote time to its fine-tuning
  • They have lower computational requirements. Such language models can run on less powerful hardware, while huge LLMs require numerous GPU processors. Thus, with a smaller LLM, the client could get to the converter development right away without replacing or strengthening their current hardware park

Our data scientists chose Mistral 7B among the smaller LLMs: despite its modest size, it performs on par with or better than models such as CodeLlama 7B and Llama 2 13B.

Moreover, Mistral 7B use cases perfectly align with the client’s CCMS features:

  • Classifying and categorizing text to facilitate the management of vast data volumes
  • Summarizing and generating blocks of text within natural language processing to provide end-users with more ways to make use of the company’s data

However, selecting the right LLM was only half the battle. The model’s fine-tuning was on the horizon.

  3. Fine-tuning the LLM to strike a balance between memory consumption and accuracy

The fine-tuning of Mistral 7B proceeded along several directions, aiming to provide at least an 80% accuracy rate for more complex tasks, such as summarizing the initial document or changing its tone of voice, without increasing the client’s current computing power.

What is the accuracy rate?
Accuracy rate = number of correct answers / total number of answers to the queries

Reducing memory consumption

Extensive memory consumption is often one of the major roadblocks for companies that want to adopt an LLM. On average, you need 16 GB of GPU memory per device to train even a smaller model with just one billion parameters. Thus, the larger the language model you want to embrace, the more computational resources you’ll need.
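
A quick back-of-the-envelope estimate shows where that figure comes from (our illustration, assuming full fine-tuning with the Adam optimizer in fp32):

```python
# Full fine-tuning keeps weights, gradients, and two Adam optimizer states in
# memory: roughly 4 + 4 + 8 = 16 bytes per parameter, before activations.
params = 1_000_000_000            # a "small" 1B-parameter model
bytes_per_param = 4 + 4 + 8       # fp32 weights + gradients + Adam moments
print(f"{params * bytes_per_param / 2**30:.1f} GiB")  # ~14.9 GiB, i.e. ~16 GB
```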

There are two ways to deal with high memory consumption:

  • Updating your hardware to match the LLM’s appetite
  • Utilizing special fine-tuning techniques to reduce memory consumption

Driven by time and budget limitations, we chose the second approach and went with a parameter-efficient fine-tuning (PEFT) method, namely quantized low-rank adaptation (QLoRA), instead of the full fine-tuning approach.

QLoRA combines high-precision computation with low-precision storage: the base model’s weights are quantized to 4 bits and frozen, while small low-rank adapters are trained in higher precision. This combo allows training large models on small computing resources while keeping the results highly performant and accurate.
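
Here’s a minimal sketch of what a QLoRA-style setup looks like with the Hugging Face transformers and peft libraries. The model ID and hyperparameters are illustrative, not the project’s actual configuration:

```python
# Minimal QLoRA setup: 4-bit NF4 storage for the frozen base model,
# higher-precision compute, and small trainable low-rank adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in higher precision
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Only the low-rank adapter weights are trained; the 4-bit base stays frozen.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```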

The results we received after fine-tuning the model proved our choice correct: over 85% of the samples fit within the model context, exceeding our initial goal by 5 percentage points.

However, the XML validity metric, indicating the accuracy of document conversion to DITA format, was low, with around 10% precision. Therefore, our next aim was to boost it to at least 70%.

Honing the model’s precision

Due to the issues with XML validity, the algorithm was producing documents in formats similar to DITA but not actually valid DITA. To address this challenge, our team took three steps.

What is the precision rate?
Precision rate = number of relevant documents retrieved / total number of documents retrieved

  • Leveraged a granulation approach to split a text into smaller chunks

Dividing a massive text into shorter pieces and having the model process them one by one is called a recursive tree strategy. We adopted this approach to eliminate the risk of losing segments of large documents.

Here’s a simple scheme comparing the two ways of processing a document.
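
For illustration, here’s a simplified sketch of the chunking idea, with a character budget standing in for the real token limit:

```python
# Recursive splitting: try coarse separators first, fall back to finer ones
# until every chunk fits the model's context window.
SEPARATORS = ["\n\n", "\n", ". ", " "]
MAX_CHARS = 4000  # stand-in for the actual token budget

def recursive_split(text, seps=SEPARATORS):
    if len(text) <= MAX_CHARS:
        return [text]
    if not seps:  # nothing left to split on: hard cut as a last resort
        return [text[i:i + MAX_CHARS] for i in range(0, len(text), MAX_CHARS)]
    sep, finer = seps[0], seps[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if len(piece) > MAX_CHARS:            # piece alone is too big: recurse
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, finer))
        elif len(current) + len(sep) + len(piece) <= MAX_CHARS:
            current = f"{current}{sep}{piece}" if current else piece
        else:                                 # budget reached: start a new chunk
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```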

  • Created an automated agent loop to fix issues in the LLM’s output

After running initial training on the client’s data, we gathered and analyzed the most frequent output errors of our model. These flaws could have been fixed with manual prompting to ensure a 100% precision rate, but this option would have required additional investment for each percent of precision and didn’t fit the project’s tight timeline.

Therefore, we adopted a quicker way to solve this problem and crafted an automated agent loop with prompts for correcting common errors before exposing conversion results to end users.
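
In rough pseudocode terms, the loop looks like this; llm_convert and find_common_errors are hypothetical placeholders for the model call and the error heuristics, not the product’s real API:

```python
# Agent loop: convert, detect known error patterns, feed them back to the
# model as a correction prompt, and repeat a bounded number of times.
# `llm_convert` and `find_common_errors` are hypothetical placeholders.
MAX_ROUNDS = 3

def convert_with_repair(html_chunk: str) -> str:
    dita = llm_convert(f"Convert this HTML fragment to a DITA topic:\n{html_chunk}")
    for _ in range(MAX_ROUNDS):
        errors = find_common_errors(dita)  # e.g. unclosed tags, missing DOCTYPE
        if not errors:
            break
        dita = llm_convert(
            "Fix the following issues in this DITA document and return "
            f"only the corrected XML:\n{errors}\n\n{dita}"
        )
    return dita
```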

  • Implemented an open-source DITA validity checker to spot format mistakes in the output files

The last step of internal checks included automated verification of the results against publicly available DITA resources to wipe out format errors. These findings then serve as the basis for an additional prompt for error correction.
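
For illustration, such a check could be wired up with lxml and the publicly available DITA DTDs; this is a sketch of ours, since the case study doesn’t name the exact open-source checker used:

```python
# Validate an output file against a DITA DTD; any errors feed the next
# correction prompt in the agent loop. The DTD path is a placeholder for
# the publicly available DITA topic DTD.
from lxml import etree

def validate_dita(xml_text: str, dtd_path: str = "dita/topic.dtd") -> list[str]:
    try:
        doc = etree.fromstring(xml_text.encode("utf-8"))
    except etree.XMLSyntaxError as err:
        return [f"not well-formed: {err}"]
    dtd = etree.DTD(dtd_path)
    if dtd.validate(doc):
        return []              # valid DITA topic
    return [str(e) for e in dtd.error_log.filter_from_errors()]
```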

All these actions improved the precision of document conversion to the DITA format 7.5 times, from around 10% to 75%.

Here’s the final scheme of the whole HTML to DITA conversion process. The *instinctools team will keep perfecting the LLM’s outputs at the post-MVP stage to push precision and accuracy above 90% without spikes in memory consumption.
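
Read together, the sketches above add up to the following overall flow (again, our illustration rather than the production code):

```python
# End-to-end flow: chunk the HTML, convert each chunk with the self-repairing
# agent loop, then run the final validity check before returning the topics.
# Reuses recursive_split, convert_with_repair, validate_dita, and the
# hypothetical llm_convert from the earlier sketches.
def html_to_dita(html_doc: str) -> list[str]:
    topics = []
    for chunk in recursive_split(html_doc):
        dita = convert_with_repair(chunk)   # LLM conversion + agent loop
        errors = validate_dita(dita)        # final format check
        if errors:                          # one last correction attempt
            dita = llm_convert(f"Fix these DITA errors:\n{errors}\n\n{dita}")
        topics.append(dita)
    return topics
```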

Before

  • Lack of a user-friendly and accurate tool for content preparation before CCMS adoption

  • Slow-moving pre-CCMS phase due to handling each client’s conversion request individually

  • Overloading the staff with repetitive tasks that could have been automated

After

  • Intuitive tool for intelligent content conversion at the pre-CCMS stage
  • Fast-tracking the preparation stage by covering common conversion cases with the new software
  • Freeing up employees’ time for tricky challenges that call for a human touch

Business value

  • Stable MVP was rolled out in six weeks, a week ahead of the initial timeline
  • New enterprise-grade SaaS product for AI-driven content conversion empowers the client to strengthen their market position
  • 85% conversion accuracy rate with the prospect of further improvement
  • 75% precision rate with room for improvement
  • Low memory consumption that allows the client to stay within their current hardware park
  • Limitless evolution potential covering NLP-related functions

In our client’s words

Here’s how the client’s CTO describes the cooperation with *instinctools:

We’ve been raising our internal expertise in AI over the last two years, but *instinctools’ battle-tested knowledge turned out to be the missing puzzle piece that enabled us to finally step into the artificial intelligence era. We continue working on the post-MVP stage and expect the *instinctools team to keep supporting us in creating unprecedented value for our customers.

Multiplier effect

The LLM market keeps growing and getting more diverse. And as business owners look into this variety, they realize that they don’t necessarily have to opt for the much-talked-about ChatGPT and other well-known, garden-variety options.

Smaller LLMs democratize AI, enabling companies of all sizes to explore the power of artificial intelligence and language models. High-level performance is already up for grabs without significant infrastructure investments.

Besides bringing numerous benefits when implemented within SaaS products, smaller LLMs can be used by companies from various industries to enhance and accelerate their internal processes. For instance, you can craft a private gen AI-driven chatbot to simplify and speed up knowledge retrieval across the company’s data assets.

It’s time to pump up your gen AI muscle. Step confidently into the future with a reliable tech ally by your side.

Do you have a similar project idea?

Anna Vasilevskaya, Account Executive

Get in touch

Drop us a line about your project at contact@instinctools.com or via the contact form below, and we will contact you soon.