Home Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation

1Peking University, 2Microsoft, 3Nanjing University, 4Tsinghua University

Abstract

Recent advances in retrieval-augmented generation have significantly improved the performance of question-answering systems, particularly on factoid '5Ws' questions. However, these systems still face substantial challenges when addressing '1H' questions, specifically how-to questions, which are integral to decision-making processes and require dynamic, step-by-step answers. The key limitation lies in the prevalent data organization paradigm, chunk, which divides documents into fixed-size segments, and disrupts the logical coherence and connections within the context. To overcome this, in this paper, we propose Thread, a novel data organization paradigm aimed at enabling current systems to handle how-to questions more effectively. Specifically, we introduce a new knowledge granularity, termed 'logic unit', where documents are transformed into more structured and loosely interconnected logic units with large language models. Extensive experiments conducted across both open-domain and industrial settings demonstrate that Thread outperforms existing paradigms significantly, improving the success rate of handling how-to questions by 21% to 33%. Moreover, Thread exhibits high adaptability in processing various document formats, drastically reducing the candidate quantity in the knowledge base and minimizing the required information to one-fourth compared with chunk, optimizing both efficiency and effectiveness.

How-to Questions

How-to questions are inherently more complex, often involving processes that require interpretation and analysis. For example, correctly answering the question below involves a multi-step decision-making process. This dynamic nature requires RAG systems to guide users incrementally through each step, adapting to specific circumstances and providing precise, logical information.

Methodology

Thread: LU-based Knowledge Base. Thread is a knowledge base composed of discrete but interconnected retrieval units called Logic Units (LUs). Unlike the chunk-based paradigm, an LU comprises specially designed components that preserve the internal logic and coherence of documents, in particular bridging consecutive steps when addressing how-to questions. Specifically, each LU consists of a 'Prerequisite', 'Header', 'Body', 'Linker', and 'Meta Data', and is typed as 'Step', 'Terminology', 'FAQ', or 'Appendix'. In practice, documentation is often unstructured and varies in format and style, so our approach converts documentation into Thread through a two-stage process that extracts LUs.
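The LU structure described above can be sketched as a simple data type. The component and type names come from the paper; the field names and the toy entries are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass, field
from enum import Enum

class LUType(Enum):
    STEP = "Step"
    TERMINOLOGY = "Terminology"
    FAQ = "FAQ"
    APPENDIX = "Appendix"

@dataclass
class LogicUnit:
    header: str                       # indexed and matched at retrieval time
    body: str                         # content used to answer the current step
    lu_type: LUType = LUType.STEP
    prerequisite: str = ""            # condition that must hold before this LU applies
    # Linker: maps a user-feedback signal to the header of the next LU
    linkers: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

# A toy knowledge base of two consecutive troubleshooting steps
kb = {
    "restart-service": LogicUnit(
        header="restart-service",
        body="Restart the service and check its status.",
        linkers={"still failing": "check-logs"},
    ),
    "check-logs": LogicUnit(
        header="check-logs",
        body="Inspect the service logs for error messages.",
    ),
}
```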

Integrate Thread with QA System. When answering how-to questions, the system integrated with Thread enables a dynamic interaction manner. First, it retrieves relevant LUs based on their indexed headers. Then, the body of the selected LU provides the necessary content to generate responses for the current step. With user feedback, the linker in LU dynamically connects to other LUs, allowing the system to adapt its responses until the how-to question is comprehensively addressed.
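The retrieve-respond-link loop above can be sketched in a few lines. Here LUs are minimal dict records, and the feedback source stands in for a real user; all names are hypothetical and the retrieval and generation steps are stubbed out:

```python
# Minimal LU records: header -> {"body": ..., "linkers": {feedback: next_header}}
kb = {
    "restart-service": {"body": "Restart the service and check its status.",
                        "linkers": {"still failing": "check-logs"}},
    "check-logs": {"body": "Inspect the service logs for error messages.",
                   "linkers": {}},
}

def answer_how_to(kb, start_header, get_feedback):
    """Answer a how-to question step by step: emit the body of the current LU,
    then follow the linker matching the user's feedback until none matches."""
    steps, current = [], start_header
    while current is not None:
        lu = kb[current]
        steps.append(lu["body"])                       # response for this step
        feedback = get_feedback(lu["body"])            # e.g. "still failing"
        current = lu["linkers"].get(feedback)          # None ends the dialogue
    return steps

# Simulated two-turn interaction
feedback_iter = iter(["still failing", "done"])
plan = answer_how_to(kb, "restart-service", lambda _: next(feedback_iter))
```

In a full system, `start_header` would come from retrieving the LU whose indexed header best matches the question, and each body would be passed through an LLM to generate the user-facing response.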

Experimental Results

We evaluate Thread on two open-domain scenarios: Web Navigation (Mind2web) and Wikipedia Instructions (WikiHow), and one industrial scenario: Incident Mitigation (IcM). More details about datasets can be found in our paper.

Main Results

Mind2web: For Mind2web, we compare our method with baselines and both doc-based and chunk-based RAG methods. We observe that providing informative documents (regardless of the data organization paradigm) within RAG methods significantly improves performance. RAG methods outperform ICL and SL methods. By incorporating external documents, the doc-based RAG method achieves performance comparable to the best results of MINDACT. Notably, our Thread paradigm is the best among RAG methods, showing improvements of 4.06% in Ele. Acc, 3.49% in Step SR and 3.57% in SR.

WikiHow: In the single-turn setting, where the QA system generates the entire plan in one interaction, our Thread-based method outperforms the doc-based method which provides the entire relevant document, indicating that Thread offers information with less redundancy. In the multi-turn setting, the QA system performs iterative retrieval, which shows superior performance compared to single-turn. This indicates the advantage of a step-by-step approach in handling how-to questions. Notably, the SR improves significantly from 58.76% to 68.04% in doc-based paradigm and from 63.91% to 72.16% in Thread paradigm.

Incident Mitigation: Both chunk-based and doc-based RAG methods exhibit comparable ineffectiveness in incident mitigation. In contrast, our Thread paradigm achieves the best score across all metrics, particularly achieving a significant increase in SR from 21.02% to 33.33%. Furthermore, P.F. Step SR indicates our method's capability to handle significantly more steps before encountering a failure, thus reducing the need for human intervention. Consequently, our method achieves the highest Step SR and requires the fewest interaction Turns to mitigate both simple and complex incidents.

Analysis

Data Organization Paradigms: We compare various data organization paradigms within RAG, implementing Semantic Chunking and Proposition in addition to recursive chunking (chunk-based) and entire document (doc-based) approaches. Our Thread paradigm outperforms the other paradigms across all metrics. Although Semantic and Proposition methods use LLMs to merge semantically similar sentences, they fail to adequately address logical connections, resulting in poor performance. Additionally, our system processes the smallest token length of retrieval units yet achieves the highest performance, underscoring our approach's efficiency and effectiveness.

Document Formats in LU Extraction: We tested our LU extraction method on varying document structures, generating different document formats including structured markdown, hierarchical guidelines, tabular checklists, and narrative documents. The results show that our Thread paradigm effectively organizes these formats, consistently outperforming the chunk-based paradigm across all metrics. Thread improves Ele. Acc by up to 9.61%, Op. F1 by up to 8.43%, and Step SR by up to 8.51%. Notably, Thread achieves the highest performance with structured documents, as this format helps construct a higher-quality knowledge base.

Different LLMs as Backbone: We further conduct experiments on LLaMA3-70B, as shown in Figure 4. Although LLaMA3-70B is less powerful than GPT-4, it still demonstrates competitive performance with the help of Thread. Results indicate that LLaMA3 struggles with WikiHow questions when applying the chunk-based or doc-based paradigms. However, with the integration of Thread, LLaMA3 not only achieves much better performance in SR, but also narrows the gap with GPT-4. Specifically, LLaMA3 achieves SR and F1 scores of 60.82% and 72.84%, respectively, compared to GPT-4's scores of 72.16% and 80.74%. This indicates that while a performance gap remains, particularly with the chunk-based and doc-based paradigms, our Thread paradigm considerably reduces the disparity between GPT-4 and LLaMA3. These findings highlight the value of our proposed paradigm in enhancing the performance of different LLMs, showcasing its generalizability, robustness, and efficiency in handling how-to questions.

BibTeX

@article{an2024thread,
    title={Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation},
    author={An, Kaikai and Yang, Fangkai and Li, Liqun and Lu, Junting and Cheng, Sitao and Wang, Lu and Zhao, Pu and Cao, Lele and Lin, Qingwei and Rajmohan, Saravan and others},
    journal={arXiv preprint arXiv:2406.13372},
    year={2024}
}