Skip to main content

Doctran: interrogate documents

Documents used in a vector store knowledge base are typically stored in a narrative or conversational format. However, most user queries are in question format. If we convert documents into Q&A format before vectorizing them, we can increase the likelihood of retrieving relevant documents, and decrease the likelihood of retrieving irrelevant documents.

We can accomplish this using the Doctran library, which uses OpenAI’s function calling feature to “interrogate” documents.

See this notebook for benchmarks on vector similarity scores for various queries based on raw documents versus interrogated documents.

! pip install doctran
import json

from langchain.document_transformers import DoctranQATransformer
from langchain.schema import Document
from dotenv import load_dotenv

load_dotenv()
True

Input

This is the document we’ll interrogate

sample_text = """[Generated with ChatGPT]

Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures
As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.

HR Updates and Employee Benefits
Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).

Marketing Initiatives and Campaigns
Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.

Research and Development Projects
In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.

Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.

Thank you for your attention, and let's continue to work together to achieve our goals.

Best regards,

Jason Fan
Cofounder & CEO
Psychic
jason@psychic.dev
"""
print(sample_text)
[Generated with ChatGPT]

Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures
As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.

HR Updates and Employee Benefits
Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).

Marketing Initiatives and Campaigns
Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.

Research and Development Projects
In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.

Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.

Thank you for your attention, and let's continue to work together to achieve our goals.

Best regards,

Jason Fan
Cofounder & CEO
Psychic
jason@psychic.dev
documents = [Document(page_content=sample_text)]
qa_transformer = DoctranQATransformer()
transformed_document = await qa_transformer.atransform_documents(documents)

Output

After interrogating a document, the result will be returned as a new document with questions and answers provided in the metadata.

transformed_document = await qa_transformer.atransform_documents(documents)
print(json.dumps(transformed_document[0].metadata, indent=2))
{
"questions_and_answers": [
{
"question": "What is the purpose of this document?",
"answer": "The purpose of this document is to provide important updates and discuss various topics that require the team's attention."
},
{
"question": "Who is responsible for enhancing the network security?",
"answer": "John Doe from the IT department is responsible for enhancing the network security."
},
{
"question": "Where should potential security risks or incidents be reported?",
"answer": "Potential security risks or incidents should be reported to the dedicated team at security@example.com."
},
{
"question": "Who has been recognized for outstanding performance in customer service?",
"answer": "Jane Smith has been recognized for her outstanding performance in customer service."
},
{
"question": "When is the open enrollment period for the employee benefits program?",
"answer": "The document does not specify the exact dates for the open enrollment period for the employee benefits program, but it mentions that it is fast approaching."
},
{
"question": "Who should be contacted for questions or assistance regarding the employee benefits program?",
"answer": "For questions or assistance regarding the employee benefits program, the HR representative, Michael Johnson, should be contacted."
},
{
"question": "Who has been acknowledged for managing the company's social media platforms?",
"answer": "Sarah Thompson has been acknowledged for managing the company's social media platforms."
},
{
"question": "When is the upcoming product launch event?",
"answer": "The upcoming product launch event is on July 15th."
},
{
"question": "Who has been recognized for their contributions to the development of the company's technology?",
"answer": "David Rodriguez has been recognized for his contributions to the development of the company's technology."
},
{
"question": "When is the monthly R&D brainstorming session?",
"answer": "The monthly R&D brainstorming session is scheduled for July 10th."
},
{
"question": "Who should be contacted for questions or concerns regarding the topics discussed in the document?",
"answer": "For questions or concerns regarding the topics discussed in the document, Jason Fan, the Cofounder & CEO, should be contacted."
}
]
}