大模型融入推荐系统

结合项目实际给用户推荐，比如是商家项目，用户问了几个关于商品的信息，大模型就可以根据根据用户画像，给用户推荐商品。

我们现在做的是针对于用户学习的推荐，首先我们要对我们的数据进行处理，提取出我们数据的一些特征

首先这个文件夹下可以放一些文件当做子目录，这些子目录就可以当做一些course

遍历文件，转换为markdown,然后读取里面的标题内存，然后存储到csv文件中。此时还缺少摘要，所以我们用大模型来读取内容从而生成摘要。

def generate_csv_for_pdfs(root_dir):# 搜索指定根目录下的所有PDF文件pdf_files = glob.glob(f'{root_dir}/**/*.pdf', recursive=True)data = []for pdf_file in pdf_files:# 将 PDF 格式转化成 Markdown 格式markdown = convert_pdf_to_markdown(pdf_file)# 根据Markdown的内容结构，提取每一部分的主题内容toc_content = extract_content_by_sections(markdown)# 第一个标题部分是课程的主标题titles = list(toc_content.keys())first_title = titles[0] if titles else ""# 收集第一标题下的二级标题作为子标题first_section_content = toc_content.get(first_title, "")first_section_lines = first_section_content.split('\n')sub_titles = [line.strip() for line in first_section_lines if line.startswith('##')]sub_titles_cleaned = [re.sub(r'^##\s+', '', title) for title in sub_titles]for module_name, content in toc_content.items():# 提取二级标题作为 Tagstags = [line.strip() for line in content.split('\n') if line.startswith('##')]tags = [re.sub(r'^##\s+', '', tag) for tag in tags]  # 清理 '##'# 构建元数据data.append({'ModuleID': str(uuid.uuid4()),'Course': os.path.basename(os.path.dirname(pdf_file)),'Title': sub_titles_cleaned,'URL': os.path.basename(pdf_file),'ModuleName': module_name,'Tags': ", ".join(tags),'Content': content})df = pd.DataFrame(data)csv_file_path = os.path.join(root_dir, 'course_metadata.csv')df.to_csv(csv_file_path, index=False)print(f"CSV file generated: {csv_file_path}")

然后生成摘要

构建文档的画像，执行某些列，把这些列合并组成文档的画像，为一个新的列embedding_info,

这个embeding_info_的list是一个列表

把这些用户画像存入到向量数据库中，执行的是do_add_file方法

系统调用

之前的步骤略

     # 在同一个 model 实例上同时运行两个异步链（LLMChain）可能导致内部状态的混乱,所以为用户画像生成和聊天响应分别实例化模型# 该模型实例用于生成用户画像model_for_profile = get_ChatOpenAI(model_name=model_name,temperature=TEMPERATURE,max_tokens=MAX_TOKENS,)# 该模型实例用于生成聊天响应model_for_chat = get_ChatOpenAI(model_name=model_name,temperature=TEMPERATURE,max_tokens=MAX_TOKENS,callbacks=callbacks,)

如果用户历史对话超过5轮就生成用户画像

生成用户画像，根据用户的历史对话信息，采用这个提示模版生成用户画像

# 生成用户画像: 通过理解`用户历史行为序列`，生成`用户感兴趣的话题`以及`用户位置信息`
user_profile_prompt = """
请你根据历史对话记录：\n\n{chat_history}如上对话历史记录所示，请你分析当前用户的需求，并描述出用户画像，用户画像的格式如下：[Course]
- (Course1)[ModuleName]
- (ModuleName1)其中课程名称 [Course] 请务必从下面的列表中提取出最匹配的：\n["在线大模型课件", "开源大模型课件"]最后，一定要注意，需要严格按照上述格式描述相关的课程名称和课程的知识点，同时，[Course] 和 [ModuleName] 一定要分别处理，你最终输出的结果一定不要输出任何与上述格式无关的内容。
"""

async def generate_user_profile_and_extract_info(chat_messages: List[str], user_profile_prompt: str, model) -> Dict[str, List[str]]:"""异步生成用户画像并从中提取课程和模块信息。:param chat_messages: 聊天历史消息列表:param user_profile_prompt: 用于生成用户画像的提示:param model: 已实例化的模型对象:return: 包含课程和模块名称的字典"""# 创建聊天提示模板prompt_template = ChatPromptTemplate.from_messages([("user", user_profile_prompt),])# 创建LangChain的链user_profile_chain = LLMChain(prompt=prompt_template, llm=model)# 异步生成用户画像user_profile_result = user_profile_chain.invoke({"chat_history": chat_messages})user_profile = user_profile_result["text"]# 定义正则表达式并提取课程与模块信息def extract_course_and_module(text: str) -> Dict[str, List[str]]:course_pattern = r"\[Course\]\s+-\s+(.+)"module_name_pattern = r"\[ModuleName\]\s+-\s+(.+)"courses = re.findall(course_pattern, text)module_names = re.findall(module_name_pattern, text)return {"Course": courses, "ModuleName": module_names}# 提取信息并返回return extract_course_and_module(user_profile)

然后接下来就是根据用户画像去和向量数据库中的内容匹配，如果量很大，可以把信息存储到es中，做倒排索引