Spaces:

DjangoPeng
/

GitHubSentinel

Sleeping

File size: 19,736 Bytes

3851ff8

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "690b8105-a04e-4f9a-801a-767d5db93f90",
   "metadata": {},
   "source": [
    "# 进展报告自动生成\n",
    "\n",
    "基于项目文件（GitHubClient）调用大模型（LLM）自动生成项目进展报告。\n",
    "\n",
    "### 调用 OpenAI GPT 大模型\n",
    "\n",
    "相比 GitHub REST API ，OpenAI 提供的大模型相关 API 迭代速度快，且不够稳定。\n",
    "\n",
    "**GPT-4 很难能够准确的生成 OpenAI Client 相关代码**。\n",
    "\n",
    "因此，GitHubSentinel 项目中 LLM 相关调用代码由人类编写😁。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de8f9beb-4de1-4ead-ad2d-4cc1b8692390",
   "metadata": {},
   "source": [
    "## Prompt 优化测试\n",
    "\n",
    "基于 `GithubClient` 模块获取的 Repo 最新进展，先在 ChatGPT 中尝试获取可用的提示词（Prompt）方案。\n",
    "\n",
    "-  **完整的ChatGPT 对话记录【GitHubSentinel 提示词优化】**：https://chatgpt.com/share/28524ea6-2bf3-4ebe-b7d9-9c1ba5f005d2\n",
    "- 以下测试使用的 LangChain 项目文件为: `./daily_progress/langchain-ai_langchain/2024-08-18.md'`\n",
    "\n",
    "\n",
    "### ChatGPT（GPT-4） 生成报告\n",
    "\n",
    "**Langchain-AI/Langchain Daily Progress Report - 2024-08-18**\n",
    "\n",
    "### 新增功能\n",
    "1. **Langchain 模块添加**\n",
    "   - 新增了Langchain Box套件及其文档加载器 (`langchain-box: add langchain box package and DocumentLoader` #25506)\n",
    "   - 加入了新的社区提供者—Agentic RAG 示例 (`Community: Add Union provider - Agentic RAG example` #25509)\n",
    "   - 引入了更多的异步测试标准 (`standard-tests[patch]: async variations of all tests` #25501)\n",
    "   - 引入了对多种区块链的支持 (`community: add supported blockchains to Blockchain Document Loader` #25428)\n",
    "   \n",
    "2. **文档与API更新**\n",
    "   - 更新了多个集成参考文档和Langchain版本的文档 (`docs: `integrations` reference update 9` #25511, `docs 0.3 release` #25459)\n",
    "   - 增加了新的文档索引和数据加载方式的说明 (`[docs]: more indexing of document loaders` #25500)\n",
    "\n",
    "### 主要改进\n",
    "1. **测试与标准化**\n",
    "   - 添加了更多嵌入标准测试 (`more embeddings standard tests` #25513)\n",
    "   - 引入了JSON模式的标准测试 (`json mode standard test` #25497)\n",
    "   - 新增了各种文档加载器的文档 (`[Doc] Add docs for `ZhipuAIEmbeddings`` #25467)\n",
    "\n",
    "2. **框架和规则改进**\n",
    "   - 对Langchain核心模块进行了Pydantic解析器修复 (`langchain-core: added pydantic parser fix for issue #24995` #25516)\n",
    "   - 增加了B(bugbear) ruff规则以提高代码质量 (`core: Add B(bugbear) ruff rules` #25520)\n",
    "\n",
    "3. **集成与兼容性**\n",
    "   - 测试了Pydantic 2和Langchain 0.3的兼容性 (`openai[major] -- test with pydantic 2 and langchain 0.3` #25503)\n",
    "   - 准备了向Pydantic 2迁移的根验证器升级 (`openai[patch]: Upgrade @root_validators in preparation for pydantic 2 migration` #25491)\n",
    "\n",
    "### 修复问题\n",
    "1. **错误修复**\n",
    "   - 修正了文档中的错别字和错误消息 (`docs: Fix typo in openai llm integration notebook` #25492, `docs: fix Agent deprecation msg` #25464)\n",
    "   - 解决了不同的搜索模式（向量与文本）产生不同结果的问题 (`Chroma search with vector and search with text get different result using the same embedding function` #25517)\n",
    "   - 修复了使用AzureSearch vectorstore时的文档ID作为键的问题 (`community : [bugfix] Use document ids as keys in AzureSearch vectorstore` #25486)\n",
    "\n",
    "2. **系统错误与异常处理**\n",
    "   - 解决了在调用特定链时缺少输入键的错误 (`Raises ValueError: Missing some input keys: {'query'} everytime I invoke 'GraphCypherQAChain.from_llm' chain with query present as input keys` #25476)\n",
    "   - 修正了未知类型 'ToolMessage' 的类型错误 (`TypeError: Got unknown type 'ToolMessage'.` #25490)\n",
    "\n",
    "---\n",
    "\n",
    "此简报详细总结了Langchain AI项目在2024年8月18日的最新进展，包括新增功能、主要改进和问题修复，确保团队成员了解最新的项目状态和即将到来的更新。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "554b5949-5982-4034-bce2-b186dedbd445",
   "metadata": {},
   "source": [
    "## 前置依赖 logger 模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f45cc627-7005-4943-92cc-212f7c98f556",
   "metadata": {},
   "outputs": [],
   "source": [
    "# src/logger.py\n",
    "from loguru import logger\n",
    "import sys\n",
    "\n",
    "# Configure Loguru\n",
    "logger.remove()  # Remove the default logger\n",
    "logger.add(sys.stdout, level=\"DEBUG\", format=\"{time} {level} {message}\", colorize=True)\n",
    "logger.add(\"logs/app.log\", rotation=\"1 MB\", level=\"DEBUG\")\n",
    "\n",
    "# Alias the logger for easier import\n",
    "LOG = logger\n",
    "\n",
    "# Make the logger available for import with the alias\n",
    "__all__ = [\"LOG\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb11497f-35fa-4024-bff2-24d4b0b3666c",
   "metadata": {},
   "source": [
    "## LLM Class "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cb4c8e66-41c5-4c60-9345-959f5bf935e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from openai import OpenAI  # 导入OpenAI库用于访问GPT模型\n",
    "# from logger import LOG  # 导入日志模块（演示时直接导入）\n",
    "\n",
    "class LLM:\n",
    "    def __init__(self, model=\"gpt-3.5-turbo\"):\n",
    "        # 创建一个OpenAI客户端实例\n",
    "        self.client = OpenAI()\n",
    "        # 确定使用的模型版本\n",
    "        self.model = model\n",
    "        # 配置日志文件，当文件大小达到1MB时自动轮转，日志级别为DEBUG\n",
    "        LOG.add(\"daily_progress/llm_logs.log\", rotation=\"1 MB\", level=\"DEBUG\")\n",
    "\n",
    "    def generate_daily_report(self, markdown_content, dry_run=False):\n",
    "        # 构建一个用于生成报告的提示文本，要求生成的报告包含新增功能、主要改进和问题修复\n",
    "        prompt = f\"以下是项目的最新进展，根据功能合并同类项，形成一份简报，至少包含：1）新增功能；2）主要改进；3）修复问题；:\\n\\n{markdown_content}\"\n",
    "        \n",
    "        if dry_run:\n",
    "            # 如果启用了dry_run模式，将不会调用模型，而是将提示信息保存到文件中\n",
    "            LOG.info(\"Dry run mode enabled. Saving prompt to file.\")\n",
    "            with open(\"daily_progress/prompt.txt\", \"w+\") as f:\n",
    "                f.write(prompt)\n",
    "            LOG.debug(\"Prompt saved to daily_progress/prompt.txt\")\n",
    "            return \"DRY RUN\"\n",
    "\n",
    "        # 日志记录开始生成报告\n",
    "        LOG.info(f\"Starting report generation using model: {self.model}.\")\n",
    "        \n",
    "        try:\n",
    "            # 调用OpenAI GPT模型生成报告\n",
    "            response = self.client.chat.completions.create(\n",
    "                model=self.model,  # 指定使用的模型版本\n",
    "                messages=[\n",
    "                    {\"role\": \"user\", \"content\": prompt}  # 提交用户角色的消息\n",
    "                ]\n",
    "            )\n",
    "            LOG.debug(f\"{self.model} response: {response}\")\n",
    "            # 返回模型生成的内容\n",
    "            return response.choices[0].message.content\n",
    "        except Exception as e:\n",
    "            # 如果在请求过程中出现异常，记录错误并抛出\n",
    "            LOG.error(\"An error occurred while generating the report: {}\", e)\n",
    "            raise\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "759d2f6d-d2a5-4524-be4f-7e94c53f07e7",
   "metadata": {},
   "source": [
    "## ReportGenerator Class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "652363c5-b211-48a9-93ae-54268e7ee13d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# src/report_generator.py\n",
    "\n",
    "import os\n",
    "from datetime import date, timedelta\n",
    "# from logger import LOG  # 导入日志模块，用于记录日志信息\n",
    "\n",
    "class ReportGenerator:\n",
    "    def __init__(self, llm):\n",
    "        self.llm = llm  # 初始化时接受一个LLM实例，用于后续生成报告\n",
    "\n",
    "    def export_daily_progress(self, repo, updates):\n",
    "        # 构建仓库的日志文件目录\n",
    "        repo_dir = os.path.join('daily_progress', repo.replace(\"/\", \"_\"))\n",
    "        os.makedirs(repo_dir, exist_ok=True)  # 如果目录不存在则创建\n",
    "        \n",
    "        # 创建并写入日常进展的Markdown文件\n",
    "        file_path = os.path.join(repo_dir, f'{date.today()}.md')\n",
    "        with open(file_path, 'w') as file:\n",
    "            file.write(f\"# Daily Progress for {repo} ({date.today()})\\n\\n\")\n",
    "            file.write(\"\\n## Issues\\n\")\n",
    "            for issue in updates['issues']:\n",
    "                file.write(f\"- {issue['title']} #{issue['number']}\\n\")\n",
    "            file.write(\"\\n## Pull Requests\\n\")\n",
    "            for pr in updates['pull_requests']:\n",
    "                file.write(f\"- {pr['title']} #{pr['number']}\\n\")\n",
    "        return file_path\n",
    "\n",
    "    def export_progress_by_date_range(self, repo, updates, days):\n",
    "        # 构建目录并写入特定日期范围的进展Markdown文件\n",
    "        repo_dir = os.path.join('daily_progress', repo.replace(\"/\", \"_\"))\n",
    "        os.makedirs(repo_dir, exist_ok=True)\n",
    "\n",
    "        today = date.today()\n",
    "        since = today - timedelta(days=days)  # 计算起始日期\n",
    "        \n",
    "        date_str = f\"{since}_to_{today}\"  # 格式化日期范围字符串\n",
    "        file_path = os.path.join(repo_dir, f'{date_str}.md')\n",
    "        \n",
    "        with open(file_path, 'w') as file:\n",
    "            file.write(f\"# Progress for {repo} ({since} to {today})\\n\\n\")\n",
    "            file.write(\"\\n## Issues Closed in the Last {days} Days\\n\")\n",
    "            for issue in updates['issues']:\n",
    "                file.write(f\"- {issue['title']} #{issue['number']}\\n\")\n",
    "            file.write(\"\\n## Pull Requests Merged in the Last {days} Days\\n\")\n",
    "            for pr in updates['pull_requests']:\n",
    "                file.write(f\"- {pr['title']} #{pr['number']}\\n\")\n",
    "        \n",
    "        LOG.info(f\"Exported time-range progress to {file_path}\")  # 记录导出日志\n",
    "        return file_path\n",
    "\n",
    "    def generate_daily_report(self, markdown_file_path):\n",
    "        # 读取Markdown文件并使用LLM生成日报\n",
    "        with open(markdown_file_path, 'r') as file:\n",
    "            markdown_content = file.read()\n",
    "\n",
    "        report = self.llm.generate_daily_report(markdown_content)  # 调用LLM生成报告\n",
    "\n",
    "        report_file_path = os.path.splitext(markdown_file_path)[0] + \"_report.md\"\n",
    "        with open(report_file_path, 'w+') as report_file:\n",
    "            report_file.write(report)  # 写入生成的报告\n",
    "\n",
    "        LOG.info(f\"Generated report saved to {report_file_path}\")  # 记录生成报告日志\n",
    "\n",
    "    def generate_report_by_date_range(self, markdown_file_path, days):\n",
    "        # 生成特定日期范围的报告，流程与日报生成类似\n",
    "        with open(markdown_file_path, 'r') as file:\n",
    "            markdown_content = file.read()\n",
    "\n",
    "        report = self.llm.generate_daily_report(markdown_content)\n",
    "\n",
    "        report_file_path = os.path.splitext(markdown_file_path)[0] + f\"_report.md\"\n",
    "        with open(report_file_path, 'w+') as report_file:\n",
    "            report_file.write(report)\n",
    "\n",
    "        LOG.info(f\"Generated report saved to {report_file_path}\")  # 记录生成报告日志\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e08d32e6-93d4-4593-a54a-51b8dffbc0e2",
   "metadata": {},
   "source": [
    "## 调用 GPT-3.5-Turbo API 生成报告"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "577d50a5-daf4-4d7d-adf4-6932af517037",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 实例化 LLM，并使用默认的 GPT-3.5-Turbo 模型\n",
    "llm = LLM()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "06114d2e-8a32-48f9-bcb8-4bb8e5460f3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 实例化 ReportGenerator\n",
    "report_generator = ReportGenerator(llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "fdb2b520-6a2f-4898-94cc-55e85d58fc3c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-08-18T13:40:30.318548+0000 INFO Starting report generation using model: gpt-3.5-turbo.\n",
      "2024-08-18T13:40:32.095863+0000 DEBUG gpt-3.5-turbo response: ChatCompletion(id='chatcmpl-9xaQwudqfac0s4l6UVu9qt85SDu4S', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='## Daily Progress for langchain-ai/langchain (2024-08-18)\\n\\n### New Features\\n- Added embeddings integration tests\\n- Added langchain box package and DocumentLoader\\n\\n### Major Improvements\\n- Upgraded various components for compatibility with pydantic 2\\n\\n### Issue Fixes\\n- Fixed divide by 0 error in experimental feature\\n- Updated connection string in Azure Cosmos integration test\\n- Fixed various typos in documentation\\n- Fixed mimetype parser docstring\\n- Fixed API key checking code\\n\\nThis summarizes the latest updates and improvements made in the project.', role='assistant', function_call=None, tool_calls=None, refusal=None))], created=1723988430, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=116, prompt_tokens=619, total_tokens=735))\n",
      "2024-08-18T13:40:32.097204+0000 INFO Generated report saved to daily_progress/langchain-ai_langchain/2024-08-18_report.md\n"
     ]
    }
   ],
   "source": [
    "# 生成 LangChain 项目最近一日报告\n",
    "report_generator.generate_daily_report(\n",
    "    markdown_file_path=\"daily_progress/langchain-ai_langchain/2024-08-18.md\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "baa6d7c4-56c2-42f7-afc6-0b166e634ced",
   "metadata": {},
   "source": [
    "## 调用 GPT-4-Turbo API 生成报告"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "49620292-834f-465d-8ba5-e3bbf0045ad0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 实例化 LLM，并使用指定的 GPT-4-Turbo 模型\n",
    "gpt_4 = LLM(model=\"gpt-4-turbo\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "12429122-8bde-406f-9b9f-ec671871ab8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 实例化 ReportGenerator, 并使用指定的 GPT-4-Turbo 模型\n",
    "rg_gpt_4 = ReportGenerator(gpt_4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "3aff72bc-9324-4553-9fb9-5948b8715cd8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-08-18T13:42:25.959286+0000 INFO Starting report generation using model: gpt-4-turbo.\n",
      "2024-08-18T13:42:42.696473+0000 DEBUG gpt-4-turbo response: ChatCompletion(id='chatcmpl-9xaSosidtJbk0iJCW8fDn3P6dSraZ', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='# 简报：langchain-ai/langchain 项目进展（2024-08-18）\\n\\n## 1. 新增功能\\n- **langchain-box 包和 DocumentLoader 的加入**：新增了 `langchain-box` 包，以及其中的 `DocumentLoader` 组件，增强了文档处理功能。\\n\\n## 2. 主要改进\\n- **依赖和核心组件升级**：\\n  - 升级多个模块以适配 Pydantic 2，涉及 `@root_validator` 的使用，改进了代码的一致性和稳定性，包括但不限于 modules: core, openai, voyageai, ai21, pinecone, mistralai, fireworks, together。\\n  - 发布新版本：`core[patch]: Release 0.2.33` 和 `openai[patch]: Release 0.1.22`。\\n \\n## 3. 修复问题\\n- **文档和示例代码修复**：\\n  - 修正了多个文档中存在的问题，例如 Databricks Vector Search 演示笔记本的修复、`openai` 集成笔记本错误的调用方式（`.invoke` 替换 `__call__`）、API 引用链接修复等。\\n  - 更新了安装文档，添加了关于安装 `nltk` 和 `beautifulsoup4` 的提示。\\n- **功能性错误修正**：\\n  - 修正了一个在 `experimental` 模块中的除以零错误。\\n- **代码整合改动**：\\n  - `ContextualCompressionRetriever` 中的 `_DocumentWithState` 类被替换为更简洁的 `Document` 类，提高了代码的可读性与维护性。\\n\\n这份简报总结了项目在8月18日的关键进展，包括新功能的添加、主要组件和依赖的升级改进，以及多项问题的修复和文档的完善。持续关注项目更新可以确保团队成员和用户了解最新的产品动态和优化信息。', role='assistant', function_call=None, tool_calls=None, refusal=None))], created=1723988546, model='gpt-4-turbo-2024-04-09', object='chat.completion', system_fingerprint='fp_c77e07d4ef', usage=CompletionUsage(completion_tokens=521, prompt_tokens=619, total_tokens=1140))\n",
      "2024-08-18T13:42:42.698040+0000 INFO Generated report saved to daily_progress/langchain-ai_langchain/2024-08-18_report.md\n"
     ]
    }
   ],
   "source": [
    "# 生成 LangChain 项目最近一日报告\n",
    "rg_gpt_4.generate_daily_report(\n",
    "    markdown_file_path=\"daily_progress/langchain-ai_langchain/2024-08-18.md\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7492030c-449b-4d57-9503-4a18fa669438",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cd2cfd8-1945-4870-b2db-f53f7e408144",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "cae5aea5-7df9-4c0d-97a8-1d80cef0cf0f",
   "metadata": {},
   "source": [
    "### Homework: 与ChatGPT深度对话，尝试使用 System role 提升报告质量和稳定性"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbb42712-590f-45fc-b156-e902a3745e78",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}