LangChain Document Transformers: Text Splitters
Contents

- 一、Document transformers / Text splitters
- 二、Get started with text splitters
- 三、Split by character
- 四、Split code: 1、Python 2、JS 3、Markdown 4、Latex 5、HTML 6、Solidity
- 五、MarkdownHeaderTextSplitter: 1、Motivation 2、Use case
- 六、Recursively split by character
- 七、Split by tokens: 1、tiktoken 2、spaCy 3、SentenceTransformers 4、NLTK 5、Hugging Face tokenizer

This article is reprinted and adapted from https://python.langchain.com.cn/docs/modules/data_connection/document_transformers/

一、Document transformers / Text splitters

Once you have loaded documents, you will often want to transform them to better suit your application. The simplest example is splitting a long document into smaller chunks that fit your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

Text splitters

When you want to work with long pieces of text, it is necessary to split the text into chunks. As simple as that sounds, there is a lot of potential complexity here. Ideally you want to keep semantically related pieces of text together, and what "semantically related" means depends on the type of text. This notebook showcases several approaches.

At a high level, text splitters work as follows:

1. Split the text into small, semantically meaningful chunks (often sentences).
2. Combine these small chunks into a larger chunk until a certain size is reached (as measured by some function).
3. Once that size is reached, emit the chunk as its own piece of text, then start a new chunk with some overlap so that context is kept between chunks.

This means there are two different axes along which you can customize a text splitter:

- how the text is split
- how the chunk size is measured

二、Get started with text splitters

The default, recommended text splitter is RecursiveCharacterTextSplitter. This splitter takes a list of characters. It tries to create chunks by splitting on the first character, but if any chunk is still too large it moves on to the next character, and so on. By default the characters it tries to split on are `["\n\n", "\n", " ", ""]`.

Besides controlling which characters it splits on, you can also control a few other things:

- length_function: how the length of a chunk is computed. Defaults to counting characters, but a token counter is commonly passed here.
- chunk_size: the maximum size of a chunk, as measured by the length function.
- chunk_overlap: the maximum overlap between chunks. A little overlap keeps some continuity between chunks (think of a sliding window).
- add_start_index: whether to include each chunk's start position within the original document in its metadata.

Load a long text:

```python
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    add_start_index=True,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
```

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' metadata={'start_index': 0}
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' metadata={'start_index': 82}
```

三、Split by character

This is the simplest method. It splits on a character (default "\n\n") and measures chunk length by number of characters.

- How the text is split: by a single character.
- How the chunk size is measured: by number of characters.

```python
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
```

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. ...He met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={} lookup_index=0
```

The following example passes metadata along with the documents. Note that the metadata is split together with the documents.

```python
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents(
    [state_of_the_union, state_of_the_union], metadatas=metadatas
)
print(documents[0])
```

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. ...From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' lookup_str='' metadata={'document': 1} lookup_index=0
```

```python
text_splitter.split_text(state_of_the_union)[0]
```

```
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. ...From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
```
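The length_function parameter mentioned above is the second customization axis (how chunk size is measured). As a minimal sketch, not part of the original article, the hypothetical word_count function below measures chunks in whitespace-separated words instead of characters:

```python
from langchain.text_splitter import CharacterTextSplitter

# Hypothetical illustration: measure chunk size in whitespace-separated words
# rather than characters by passing a custom length_function.
def word_count(text: str) -> int:
    return len(text.split())

word_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=150,      # at most roughly 150 words per chunk
    chunk_overlap=20,    # roughly 20 words of overlap
    length_function=word_count,
)

word_chunks = word_splitter.split_text(state_of_the_union)
```

Section 七 below shows the same idea with real tokenizers (tiktoken, spaCy, Hugging Face), which is usually what you want when the downstream consumer is an LLM with a token limit.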
四、Split code

CodeTextSplitter lets you split code written in a number of programming languages. Import the Language enum and specify the language.

```python
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

# Full list of supported languages
[e.value for e in Language]
```

```
['cpp', 'go', 'java', 'js', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'swift', 'markdown', 'latex', 'html', 'sol']
```

Given a programming language, you can also see the separators used for that language:

```python
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
```

```
['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
```

1、PythonTextSplitter

Here is an example using the Python text splitter:

```python
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
```

```
[Document(page_content='def hello_world():\n    print("Hello, World!")', metadata={}),
 Document(page_content='# Call the function\nhello_world()', metadata={})]
```

2、JS

Here is an example using the JS text splitter:

```python
JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs
```

```
[Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}', metadata={}),
 Document(page_content='// Call the function\nhelloWorld();', metadata={})]
```

3、Markdown

Here is an example using the Markdown text splitter:

````python
markdown_text = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
pip install langchain
```

As an open source project in a rapidly developing field, we are extremely open to contributions.
"""

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
md_docs = md_splitter.create_documents([markdown_text])
md_docs
````

```
[Document(page_content='# 🦜️🔗 LangChain', metadata={}),
 Document(page_content='⚡ Building applications with LLMs through composability ⚡', metadata={}),
 ...
 Document(page_content='are extremely open to contributions.', metadata={})]
```

4、Latex

Here is an example using LaTeX text:

```python
latex_text = """
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""

latex_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.LATEX, chunk_size=60, chunk_overlap=0
)
latex_docs = latex_splitter.create_documents([latex_text])
latex_docs
```

```
[Document(page_content='\\documentclass{article}\n\n\x08egin{document}\n\n\\maketitle', metadata={}),
 Document(page_content='\\section{Introduction}', metadata={}),
 Document(page_content='Large language models (LLMs) are a type of machine learning', metadata={}),
 ...
 Document(page_content='psychology, and computational linguistics.', metadata={}),
 Document(page_content='\\end{document}', metadata={})]
```

(Because latex_text is not a raw string, Python interprets the \b in \begin as a backspace character, which is why it appears as \x08egin in the output.)
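from_language is a convenience constructor. As shown with get_separators_for_language above, you can get the same behavior by passing the separator list explicitly, which also lets you tweak it. A minimal sketch, assuming the splitter accepts a separators keyword argument (it is not shown in the original article):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

# Assumed to be equivalent to RecursiveCharacterTextSplitter.from_language(...):
# fetch the language-specific separators and pass them in explicitly,
# so you can add, remove, or reorder separators if needed.
python_separators = RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

custom_python_splitter = RecursiveCharacterTextSplitter(
    separators=python_separators,
    chunk_size=50,
    chunk_overlap=0,
)
```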
5、HTML

Here is an example using the HTML text splitter:

```python
html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)
html_docs = html_splitter.create_documents([html_text])
html_docs
```

```
[Document(page_content='<!DOCTYPE html>\n<html>\n    <head>', metadata={}),
 Document(page_content='<title>🦜️🔗 LangChain</title>\n        <style>', metadata={}),
 Document(page_content='body {', metadata={}),
 Document(page_content='font-family: Arial, sans-serif;', metadata={}),
 Document(page_content='}\n            h1 {', metadata={}),
 Document(page_content='color: darkblue;\n            }', metadata={}),
 Document(page_content='</style>\n    </head>\n    <body>\n        <div>', metadata={}),
 Document(page_content='<h1>🦜️🔗 LangChain</h1>', metadata={}),
 Document(page_content='<p>⚡ Building applications with LLMs through', metadata={}),
 Document(page_content='composability ⚡</p>', metadata={}),
 Document(page_content='</div>\n        <div>', metadata={}),
 Document(page_content='As an open source project in a rapidly', metadata={}),
 Document(page_content='developing field, we are extremely open to contributions.', metadata={}),
 Document(page_content='</div>\n    </body>\n</html>', metadata={})]
```

6、Solidity

Here is an example using the Solidity text splitter:

```python
SOL_CODE = """
pragma solidity ^0.8.20;
contract HelloWorld {
    function add(uint a, uint b) pure public returns(uint) {
        return a + b;
    }
}
"""

sol_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.SOL, chunk_size=128, chunk_overlap=0
)
sol_docs = sol_splitter.create_documents([SOL_CODE])
sol_docs
```

```
[Document(page_content='pragma solidity ^0.8.20;', metadata={}),
 Document(page_content='contract HelloWorld {\n    function add(uint a, uint b) pure public returns(uint) {\n        return a + b;\n    }\n}', metadata={})]
```

五、MarkdownHeaderTextSplitter

1、Motivation

Many chat or Q&A applications split input documents into chunks before embedding them and storing them in a vector store. These notes from Pinecone provide some useful hints: when a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can produce a more comprehensive vector representation that captures the broader meaning and themes of the text.

As noted above, chunking usually aims to keep text with a common context together. In that case, we may want to specifically honor the structure of the document itself. For example, a Markdown file is organized by headers, and creating chunks within particular header groups is an intuitive idea. To address this challenge, we can use MarkdownHeaderTextSplitter, which splits a Markdown file by a specified set of headers.

For example, if we want to split this Markdown:

```python
md = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'
```

We can specify the headers to split on:

```python
[("#", "Header 1"), ("##", "Header 2")]
```

The content is then grouped and split by those common headers:

```
{'content': 'Hi this is Jim  \nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}
{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}
```

Let's look at some examples below.

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = '# Foo\n\n ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly'

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
for split in md_header_splits:
    print(split)
```

```
{'content': 'Hi this is Jim  \nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}
{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}
{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}
```
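In this version of LangChain, split_text here returns plain dicts with content and metadata keys rather than Document objects. If a downstream component expects Documents, here is a minimal sketch of the conversion (the langchain.schema.Document import is an assumption, it is not used in the original article):

```python
from langchain.schema import Document

# Wrap each {'content': ..., 'metadata': ...} dict returned by
# MarkdownHeaderTextSplitter into a LangChain Document object.
header_docs = [
    Document(page_content=split["content"], metadata=split["metadata"])
    for split in md_header_splits
]
```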
Within each markdown group we can then apply whatever text splitter we need.

```python
markdown_document = '# Intro \n\n ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages.'

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Char-level splits
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 10
chunk_overlap = 0
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# Split within each header group
all_splits = []
all_metadatas = []
for header_group in md_header_splits:
    _splits = text_splitter.split_text(header_group["content"])
    _metadatas = [header_group["metadata"] for _ in _splits]
    all_splits += _splits
    all_metadatas += _metadatas

all_splits[0]     # -> 'Markdown[9'
all_metadatas[0]  # -> {'Header 1': 'Intro', 'Header 2': 'History'}
```

2、Use case

As a test, let's apply MarkdownHeaderTextSplitter to a Notion page. For details, see https://rlancemartin.notion.site/Auto-Evaluation-of-Metadata-Filtering-18502448c85240828f33716740f9574b. The page was downloaded as Markdown and saved locally.

```python
# Load Notion database as a markdown file
from langchain.document_loaders import NotionDirectoryLoader

loader = NotionDirectoryLoader("../Notion_DB_Metadata")
docs = loader.load()
md_file = docs[0].page_content

# Let's create groups based on the section headers
headers_to_split_on = [
    ("###", "Section"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(md_file)
md_header_splits[3]
```

```
{'content': "We previously introduced [auto-evaluator](https://blog.langchain.dev/auto-evaluator-opportunities/), an open-source tool for grading LLM question-answer chains. Here, we extend auto-evaluator with a [lightweight Streamlit app](https://github.com/langchain-ai/auto-evaluator/tree/main/streamlit) that can connect to any existing Pinecone index. We add the ability to test metadata filtering using SelfQueryRetriever as well as some other approaches that we've found to be useful, as discussed below. \n[ret_trim.mov](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/ret_trim.mov)",
 'metadata': {'Section': 'Evaluation'}}
```
Now we split the text within each header group and keep each group's header metadata alongside its splits.

```python
# Define our text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 500
chunk_overlap = 50
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

# Create splits within each header group
all_splits = []
all_metadatas = []
for header_group in md_header_splits:
    _splits = text_splitter.split_text(header_group["content"])
    _metadatas = [header_group["metadata"] for _ in _splits]
    all_splits += _splits
    all_metadatas += _metadatas

all_splits[6]
```

```
'In these cases, semantic search will look for the concept episode 53 in the chunks, but instead we simply want to filter the chunks for episode 53 and then perform semantic search to extract those that best summarize the episode. Metadata filtering does this, so long as we 1) we have a metadata filter for episode number and 2) we can extract the value from the query (e.g., 54 or 252) that we want to extract. The LangChain SelfQueryRetriever does the latter (see'
```

```python
all_metadatas[6]
```

```
{'Section': 'Motivation'}
```

This gives us the ability to do metadata filtering based on the document structure. Let's bring it all together by first building a vector store.

```python
! pip install chromadb
```

```python
# Build vectorstore
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(
    texts=all_splits,
    metadatas=all_metadatas,
    embedding=OpenAIEmbeddings(),
)
```

Next we create a SelfQueryRetriever that can filter on the metadata we defined.

```python
# Create retriever
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Define our metadata
metadata_field_info = [
    AttributeInfo(
        name="Section",
        description="Headers of the markdown document that organize the ideas",
        type="string or list[string]",
    ),
]
document_content_description = "Headers of the markdown document"

# Define self query retriever
llm = OpenAI(temperature=0)
sq_retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
```

Now we can fetch chunks from any section of the document.

```python
# Test
question = "Summarize the Introduction section of the document"
sq_retriever.get_relevant_documents(question)
```

```
query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None

[Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled.png)', metadata={'Section': 'Introduction'}),
 Document(page_content='QA systems often use a two-step approach: retrieve relevant text chunks and then synthesize them into an answer. ... Metadata filtering is an alternative approach that pre-filters chunks based on a user-defined criteria in a VectorDB using', metadata={'Section': 'Introduction'}),
 Document(page_content='on a user-defined criteria in a VectorDB using metadata tags prior to semantic search.', metadata={'Section': 'Introduction'})]
```

Now we can build a chat or Q&A application that is aware of the document structure. Of course, semantic search without metadata filtering would probably work reasonably well for this simple document. But for more complex or longer documents, the ability to retain the document structure for metadata filtering can be helpful.

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=sq_retriever)
qa_chain.run(question)
```

```
query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None

'The document discusses different approaches to retrieve relevant text chunks and synthesize them into an answer in QA systems. ... The Retriever-Less option, which uses the Anthropic 100k context window model, is also mentioned as an alternative approach.'
```

```python
question = "Summarize the Testing section of the document"
qa_chain.run(question)
```

```
query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None

'The Testing section of the document describes how the performance of the SelfQueryRetriever was evaluated using various test cases. ... Additionally, the document mentions the use of the Kor library for structured data extraction to explicitly specify transformations that the auto-evaluator can use.'
```
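As an aside on the metadata-filtering point above: when you already know which section you want, you do not strictly need an LLM-backed SelfQueryRetriever. Here is a minimal sketch of a plain filtered similarity search against the same Chroma store (the filter keyword argument is an assumption here, it is not used in the original article):

```python
# Assumed API: Chroma.similarity_search accepts a metadata filter dict,
# so we pre-filter to the 'Introduction' section and then rank by similarity.
intro_docs = vectorstore.similarity_search(
    "How does metadata filtering work?",
    k=3,
    filter={"Section": "Introduction"},
)
```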
六、Recursively split by character

This text splitter is the recommended splitter for generic text. It is parameterized by a list of characters, and it tries them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. The effect is to keep all paragraphs (and then sentences, and then words) together as long as possible, since those are generally the most strongly semantically related pieces of text.

- How the text is split: by the list of characters.
- How the chunk size is measured: by number of characters.

```python
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
```

```
page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' lookup_str='' metadata={} lookup_index=0
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' lookup_str='' metadata={} lookup_index=0
```

```python
text_splitter.split_text(state_of_the_union)[:2]
```

```
['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']
```

七、Split by tokens

Language models have a token limit, which you should not exceed. So when you split text into chunks, it is a good idea to count the number of tokens. There are many tokenizers; when counting tokens in your text, use the same tokenizer that the language model uses.

1、tiktoken

tiktoken is a fast BPE tokenizer created by OpenAI. We can use it to estimate the number of tokens used; it will likely be more accurate for OpenAI models.

- How the text is split: by the character passed in.
- How the chunk size is measured: by the number of tokens counted by the tiktoken tokenizer.

Install tiktoken:

```python
!pip install tiktoken
```

```python
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
texts[0]
```

```
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution.'
```

You can also load a tiktoken splitter directly:

```python
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
```
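To sanity-check chunk sizes against a model's limit, you can also count tokens with tiktoken directly. A minimal sketch, not part of the original article, assuming the cl100k_base encoding as a stand-in for whichever model you target:

```python
import tiktoken

# Count tokens in each chunk with a tiktoken encoding.
# "cl100k_base" is an assumption; pick the encoding that matches your model,
# e.g. via tiktoken.encoding_for_model("gpt-3.5-turbo").
enc = tiktoken.get_encoding("cl100k_base")
token_counts = [len(enc.encode(chunk)) for chunk in texts]
print(max(token_counts))
```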
2、spaCy

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. Another alternative to NLTK is the spaCy tokenizer.

- How the text is split: by the spaCy tokenizer.
- How the chunk size is measured: by number of characters.

```python
!pip install spacy
```

```python
# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
texts[0]
```

```
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\n\nMembers of Congress and the Cabinet. ...From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'
```

3、SentenceTransformers

SentenceTransformersTokenTextSplitter is a text splitter specialized for sentence-transformer models. Its default behavior is to split the text into chunks that fit the token window of the sentence-transformer model you want to use.

```python
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)  # 2

token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# text_to_split does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
```

```
tokens in text to split: 514
```

```python
text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])  # lorem
```

4、NLTK

The Natural Language Toolkit, better known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) of English, written in the Python programming language. Instead of just splitting on "\n\n", we can use NLTK's tokenizers to split the text.

- How the text is split: by the NLTK tokenizer.
- How the chunk size is measured: by number of characters.

```python
# pip install nltk

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
```

```
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
...
Groups of citizens blocking tanks with their bodies.
```

5、Hugging Face tokenizer

Hugging Face has many tokenizers. Here we use the Hugging Face tokenizer GPT2TokenizerFast to count the text length in tokens.

- How the text is split: by the character passed in.
- How the chunk size is measured: by the number of tokens counted by the Hugging Face tokenizer.

```python
from transformers import GPT2TokenizerFast
from langchain.text_splitter import CharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

texts[0]
```

```
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. ...With a duty to one another to the American people to the Constitution.'
```
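The token-based constructors and the recursive splitter from section 六 can also be combined. A minimal sketch, assuming from_tiktoken_encoder is available on RecursiveCharacterTextSplitter as well (the original article only shows it on CharacterTextSplitter):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Assumption: from_tiktoken_encoder is inherited by RecursiveCharacterTextSplitter,
# so text is still split on ["\n\n", "\n", " ", ""] but chunk size is measured
# in tiktoken tokens rather than characters.
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100,
    chunk_overlap=20,
)
token_chunks = token_splitter.split_text(state_of_the_union)
```

This keeps the paragraph- and sentence-preserving behavior of the recursive splitter while still respecting a model's token budget.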
