Natural Language Processing


NLP before LLM

Context-free Grammar

A context-free grammar (CFG) contains a set of production rules, which specify how each non-terminal can be rewritten as a string of terminals and non-terminals. A derivation is a sequence of steps in which non-terminals are replaced by the right-hand side of a production rule in the CFG.

Let \(G\) be a CFG. The language of \(G\), denoted \(\mathcal{L}(G)\), is the set of strings derivable by \(G\) (from the start symbol). A language \(L\) is called a context-free language (CFL) if there is a CFG \(G\) such that \(L=\mathcal{L}(G)\).

Theorem. Every regular language (i.e., every language described by a regular expression / regex) is context-free, but not vice versa.

Example \(L_0\): a toy CFG for (a fragment of) natural language.

Clearly, a CFG is usually not perfect: \(L_0\) may over-generate, i.e., cover more strings than the language you really want, but it is already useful for parsing.

A CFG is in Chomsky normal form (CNF) if it is \(\varepsilon\)-free and, in addition, each production is either of the form \(A\to B\;C\) or \(A\to a\). (\(A, B, C\) may be the same symbol.)

Theorem. Any context-free grammar can be converted into an equivalent grammar in CNF (generating the same language, up to the empty string).

Given a CFG, syntactic parsing refers to the problem of mapping from a sentence to its parse tree. We can use dynamic programming to parse a sentence (CKY parsing). However, there could be ambiguity. To resolve ambiguity, we can use a probabilistic CFG (PCFG), which associates a probability with each production rule. The probability of a parse tree is the product of the probabilities of all the production rules used in the tree.
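To make CKY concrete, here is a minimal recognizer for a grammar already in CNF; this is only a sketch, and the toy grammar and sentence are illustrative assumptions, not part of these notes. Extending it to return parse trees or PCFG scores only requires storing backpointers/probabilities in the chart.

```python
# Minimal CKY recognizer for a CFG in Chomsky normal form (toy grammar assumed).
from itertools import product

binary_rules = {            # A -> B C, stored as (B, C) -> {A}
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexical_rules = {           # A -> a, stored as a -> {A}
    "the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"},
}

def cky_recognize(words):
    n = len(words)
    # chart[i][j] = set of non-terminals that derive words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical_rules.get(w, ()))
    for span in range(2, n + 1):            # span width
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):       # split point
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary_rules.get((B, C), set())
    return "S" in chart[0][n]

print(cky_recognize("the dog saw the cat".split()))   # True
```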

Latent Semantic Analysis

Consider the term-document (TD) matrix \(W_{td}\in\mathbb{R}^{m\times n}\), where \(m\) is the number of terms and \(n\) is the number of documents. Each entry \(W_{td}(ij)\) is the frequency of term \(i\) in document \(j\); i.e., rows are terms and columns are documents.

Since the co-occurrence statistics might be sparse, instead of directly using \(W_{td}\), we want to infer a "latent" vector representation (a row vector) for the words/documents which satisfies:

\[W_{td}(ij)\approx \max(lv(w_i)lv(d_j)^\top,0). \]

To achieve this, we can use SVD. Moreover, to make related words have related vectors, we use a low-rank approximation: compute the SVD of \(W_{td}\) and keep only the top \(k\) singular values and the corresponding singular vectors.

Another problem is that SVD pays too much attention to high-frequency words, so we apply TF-IDF normalization. That is, we replace \(W_{td}(ij)\) by \(tf(i,j)\cdot idf(i)\), where

  • Term frequency (\(tf(i,j)\)):

\[\frac{\text{# of times word }i\text{ appears in doc }j}{\text{# of words in doc }j}. \]

  • Inverse document frequency (\(idf(i)\)), smoothed version:

\[\log \left(\frac{\text{# of docs} + 1}{\text{# of docs containing word }i + 1}\right)+1. \]
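As a concrete illustration, here is a minimal numpy sketch of LSA: build the term-document count matrix, apply the TF-IDF weighting defined above, and take a rank-\(k\) truncated SVD. The tiny corpus, \(k=2\), and the convention of attaching the singular values to the term vectors are illustrative assumptions.

```python
# Minimal LSA sketch: TF-IDF weighting + rank-k truncated SVD (toy corpus assumed).
import numpy as np

docs = [
    "apple banana apple",
    "banana fruit salad",
    "car engine wheel",
    "engine car car",
]
vocab = sorted({w for d in docs for w in d.split()})
W = np.zeros((len(vocab), len(docs)))            # term-document count matrix
for j, d in enumerate(docs):
    for w in d.split():
        W[vocab.index(w), j] += 1

# TF-IDF: tf(i,j) = count / doc length, idf(i) = log((N+1)/(df_i+1)) + 1
tf = W / W.sum(axis=0, keepdims=True)
df = (W > 0).sum(axis=1)
idf = np.log((len(docs) + 1) / (df + 1)) + 1
W_tfidf = tf * idf[:, None]

# Rank-k truncated SVD: U_k S_k V_k^T is the best rank-k approximation of W_tfidf
U, S, Vt = np.linalg.svd(W_tfidf, full_matrices=False)
k = 2
word_vecs = U[:, :k] * S[:k]                     # one k-dim row vector per term
doc_vecs = Vt[:k].T                              # one k-dim row vector per document
# word_vecs @ doc_vecs.T reconstructs the rank-k approximation of W_tfidf
print({w: np.round(v, 3) for w, v in zip(vocab, word_vecs)})
```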

Hidden Markov Model

Gaussian Mixture Model

Consider a mixture of \(k\) Gaussian distributions. Each Gaussian has its own mean \(\mu_c\), variance \(\sigma_c\), and mixture weight \(\pi_c\). The probability density function of the mixture is

\[p(x)=\sum_{c=1}^k \pi_c \mathcal{N}(x;\mu_c,\sigma_c). \]

Equivalent "latent variable" form:

\[\begin{align*} p(z=c)&=\pi_c\\ p(x|z=c)&=\mathcal{N}(x;\mu_c,\sigma_c). \end{align*} \]

Given a dataset \(X\) that is i.i.d. drawn from GMM, we want to estimate the parameters \(\theta=\{\pi_c,\mu_c,\sigma_c\}_{c=1}^k\).

The Expectation-Maximization Algorithm

The EM algorithm is used to find maximum likelihood parameters of a statistical model. Formally, we want to optimize \(\theta\) to maximize the log-likelihood of the observed data:

\[\log p(X|\theta)=\log \sum_Z p(X,Z|\theta). \]

Consider any \(q\) distribution, applying Jensen's inequality gives

\[\begin{align*} \log \sum_Z p(X,Z|\theta)=&\ \log \sum_Z q(Z)\frac{p(X,Z|\theta)}{q(Z)}\\ \ge&\ \sum_Z q(Z)\log p(X,Z|\theta)+\text{H}(q), \end{align*} \]

where \(\text{H}(q)=-\sum_Z q(Z)\log q(Z)\) is the entropy of \(q\).

Denote the current parameter as \(\theta'\). Set \(q(Z)=p(Z|X,\theta')\) and let the above objective be \(Q(\theta|\theta')\). One can verify that

\[\log p(X|\theta)= Q(\theta|\theta')+\text{KL}(q(Z)||p(Z|X,\theta)). \]

When \(\theta'\) is close to \(\theta\), the KL divergence is small, and maximizing \(Q(\theta|\theta')\) approximately maximizes \(\log p(X|\theta)\). In particular, at \(\theta=\theta'\) the KL term is zero, so the bound is tight and any \(\theta\) that increases \(Q\) also increases \(\log p(X|\theta)\).

The EM algorithm iteratively performs the following two steps until convergence. In iteration \(t\):

E-step: Let \(\theta_t\) be the current parameter estimate. Compute $$q(Z)=p(Z|X,\theta_t)=\frac{p(Z,X|\theta_t)}{p(X|\theta_t)}.$$

For GMM, we have \(\theta=\{\pi_c,\mu_c,\sigma_c\}_{c=1}^k\), so the responsibility of component \(c\) for data point \(x_i\) (i.e., \(q(Z=c)\) when \(X=x_i\)) is

\[r_{ic}=\frac{\pi_{c}\mathcal{N}(x_i;\mu_{c},\sigma_{c})}{\sum_{c'} \pi_{c'} \mathcal{N}(x_i;\mu_{c'},\sigma_{c'})}. \]

M-step: We fix \(q(Z)\) and optimize \(\theta\) to maximize \(Q(\theta|\theta_t)\). Since the entropy term \(\text{H}(q)\) does not depend on \(\theta\), this amounts to maximizing

\[\sum_i\sum_c r_{ic}\log \left(\pi_c\mathcal{N}(x_i;\mu_c,\sigma_c)\right). \]

Let \(m_c=\sum_i r_{ic}\) and \(m=\sum_c m_c\). Solving the optimization problem gives

\[\begin{align*} \pi_c=&\ \frac{m_c}{m},\\ \mu_c=&\ \frac{1}{m_c}\sum_i r_{ic} x_i,\\ \sigma_c=&\ \frac{1}{m_c}\sum_i r_{ic}(x_i-\mu_c)^2. \end{align*} \]
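Here is a minimal numpy sketch of EM for a one-dimensional GMM, following the E-step and M-step updates above; the synthetic data, the initialization, and the number of iterations are illustrative assumptions.

```python
# Minimal EM for a 1-D Gaussian mixture (synthetic data and initialization assumed).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])

k = 2
pi = np.full(k, 1.0 / k)                 # mixture weights pi_c
mu = np.array([x.min(), x.max()])        # means mu_c (crude but deterministic init)
var = np.full(k, x.var())                # variances (the sigma_c of the updates above)

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E-step: responsibilities r_ic = p(z = c | x_i, theta)
    dens = pi * normal_pdf(x[:, None], mu, var)      # shape (n, k)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi_c, mu_c, sigma_c from the responsibilities
    m_c = r.sum(axis=0)
    pi = m_c / m_c.sum()
    mu = (r * x[:, None]).sum(axis=0) / m_c
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / m_c

print(np.round(pi, 2), np.round(mu, 2), np.round(var, 2))   # weights, means, variances
```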

Hidden Markov Model

Given a sentence, we want to identify the grammatical category (part of speech) of each word. We can use an HMM to model this problem. Let \(O\) be the observed words, and \(Q\) be the hidden states (grammatical categories). The model parameters include:

  • Initial state distribution: \(\pi_i=p(q_1=i)\);
  • State transition distribution: \(a_{ij}=p(q_{t+1}=j|q_t=i)\);
  • Emission distribution: \(b_i(o_t)=p(o_t|q_t=i)\).

If we're given the parameters \(A,B,\pi\), we can calculate the probability of an observation sequence \(O\) using the forward algorithm, and we can find the most likely hidden state sequence \(Q\) using the Viterbi algorithm. They both use dynamic programming.

Thus, for supervised learning, we just need to count & normalize; for unsupervised learning, we can use the EM algorithm (Baum-Welch) with \(Q\) as the hidden variables.
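As an illustration, here is a minimal Viterbi decoder for a toy two-tag HMM tagger; the parameters \(\pi, A, B\) and the tiny vocabulary are made-up assumptions, not taken from these notes.

```python
# Minimal Viterbi decoding for an HMM POS tagger (toy parameters assumed).
import numpy as np

states = ["Det", "Noun"]
vocab = {"the": 0, "dog": 1, "cat": 2}
pi = np.array([0.8, 0.2])                 # initial distribution p(q_1 = i)
A = np.array([[0.1, 0.9],                 # a_ij = p(q_{t+1} = j | q_t = i)
              [0.6, 0.4]])
B = np.array([[0.9, 0.05, 0.05],          # b_i(o) = p(o | q_t = i)
              [0.1, 0.5, 0.4]])

def viterbi(obs):
    T, N = len(obs), len(states)
    delta = np.zeros((T, N))              # best log-prob of a path ending in state j at time t
    psi = np.zeros((T, N), dtype=int)     # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)     # scores[i, j]: prev state i -> state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # Backtrace the most likely state sequence
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi([vocab[w] for w in "the dog".split()]))  # ['Det', 'Noun']
```

The forward algorithm has the same structure, with the max over previous states replaced by a (log-)sum.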

N-gram

A language model (LM) assigns a probability to any sequence of words. That is,

\[\sum_{W\in V^*} p(W)=1. \]

A good LM should put high probability to more "likely" sentences.

  • Uni-gram LM: \(P(w_1w_2\cdots w_T)=P(w_1)P(w_2)\cdots P(w_T).\)
  • Bi-gram LM: \(P(w_1w_2\cdots w_T)=P(w_1)P(w_2|w_1)P(w_3|w_2)\cdots P(w_T|w_{T-1}).\)
  • Tri-gram LM: \(P(w_1w_2\cdots w_T)=P(w_1)P(w_2|w_1)P(w_3|w_1w_2)\cdots P(w_T|w_{T-2}w_{T-1}).\)
  • N-gram LM: \(P(w_1w_2\cdots w_T)=\prod_{t=1}^T P(w_t|w_{t-(n-1)}\cdots w_{t-1}),\) where the context is truncated at the start of the sentence (as in the examples above).

There are two special tokens that are very useful in building a LM.

  • The end-of-sentence token: <eos>, tells where a sentence could end.
  • The out-of-vocabulary token: <unk>, replaces rare words (e.g., appearing only once in the training corpus).

In order to build a LM, we need to estimate the conditional probabilities \(P(w_t|w_{t-(n-1)}\cdots w_{t-1})\). The straightforward way is to directly count & normalize. However, this leads to the zero-frequency problem: any sentence containing an unseen n-gram gets probability zero, so the model cannot create new sentences. Ways to address it:

  • Add-k smoothing: $$P(w_t|w_{t-(n-1)}\cdots w_{t-1})=\frac{\text{count}(w_{t-(n-1)}\cdots w_{t-1}w_t)+k}{\text{count}(w_{t-(n-1)}\cdots w_{t-1})+k|V|}.$$
  • Linear interpolation: $$P(w_t|w_{t-(n-1)}\cdots w_{t-1})=\lambda_n P(w_t|w_{t-(n-1)}\cdots w_{t-1})+\lambda_{n-1} P(w_t|w_{t-(n-2)}\cdots w_{t-1})+\cdots+\lambda_1 P(w_t),$$
    where \(\sum_{i=1}^n \lambda_i=1\).
  • Backoff: $$P(w_t|w_{t-(n-1)}\cdots w_{t-1})=\begin{cases} \frac{\text{count}(w_{t-(n-1)}\cdots w_{t-1}w_t)}{\text{count}(w_{t-(n-1)}\cdots w_{t-1})}, & \text{if } \text{count}(w_{t-(n-1)}\cdots w_{t})>0, \\ \alpha(w_{t-(n-1)}\cdots w_{t-1})\,P(w_t|w_{t-(n-2)}\cdots w_{t-1}), & \text{otherwise,} \end{cases}$$
    where \(\alpha(w_{t-(n-1)}\cdots w_{t-1})\) is a normalization factor that makes the probabilities sum to one.

Given a test set \(W\), we define the LM's perplexity to be

\[PPL(W)=2^{-l},\text{ where }l=\frac{\log_2(P(W))}{\text{token_len}(W)}. \]

Note that perplexity rewards assigning high probability to held-out text, so it cares more about coverage/diversity than about the quality of generated text.
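A minimal sketch of a bigram LM with add-k smoothing and the perplexity definition above, on a toy corpus; the corpus, \(k=1\), and the simple <eos>/<unk> handling are illustrative assumptions.

```python
# Minimal bigram LM with add-k smoothing and perplexity (toy corpus assumed).
import math
from collections import Counter

train = ["the cat sat", "the dog sat", "a dog barked"]
test = ["the cat barked"]
k = 1.0                                         # add-k constant

def tokenize(sent):
    return sent.split() + ["<eos>"]

vocab = {w for s in train for w in tokenize(s)} | {"<unk>"}
V = len(vocab)

context_counts, bigram_counts = Counter(), Counter()
for s in train:
    toks = ["<eos>"] + tokenize(s)              # <eos> also serves as the start context
    context_counts.update(toks[:-1])
    bigram_counts.update(zip(toks[:-1], toks[1:]))

def prob(w, prev):
    w = w if w in vocab else "<unk>"
    prev = prev if prev in vocab else "<unk>"
    return (bigram_counts[(prev, w)] + k) / (context_counts[prev] + k * V)

def perplexity(sents):
    log2p, n_tokens = 0.0, 0
    for s in sents:
        toks = ["<eos>"] + tokenize(s)
        for prev, w in zip(toks[:-1], toks[1:]):
            log2p += math.log2(prob(w, prev))
            n_tokens += 1
    return 2 ** (-log2p / n_tokens)

print(round(perplexity(test), 2))
```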

Word2vec

In the LSA section, we obtained word vectors by decomposing the term-document matrix; in this section, we discuss another approach, based on prediction.

  • Skip-gram: learn representations that predict the context given a word.
  • CBOW (Continuous Bag-of-Words): learn representations that predict a word given its context.
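The two objectives differ in the training examples they extract from a context window. Here is a minimal sketch of that extraction step (not the full training loop); the window size and toy sentence are illustrative assumptions.

```python
# Building skip-gram and CBOW training examples from a context window (toy sentence assumed).
sentence = "the quick brown fox jumps".split()
window = 2

skipgram_pairs = []   # skip-gram: (center word) -> predict each context word
cbow_pairs = []       # CBOW: (context words) -> predict the center word
for i, center in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    skipgram_pairs += [(center, c) for c in context]
    cbow_pairs.append((context, center))

print(skipgram_pairs[:3])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
print(cbow_pairs[0])        # (['quick', 'brown'], 'the')
```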
