Choosing the Right Intel Workstation Processor for TensorFlow Inference and Development

With the increasing number of data scientists using TensorFlow, it might be a good time to discuss which workstation processor to choose from Intel’s lineup. You have several options to choose from:

  • Intel Core processors–with i5, i7, and i9 being the most popular

  • Intel Xeon W processors, which are optimized for workstation workloads

  • Intel Xeon Scalable processors (SP), which are optimized for server workloads and 24/7 operation

The next logical question is: which processor should you choose if TensorFlow inference performance is critical? The first thing we need to do is look at where the performance comes from in the TensorFlow library. One of the main influences on TensorFlow performance (and that of many other machine learning libraries) is the Advanced Vector Extensions (AVX) instruction sets, specifically Intel AVX2 and Intel AVX-512. Intel’s runtime libraries use AVX to power TensorFlow performance on Intel processors via the oneAPI Deep Neural Network Library (oneDNN). Other specialized instruction sets, such as the Vector Neural Network Instructions (VNNI) from Intel Deep Learning Boost, are also used by oneDNN.

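oneDNN dispatches different kernels depending on which of these instruction sets the CPU supports, so it is worth verifying what a given machine reports. A minimal sketch for Linux, parsing the flags line of /proc/cpuinfo (the flag names `avx2`, `avx512f`, and `avx512_vnni` are the ones the kernel exposes; on non-Linux systems a sample string is used instead):

```python
from pathlib import Path

# CPU flags relevant to oneDNN kernel dispatch, as named by the Linux kernel:
# avx2 (Intel AVX2), avx512f (AVX-512 Foundation), avx512_vnni (DL Boost VNNI).
INTERESTING = {"avx2", "avx512f", "avx512_vnni"}

def vector_flags(cpuinfo_text: str) -> set:
    """Return the subset of INTERESTING flags found in a /proc/cpuinfo dump."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return INTERESTING & set(line.split(":", 1)[1].split())
    return set()

try:
    text = Path("/proc/cpuinfo").read_text()
except OSError:
    # Fallback sample so the sketch also runs on non-Linux machines
    text = "flags\t\t: fpu sse2 avx2 avx512f avx512_vnni"
print(sorted(vector_flags(text)))
```

A processor without Intel Deep Learning Boost (such as the Core i9 discussed below) will not report `avx512_vnni`.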
What other factors matter? Does the number of cores matter? Base clock speeds? Let’s benchmark a few Intel processors to get a better understanding. For this test, we have five configurations in workstation chassis (Table 1).

Table 1. Benchmarking systems

We are using the ResNet-50 model with the ImageNet data set, tested at different batch sizes for inference throughput and latency. Figure 1 shows how many images the inference model can handle per second. The 18-core systems consistently deliver better throughput. What you’re seeing in these TensorFlow benchmarks is how machine learning (ML) and deep learning (DL) translate from framework to algorithm, and then from algorithm to hardware. At the end of the day, there is a limit to how well many AI algorithms parallelize. Many ML and DL algorithms aren’t naturally parallel, and in a workstation configuration, where the power envelope is defined by the wall socket’s maximum current, a balance must be struck between core count and core frequency.

Figure 1. TensorFlow inference throughput on the benchmarking systems

Let’s take a deeper look at Figure 1. If we compare the dual-socket Intel Xeon 6258R to the single-socket 6240L, the results show that an 18-core processor with slightly higher frequencies is better for TensorFlow inference than a configuration with over 3x the number of cores. The lesson here is that many ML and DL workloads don’t scale well across cores, so more cores may not always be better.

Figure 2 shows the inference latency on the benchmarking systems. This is the time it takes an inference model loaded in memory to make a prediction based on new data. Inference latency is important for time-sensitive or real-time applications. The dual-socket system has slightly higher latency in FP32 but the lowest latency in INT8. The 18-core systems have similar latencies and exhibit performance in line with the throughput performance rankings in Figure 1.

Figure 2. TensorFlow inference latency on the benchmarking systems

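Before comparing the systems, note that for a fixed batch size the throughput in Figure 1 and the per-batch latency in Figure 2 are reciprocal views of the same measurement. A quick sketch of the arithmetic (the numbers here are illustrative, not the benchmark results):

```python
def throughput_img_per_sec(batch_size: int, batch_latency_s: float) -> float:
    """Images processed per second when one batch takes batch_latency_s seconds."""
    return batch_size / batch_latency_s

def batch_latency_ms(batch_size: int, img_per_sec: float) -> float:
    """Inverse relation: milliseconds needed to process one batch."""
    return 1000.0 * batch_size / img_per_sec

# A hypothetical system scoring 256 img/s at batch size 128
# needs 500 ms per batch:
print(batch_latency_ms(128, 256.0))  # → 500.0
```

Single-image (batch size 1) latency, which matters for the real-time scenarios below, is not simply the batch latency divided by the batch size, since large batches amortize fixed overheads.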
The Intel Xeon W-2295 does the best in most of the tests, but why? It comes down to the Intel AVX-512 base and turbo frequencies. The Intel Xeon W processor series is clocked higher than the Intel Xeon SP under AVX instructions. When executing AVX instructions, the processor shifts to a lower clock speed to offset the additional power draw, and because the vast majority of ML and DL code uses AVX-512, the higher AVX base and turbo frequencies of the Intel Xeon W give it faster throughput than the comparable Intel Xeon SP processor. Additionally, 18 cores appears to be the best balance between core count and AVX-512 frequency in these tests: going above 18 cores sacrifices AVX frequencies and increases latency, while fewer cores means lower throughput and higher latency.

Why is there such an advantage in INT8 batch inference with the Intel Xeon processors over the Intel Core i9? What you are seeing there is the use of the VNNI instructions by oneDNN, which reduce convolution operations from three instructions to one. The Intel Xeon processors used in these benchmarks support VNNI for INT8, but the Intel Core processor does not. The performance difference is quite noticeable in the previous charts.

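To see what this reduction looks like, consider VNNI’s VPDPBUSD instruction: in one step it multiplies four unsigned-8-bit/signed-8-bit pairs per 32-bit lane, sums the products, and adds them to an accumulator, replacing the older three-instruction AVX-512 sequence (vpmaddubsw, vpmaddwd, vpaddd). A scalar Python model of a single lane (illustrative only — real code gets this through oneDNN, not hand-written intrinsics):

```python
def vpdpbusd_lane(acc: int, u8x4: list, s8x4: list) -> int:
    """Model one 32-bit lane of VPDPBUSD: dot product of four
    unsigned-8-bit activations with four signed-8-bit weights,
    accumulated into a 32-bit partial sum in a single instruction."""
    assert all(0 <= u <= 255 for u in u8x4), "activations must be uint8"
    assert all(-128 <= s <= 127 for s in s8x4), "weights must be int8"
    return acc + sum(u * s for u, s in zip(u8x4, s8x4))

# Four activation/weight pairs folded into an accumulator in one step:
print(vpdpbusd_lane(10, [1, 2, 3, 4], [1, 1, 1, -1]))  # → 12
```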
Finally, let’s talk about how to choose the Intel processor to best fit your TensorFlow requirements:

  • Do you need large memory to load the data set? Do you need the ability to administer your workstation remotely? If so, get a workstation with the Intel Xeon Gold 6240L, which can be configured with up to 3.3 TB of memory using a mix of Intel Optane DC Persistent Memory and DRAM.

  • Need the best all-rounder with Intel Xeon features and moderate system memory? Use the Intel Xeon W-2295. Forgoing some of the server-class features like Intel Optane DCPMM and 24/7 operation, you get equivalent inference performance at half the cost of the Intel Xeon SP configurations and over 30% less power.

  • Need a budget-friendly option? An Intel Core processor such as the i9–10900k fits the bill.

  • Have additional inference needs on the workstation beyond the CPU? We have products such as Intel Movidius and purpose-built AI processors from Intel’s Habana product line that can help fit those needs.


With the performance attributes of TensorFlow detailed above, picking the right workstation CPU should be a bit easier.

If you want to reproduce these tests to evaluate your TensorFlow needs, use the following instructions. First download the GitHub repo (https://github.com/IntelAI/models) and configure the Conda (Channel: Intel, Python=3.7.7) and runtime environment:

  • Set OMP_NUM_THREADS to the number of cores

  • KMP_BLOCKTIME=0

  • intra_op_parallelism_threads=<cores>

  • inter_op_parallelism_threads=2

  • Prepend numactl --cpunodebind=0 --membind=0 to the command below for systems with two or more sockets

Finally, run the following command: python launch_benchmark.py --in-graph <built model> --model-name resnet50 --framework tensorflow --precision fp32<or int8> --mode inference --batch-size=128 --socket-id 0 --data-location <synthetic or real dataset>

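The settings above can be collected into a small helper that assembles the full invocation (the helper itself is illustrative and not part of the IntelAI repo; the intra_op/inter_op thread-pool settings are applied inside the benchmark scripts rather than on the command line):

```python
import shlex

def benchmark_command(cores: int, sockets: int, graph: str, data: str,
                      precision: str = "fp32", batch_size: int = 128) -> str:
    """Build the launch_benchmark.py command line described above,
    prepending numactl binding on systems with two or more sockets."""
    env = f"OMP_NUM_THREADS={cores} KMP_BLOCKTIME=0 "
    numa = "numactl --cpunodebind=0 --membind=0 " if sockets >= 2 else ""
    cmd = (f"python launch_benchmark.py --in-graph {shlex.quote(graph)} "
           f"--model-name resnet50 --framework tensorflow "
           f"--precision {precision} --mode inference "
           f"--batch-size={batch_size} --socket-id 0 "
           f"--data-location {shlex.quote(data)}")
    return env + numa + cmd

print(benchmark_command(18, 2, "resnet50_fp32.pb", "/data/imagenet"))
```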
For more information or to learn more about Intel products, please visit www.intel.com.

Source: https://medium.com/intel-analytics-software/choosing-the-right-intel-workstation-processor-for-tensorflow-inference-and-development-4afeec41b2a9


打开cmd命令提示符&#xff0c;mvn install是将jar包安装到本地库&#xff0c;mvn deploy是将jar包上传到远程server&#xff0c;install和deploy都会先自行bulid编译检查&#xff0c;如果确认jar包没有问题&#xff0c;可以使用-Dmaven.test.skiptrue参数跳过编译和测试。 全命…