6 things I learned from teaching R at Facebook

From 2018 to 2019, I worked at Facebook as a data scientist, and during that time I helped develop and teach a class for R beginners. It was a two-day course taught about once a month to a group of roughly 15–20 students, and the goal was for them to leave the class able to use R in their day-to-day work.

This article shares some of the things that I learned from teaching these classes, with an emphasis on what worked well for the students. Hopefully these six tips can be of use to anyone who uses R, especially those just beginning their journey.

But first, my personal experiences learning R

I initially learned R as a statistics undergrad at Berkeley. In college I despised using R, and used it as a means to an end for completing projects and problem sets so that I could graduate.

Once I entered the workforce and started learning R from my coworkers, my perspective on the language started to shift. I realized that there were some key gaps in how R was taught in college. Mainly, we were learning R for a classroom setting, which does not translate well to a workplace setting.

Since graduating college, I have grown to embrace R fully: I’ve developed R packages at Facebook and Doordash, taught R at Facebook, and attended several R conferences. With my background out of the way, I want to share some tips and advice for those on their own journey to using R in their day-to-day work.

Note: I graduated college in 2015. The curriculum has likely improved since then, so my personal experiences may not be as relevant for more recent college grads.

1. R is not just for data scientists, and having a reason for using the language will make learning easier

Before teaching R, I assumed that a large majority of our students would be data scientists looking to increase their impact by bringing R into their SQL/Excel workflow. However, I was really surprised by the diversity of people who attended these classes. We had a good mix of software engineers, data scientists, data engineers, researchers, and finance/operations people, just to name a few.

For data scientists, the main reason for taking the class was clear: they’re constantly working with data, and learning R gives them a more effective and flexible way of doing so. Learning R also comes more naturally to them, because they have plenty of opportunity to practice the language while making a direct impact on their work.

When I tried to understand why some of the other students signed up for the class, I found a variety of reasons, for example:

  • Engineers who wanted to improve their ability to modify and visualize data.
  • Operations and finance people looking for an alternative to repetitive daily/weekly Excel updates.
  • People who were already familiar with R but wanted to freshen up their knowledge and learn how to use it effectively at Facebook.

In the three examples above, we see ways that non-data scientists can gain value from learning R. A tangible use case like these helps you stay focused, because learning R takes a fair amount of persistence. Broadly, you want to be in one of these two categories if you’re not a data scientist/analyst:

  1. You’re already doing something, and it can be improved or made faster by learning R.
  2. You want to do something, but it will be very difficult or impossible without knowing R (or some other programming language).

One last point on this topic: sometimes R is not the best tool for the job. For example, if you already know how to use SQL + Excel, you already have a deadly duo of tools to aggregate, analyze, and visualize data. Having used R myself for around 7 years, I often find myself resorting to SQL + Excel simply because it’s faster and more sharable. So if you spend a lot of time learning R, don’t feel like you need to use it for everything, because sometimes it will actually take twice as long as using tools you’re already an expert in.

2. Tidyverse is king

What is Tidyverse? The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

The two most popular and useful packages in Tidyverse are dplyr (for modifying and aggregating data) and ggplot2 (for creating visualizations). (Source: tidyverse.org)

To keep this section short and to the point: Tidyverse is the quickest and most straightforward way to aggregate and modify data in R. Not only that, but it makes learning R a lot more fun and easy. I first learned R without Tidyverse and it was a miserable experience, and others who learned R in a similar way share my sentiments. Tidyverse has become so widespread among R users that I would not recommend learning or teaching R without it.

If you’ve never used Tidyverse, it’s super simple to set up, and I would highly encourage you to start using it (there are many resources online to learn from).

# This is all you need to install tidyverse:
install.packages('tidyverse')
library(tidyverse)

Note: I reference some packages later in this article. If you ever need to install a new package, you can use install.packages() as shown above. Once installed, load it into R using library().
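To make "aggregate and modify data" concrete, here is a minimal sketch of a typical dplyr aggregation. It is not from the class materials; it uses R's built-in mtcars dataset as a stand-in, and the column names avg_mpg and n_cars are made up for illustration.

```r
library(dplyr)

# Average miles per gallon and number of cars per cylinder count,
# using the built-in mtcars dataset as stand-in data
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),  # mean fuel efficiency within each group
    n_cars  = n()         # number of rows (cars) in each group
  ) %>%
  arrange(desc(avg_mpg))  # most efficient group first

print(mpg_by_cyl)
```

Each step reads top to bottom as a sentence (group, summarise, sort), which is a large part of why this style tends to be easier for beginners.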

3. Cheatsheets, cheatsheets, cheatsheets

This goes well with the previous topic because learning Tidyverse can be daunting at first, with its unique syntax and long list of functions. Luckily, the RStudio team has created a bunch of cheatsheets. For our in-person classes, we made sure to print cheat sheets for all of the students so that they wouldn’t have to keep switching tabs to search for functions. If you are able to, I would highly recommend printing and laminating your own cheat sheets for personal use. I still reference my cheat sheets even after using the language for over 5 years.

The RStudio website (https://rstudio.com/resources/cheatsheets/) contains a list of cheatsheets published by the RStudio team. Some of the topics there are more advanced, but I would say the two essential cheat sheets to get started are the ones for dplyr and ggplot2.

4. Learn by using internal datasets

Within the first hour of class, we have our students query data from the internal database into R. At Facebook, this would be as simple as using our internal package and writing:

df <- presto("SELECT * from example_table limit 10000")

There are two main reasons I recommend learning with internal datasets:

  • Being able to query internal data directly into R amplifies your ability to use company data. If you can’t, you have to use a workaround such as exporting data into a csv file and then reading that into R. This wastes a lot of time, so I would get familiar with bringing data directly into R as early as possible, even if it means an extra hour or two of initial setup or getting the right permissions.

  • A company’s data is one of its most valuable resources. If you work at Facebook, you should take advantage of the fact that you have some of the richest and most interesting datasets in the world. The same applies to any other company: Uber with its ride data, Airbnb with its bookings data, Medium with data on articles. A lot of online resources will have you use a generic dataset, so I would take the extra step and bring in key company datasets when possible to aid your learning. By doing this, you’re already in the mindset of easing R into your workflow.
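The presto() function above is internal to Facebook. As a rough sketch of the same idea with publicly available tools, the example below uses the DBI package. To stay runnable anywhere it connects to an in-memory SQLite database (via RSQLite) as a stand-in for a company database; the table contents are placeholder data, and with a real warehouse the driver and connection details would come from your data team.

```r
library(DBI)

# Stand-in for a company database: an in-memory SQLite database.
# (Assumption: with a real warehouse you would pass a different driver,
# for example odbc::odbc(), plus connection details from your data team.)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Placeholder table, created only so the query below has something to read
dbWriteTable(con, "example_table", mtcars)

# Query straight into a data frame, mirroring the presto() call above
df <- dbGetQuery(con, "SELECT * FROM example_table LIMIT 10")

dbDisconnect(con)
print(nrow(df))
```

The useful property is that the query result arrives as an ordinary data frame, so no intermediate csv export is needed.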

5. The importance of importing and exporting data

R is a great tool for analyzing data, but if you can’t get data into or out of R, that’s a really big problem. The previous section touched on this a little, so this section is meant to be more practical and goes over some of the main methods for getting different types of data into and out of R.

By focusing on these methods, you should be able to handle nearly all of the importing and exporting you need. And of course, there is also a cheat sheet for this on the RStudio cheatsheets page (https://rstudio.com/resources/cheatsheets/) that you may find helpful.

For importing data:

  • CSV: read_csv() (Tidyverse)

  • Excel: read_excel() (from the readxl package, which is installed with Tidyverse but loaded separately with library(readxl))

  • Google Sheets: Similar to the above, but may require extra steps for private sheets. You want to use the package googlesheets4. Worst case scenario, you export the Google Sheet as a csv and read it in using read_csv().

  • Internal database: Use SQL to bring data directly into R. You’ll need to consult with your data team to see if there is an internal package to do this. At Facebook, presto("SELECT * FROM tbl") is all you need to grab data from a table. At smaller companies, there may be some extra steps to connect R to an internal database, but at the very least, setting up an ODBC connection should allow you to grab data.
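As a small illustration of the CSV case, the sketch below writes a tiny file to a temporary path and reads it back with read_csv(). The file contents (city and orders columns) are invented for the example.

```r
library(readr)

# Create a small CSV in a temporary file so the example is self-contained
# (the city/orders values are made-up sample data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,orders", "Oakland,120", "Berkeley,85"), csv_path)

# read_csv() returns a tibble and guesses column types from the data
orders <- read_csv(csv_path, show_col_types = FALSE)
print(orders)

# For Excel files, the equivalent call would be readxl::read_excel("file.xlsx")
```

In day-to-day use you would replace csv_path with the path to your own file.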

For exporting data:

  • Copy to clipboard: write_clip() from the clipr package copies a data frame directly to your clipboard. If your company uses Google Sheets, this is the quickest way to get data into a sheet, which makes it one of the most useful functions you can learn. Essentially, it cuts the steps down from "Export df to csv -> Open csv and copy contents -> Paste into Sheets" to "Copy df to clipboard -> Paste into Sheets".

  • Copy a plot/graph: When you make a graph in R, the easiest way to share it is to copy and paste it. Simply zoom in on a plot to bring it into its own window, then right click and copy the image.

  • Screenshot directly from R: If you want to share a small table more informally (e.g. on Slack), taking a screenshot of your R console is probably the best bet. If you want to get fancy, you can use the kable() function from the knitr package to clean up your table so that it’s a little easier to read.

# Format the iris table to be a little neater
library(knitr)  # provides kable(); %>% comes from the tidyverse loaded earlier
iris %>% head %>% kable
  • Write to csv: write_csv()

  • Write to internal database: This is usually a lot more complicated than reading from an internal database, so I would definitely talk to your data team if you think you’ll do this often.
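The two most common export paths above can be sketched as follows. This is an illustrative example rather than the class material: it uses the first rows of the built-in iris dataset as the table to share, writes to a temporary file as a placeholder path, and only touches the clipboard when clipr is installed and a clipboard is actually available.

```r
library(readr)

# Small table to share (first rows of the built-in iris dataset)
summary_df <- head(iris)

# Write to a CSV file; a temporary path is used here as a placeholder
out_path <- tempfile(fileext = ".csv")
write_csv(summary_df, out_path)

# Copy to the clipboard only when clipr is installed and a clipboard exists
# (this step is skipped on headless servers and in non-interactive sessions)
if (requireNamespace("clipr", quietly = TRUE) && clipr::clipr_available()) {
  clipr::write_clip(summary_df)
}
```

After write_clip() runs, the table can be pasted directly into a spreadsheet.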

6. Keep it simple and focus on the fundamentals

There are so many things you can do with R that it can be a little overwhelming at first. For example, the cheat sheet page alone already lists many topics and packages that R is capable of, and even that is just scratching the surface. Don’t be intimidated by this.

We found that focusing on the fundamentals is the best way to learn R:

  1. Importing data
  2. Modifying the data with dplyr to do analysis
  3. Creating visualizations with ggplot2
  4. Exporting results to share with your teammates

If you are able to do these well, then you will have a strong foundation for doing a lot with R.

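The four fundamentals fit together in just a few lines. The sketch below is one possible end-to-end pass, assuming the built-in iris dataset in place of imported company data and a temporary file as the export location; the variable names are illustrative.

```r
library(dplyr)
library(ggplot2)

# 1. Import: the built-in iris data stands in for an imported dataset
flowers <- as_tibble(iris)

# 2. Modify with dplyr: average petal length per species
petal_summary <- flowers %>%
  group_by(Species) %>%
  summarise(avg_petal_length = mean(Petal.Length))

# 3. Visualize with ggplot2: one bar per species
p <- ggplot(petal_summary, aes(x = Species, y = avg_petal_length)) +
  geom_col() +
  labs(x = "Species", y = "Average petal length (cm)")

# 4. Export: save the chart to a file (temporary path as a placeholder)
plot_path <- tempfile(fileext = ".png")
ggsave(plot_path, plot = p, width = 6, height = 4)
```

Swapping step 1 for a database query or read_csv() call turns this into a template for a real analysis.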
Closing thoughts

I wanted to write this article because I enjoyed teaching R classes at Facebook, and I thought that my experiences as an instructor could be helpful for others who do not have access to these types of classes or who are looking for advice on ways to use R more effectively in their own work.

Translated from: https://towardsdatascience.com/6-things-i-learned-from-teaching-r-at-facebook-806fc2832ec0
