电子表格转换成数据库_创建数据库,将电子表格转换为关系数据库,第1部分...

电子表格转换成数据库

Part 1: Creating an Entity Relational Diagram (ERD)

第1部分:创建实体关系图(ERD)

A Relational Database Management System (RDMS) is a program that allows us to create, update, and manage a relational database. Structured Query Language (SQL) is a programming language used to communicate with data stored in the RDMS. The SQL skill for using a RDMA is required for many data-related positions these days. In the social media forums like Quora or Reddit, there are many people who search for a public database for practicing their SQL querying skills. However, although there are many public data sets in a single spreadsheet, there are not many public databases online. Even if you found a data set for the topic you have interest in, the format of the data is usually just one spreadsheet, not a database for most cases. Therefore, it will be very useful to know how to convert a data set in one spreadsheet to a database with multiple tables fitting a relational database format. Knowing the process of this conversion can give us many chances to practice SQL querying skills with a variety of databases.

关系数据库管理系统 (RDMS)是允许我们创建,更新和管理关系数据库的程序。 结构化查询语言 (SQL)是一种编程语言,用于与RDMS中存储的数据进行通信。 如今,与许多数据相关的职位都需要使用RDMASQL技能。 在QuoraReddit等社交媒体论坛中,有很多人在搜索公共数据库以练习其SQL查询技能。 但是,尽管单个电子表格中有许多公共数据集,但在线的公共数据库并不多。 即使您找到了感兴趣的主题的数据集,数据的格式通常也只是一个电子表格,在大多数情况下不是数据库。 因此,了解如何将一个电子表格中的数据集转换为具有多个符合关系数据库格式的表的数据库将非常有用。 了解这种转换的过程可以使我们有很多机会来练习各种数据库SQL查询技能。

This is the first article in a two part series. The goal of this series is to show how to create a relational database for SQL. The whole process is to convert a data in one spreadsheet to a relational database for SQL. In this first article I create an Entity Relational Diagram (ERD) which is a graphical representation showing the relationships between entities.

这是分两部分的系列文章中的第一篇。 本系列的目的是展示如何为SQL创建关系数据库。 整个过程是将一个电子表格中的数据转换为SQL的关系数据库。 在第一篇文章中,我创建一个实体关系图(ERD) ,该图是显示实体之间关系的图形表示。

数据集 (The Data Set)

Let’s find a data set for this practice. There are many public data on the Kaggle dataset. Among a bunch of data sets, I selected a data set named Sample Sales Data. The following pictures show the original format of the data which is contained in one spreadsheet.

让我们为该练习找到一个数据集。 Kaggle数据集上有许多公共数据。 在一系列数据集中,我选择了一个名为Sample Sales Data的数据集 下图显示了一个电子表格中包含的数据的原始格式。

Image for post
Image for post
Sample Sales Data from the Kaggle Dataset
来自Kaggle数据集的样本销售数据

The data has 25 columns and 2824 rows including headers. The list of the headers is as follows:

数据有25列和2824行,包括标题。 标头列表如下:

  • ORDERNUMBER: the identification number for each order

    ORDERNUMBER :每个订单的标识号

  • QUANTITYORDERED: the quantity ordered

    QUANTITYORDERED :订购的数量

  • PRICEEACH: the actual price paid for the transaction (variable across transactions)

    PRICEEACH :为交易支付的实际价格(随交易而变化)

  • ORDERLINENUMBER: the number of the order line

    ORDERLINENUMBER :订单行的编号

  • SALES: the amount of sales

    销售 :销售额

  • ORDERDATE: the order date

    ORDERDATE :订单日期

  • STATUS: the shipping status (Shipped, Resolved, Cancelled, On Hold, Disputed, and In Progress)

    状态 :运输状态(已发货,已解决,已取消,保留,有争议和进行中)

  • QTR_ID: the quarter of the order date

    QTR_ID :订购日期的四分之一

  • MONTH_ID: the month of the order date

    MONTH_ID :订购日期的月份

  • YEAR_ID: the year of the order date

    YEAR_ID :订单日期的年份

  • PRODUCTLINE: the category of products

    PRODUCTLINE :产品类别

  • MSRP: the manufacture’s suggested retail price (constant across transactions)

    MSRP :制造商的建议零售价(在交易中保持不变)

  • PRODUCTCODE: the identification code for each product

    PRODUCTCODE :每个产品的识别码

  • CUSTOMERNAME: the names of customers

    客户名称:客户名称

  • PHONE: the phone numbers of customers

    电话 :客户的电话号码

  • ADDRESSLINE1: addressline 1 for customers

    ADDRESSLINE1 :客户的地址行1

  • ADDRESSLINE2: address line 2 for customers

    ADDRESSLINE2 :客户的地址行2

  • CITY: city names for customers

    城市 :客户的城市名称

  • STATE: state names for customers (only for customers located in the US)

    STATE :客户的州名(仅适用于位于美国的客户)

  • POSTALCODE: postal codes for customers

    邮政编码 :为客户邮政编码

  • COUNTRY: countries for customers

    国家 :客户所在的国家

  • TERRITORY: the regional names of each country (NA, EMEA, Japan, and APAC)

    地区 :每个国家的区域名称(NA,EMEA,日本和亚太地区)

  • CONTACTLASTNAME and CONTACTFIRSTNAME: the last and first names of customers

    CONTACTLASTNAMECONTACTFIRSTNAME :客户的姓氏和名字

  • DEALSIZE: the deal sizes of orders

    DEALSIZE :订单的交易大小

创建实体关系图(ERD) (Creating an Entity Relationship Diagram (ERD))

The information contained in the data can be basically divided into three entities: Customers, Products, and Orders.

数据中包含的信息基本上可以分为三个实体: 客户产品订单

顾客 (Customers)

Based on the columns in the data, the attributes to be included in the entity for Customers are CUSTOMERNAME, CONTACTLASTNAME, CONTACTFIRSTNAME, PHONE, ADDRESSLINE1, ADDRESSLINE2, POSTALCODE, CITY, STATE, COUNTRY, and TERRITORY. CustomerID is added into the entity as the primary key. And for convenience sake, the cases of the attributes are changed as follows:

根据数据中的列,要包含在“客户”实体中的属性是CUSTOMERNAMECONTACTLASTNAME,CONTACTFIRSTNAME, PHONEADDRESSLINE1ADDRESSLINE2POSTALCODECITY,STATE,COUNTRYTERRITORYCustomerID作为主键添加到实体中。 为了方便起见,属性的情况如下更改:

Image for post

The problem of keeping City, State, Country, and Territory in the entity for Customers is that there will be many duplicate rows in the table for Customers. Therefore, entities for City, Country, and Territory are separately created, and then the links between them and Customers are made using foreign keys. The reason why the entity for State is not created is that there are many null values in State so that states do not determine countries and territories in this data.

保留城市国家地区的问题 客户实体中 是表中的客户会有很多重复的行。 因此,将分别创建用于城市国家地区的实体,然后使用外键在它们和客户之间建立链接。 为什么不建立国有实体的原因是,在许多国家空值,这样的状态并不确定这个数据的国家和地区。

In the ERD, there are three types of relationships between entities: One-to-One, One-to-Many(or Many-to-One), and Many-to-Many. The relationships between Territories and Countries, Countries and Cities, and Cities and Customers are One-to-Many relationships. For example, each customer lives in one city in this data while one city has many customers. Therefore, using the Crow’s Foot Notation, the ERD between Customers, Cities, Countries, and Territories is as follows:

在ERD中,实体之间存在三种类型的关系: 一对一,一对多(或多对一)和多对多 。 区域国家国家城市以及城市客户之间的关系是一对多关系。 例如,在此数据中,每个客户居住在一个城市中,而一个城市有许多客户。 因此,使用乌鸦脚符号 , 客户之间的ERD, 城市国家 ,和地区如下:

Image for post
The Crow’s Foot Notations for Customers, Cities, Countries, and Territories
客户,城市,国家和地区的鱼尾纹符号

产品展示 (Products)

The attributes to be included in the entity for Products are PRODUCTCODE, PRODUCTLINE, MSRP, and ORDERNUMBER. Among these attributes, PRODUCTCODE can be a candidate for the primary key in the entity since it is unique and does not have null values. However, in order to keep the format of the primary keys consistent across entities, ProductID is added into the entity and is set to be the primary key. The letter cases are changed as follows:

产品实体中要包含的属性是PRODUCTCODEPRODUCTLINEMSRPORDERNUMBER 。 在这些属性中, PRODUCTCODE可以是实体中主键的候选者,因为它是唯一的并且没有空值。 但是,为了使主键的格式在各个实体之间保持一致, 产品编号 被添加到实体中并设置为主键。 字母大小写更改如下:

Image for post

As we saw in the case for the entity for Customers, including Productline in this entity generates duplicate rows as well. Therefore, the entity for ProductLines is created separately, and is linked to the entity for Products using a foreign key. Since each product is classified into one product line while each product line has many products, the relationship between ProductLines and Products is One-to-Many.

正如我们在“ 客户 ”实体案例中看到的那样,在该实体中包括Productline也会生成重复的行。 因此, ProductLine的实体是单独创建的,并使用外键链接到Products的实体。 由于每个产品分类为一个产品线,而每个产品线具有许多产品,因此ProductLineProducts之间的关系为一对多

Image for post
The Crows’ Notation for ProductLines and Products
产品线和产品的乌鸦符号

This is not the final version for Products since OrderNumber will be included in the entity for Orders as well. Including OrderNumber in the entity for Orders will make some changes in the entity for Products. This will be covered after the entity for Orders is created.

这并非以来ORDERNUMBER 产品将包含在订单的实体以及最终版本。 在订单的实体,包括订单号码将在产品实体的一些变化。 创建订单实体后将对此进行介绍。

命令 (Orders)

The entity for Orders should include ORDERNUMBER, ORDERDATE, PRODUCTCODE, STATUS, QUARTER_ID, MONTH_ID, YEAR_ID,and CUSTOMERNAME as its attributes. OrderID is added into this entity as the primary key. The names and cases of the attributes are changed as follows for convenience sake:

订单实体应包括ORDERNUMBERORDERDATEPRODUCTCODESTATUSQUARTER_IDMONTH_IDYEAR_IDCUSTOMERNAME作为其属性。 OrderID作为主键添加到该实体中。 为了方便起见,属性的名称和大小写更改如下:

Image for post

In order to remove duplicate rows generated by including Status and CustomerName in this entity, the entities for Status and CustomerNames are created separately and are linked to the entity for Orders using foreign keys. The entity for CustomerNames (which is equivalent to the Customers) was already created previously. The relationships of Orders with Status and Customers are Many-to-One, respectively.

为了删除通过在此实体中包含StatusCustomerName生成的重复行,分别创建了StatusCustomerNames的实体,并使用外键将其链接到Order的实体。 CustomerNames的实体(等同于Customer )是先前已创建的。 订单状态客户的关系分别是多对一的

Image for post
The Crow’s Foot Notations for Orders, Customers, and Status
订单,客户和状态的鱼尾纹

订单和产品 (Orders and Products)

Each order contains a list of products in it. This means that Orders and Products are related to each other. What is the relationship between these two entries? Each product can be ordered by multiple orders and each order can contain multiple products. Therefore, the relationship is many-to-many.

每个订单中都包含一个产品列表。 这意味着订单产品相互关联。 这两个条目之间是什么关系? 每个产品可以按多个订单订购,每个订单可以包含多个产品。 因此,这种关系是多对多的

Image for post
The Crow’s Foot Notation for Orders and Products 1
订单和产品的鱼尾纹1

The way to create tables which are in the many-to-many relationship is to create a junction table between the two original tables and then to make one-to-many relationships between the original tables and the junction table. The junction table is the table for the combination of Orders and Products, and then one-to-many relationships are made between Orders and the junction, and between Products and the junction. The primary key for this junction entity should be the composite of OrderID and ProductID.

创建作为在许多一对多的关系表的方法是将原来的两个表之间创建一个表,然后做一个对许多原始表和联接表之间的关系。 联结表是OrdersProducts组合的表,然后在Orders和联结之间以及Products和联结之间建立一对多关系。 此联结实体的主键应为OrderID的组合 ProductID

The remaining columns which are not included in any entity so far are QUANTITYORDERED, PRICEEACH, ORDERLINENUMBER, SALES, and DEALSIZE. These columns cannot be determined by Orders or Products by themselves. For example, the quantity of a product in an order cannot be identified by an order or a product alone. Only the combination of an order and a product can identify it. Therefore, those columns are included in the junction entity as its attributes. The entities and their relationships for Orders and Products are as follows:

到目前为止,尚未包含在任何实体中的其余列为QUANTITYORDEREDPRICEEACHORDERLINENUMBERSALESDEALSIZE 。 这些列不能由订单产品自行确定。 例如,订单中的产品数量无法通过订单或单独的产品来识别。 只有订单和产品的组合才能识别它。 因此,这些列作为其属性包含在联结实体中。 订单产品的实体及其关系如下:

Image for post
The Crow’s Foot Notations for Orders and Products 2
订单和产品的鱼尾纹2

Due to duplicate rows, the entity for DealSizes is separately created and linked to the entity for Orders_Products using foreign key for SizeID.

由于重复行,为DealSizes实体被单独创建并链接到实体使用SizeID外键Orders_Products。

整个ERD (The Entire ERD)

Now, we completed creating all the entities for this data and making relationships between them. The entire ERD is as follows:

现在,我们完成了为该数据创建所有实体并在它们之间建立关系。 整个ERD如下:

Image for post
The Final Version of the ERD for Sample Sales Data
样品销售数据ERD的最终版本

The usefulness of the ERD is that it makes us to easily understand the relationships between the columns (or the entities) in the data set and to see the entire structure of the data set at a glance.

ERD的有用之处在于,它使我们能够轻松了解数据集中的列(或实体)之间的关系,并一目了然地看到数据集的整个结构。

下一步是什么? (What’s Next?)

This article covered how to create an Entity Relationship Diagram using a data set. The data for this practice is Sample Sales Data obtained from the Kaggle dataset. It is briefly covered how to divide the original data set into entities and make relationships between the entities.

本文介绍了如何使用数据集创建实体关系图。 该实践的数据是从Kaggle数据集获得的样本销售数据 。 简要介绍了如何将原始数据集划分为实体以及如何在实体之间建立关系。

In the next article, the practice of converting a data in a single spread sheet to a relational database by using MySQL will be covered.

在下一篇文章中 ,将介绍通过使用MySQL将单个电子表格中的数据转换为关系数据库的实践。

翻译自: https://medium.com/swlh/creating-a-database-converting-a-spreadsheet-to-a-relational-database-part-1-2a9a228bf77a

电子表格转换成数据库

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/389965.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

【Vue.js学习】生命周期及数据绑定

一、生命后期 官网的图片说明: Vue的生命周期总结 var app new Vue({el:"#app", beforeCreate: function(){console.log(1-beforeCreate 初始化之前);//加载loading},created: function(){console.log(2-created 创建完成);//关闭loading},be…

Springboot(2.0.0.RELEASE)+spark(2.1.0)框架整合到jar包成功发布(原创)!!!

一、前言 首先说明一下,这个框架的整合可能对大神来说十分容易,但是对我来说十分不易,踩了不少坑。虽然整合的时间不长,但是值得来纪念下!!!我个人开发工具比较喜欢IDEA,创建的sprin…

求一个张量的梯度_张量流中离散策略梯度的最小工作示例2 0

求一个张量的梯度Training discrete actor networks with TensorFlow 2.0 is easy once you know how to do it, but also rather different from implementations in TensorFlow 1.0. As the 2.0 version was only released in September 2019, most examples that circulate …

zabbix网络发现主机

1 功能介绍 默认情况下,当我在主机上安装agent,然后要在server上手动添加主机并连接到模板,加入一个主机组。 如果有很多主机,并且经常变动,手动操作就很麻烦。 网络发现就是主机上安装了agent,然后server自…

python股市_如何使用python和破折号创建仪表板来主导股市

python股市始终关注大局 (Keep Your Eyes on the Big Picture) I’ve been fascinated with the stock market since I was a little kid. There is certainly no shortage of data to analyze, and if you find an edge you can make some easy money. To stay on top of the …

阿里巴巴开源 Sentinel,进一步完善 Dubbo 生态

为什么80%的码农都做不了架构师?>>> 阿里巴巴开源 Sentinel,进一步完善 Dubbo 生态 Sentinel 开源地址:https://github.com/alibaba/Sentinel 转载于:https://my.oschina.net/dyyweb/blog/1925839

离群值如何处理_有理处理离群值的局限性

离群值如何处理ARIMA models can be quite adept when it comes to modelling the overall trend of a series along with seasonal patterns.ARIMA模型可以很好地建模一系列总体趋势以及季节性模式。 In a previous article titled SARIMA: Forecasting Seasonal Data with P…

10生活便捷:购物、美食、看病时这样搜,至少能省一半心

本次课程介绍实实在在能够救命、省钱的网站,解决了眼前这些需求后,还有“诗和远方”——不花钱也能点亮自己的生活,获得美的享受! 1、健康医疗这么搜,安全又便捷 现在的医疗市场确实有些混乱,由于医疗的专业…

ppt图表图表类型起始_梅科图表

ppt图表图表类型起始There are different types of variable width bar charts but two are the most popular: 1) Bar Mekko chart; 2) Marimekko chart.可变宽度条形图有不同类型,但最受欢迎的有两种:1)Mekko条形图; 2)Marimekko图表。 Th…

Tomcat日志乱码了怎么处理?

【前言】 tomacat日志有三个地方,分别是Output(控制台)、Tomcat Localhost Log(tomcat本地日志)、Tomcat Catalina Log。 启动日志和大部分报错日志、普通日志都在output打印;有些错误日志,在Tomcat Localhost Log。 三个日志显示区,都可能…

5888. 网络空闲的时刻

5888. 网络空闲的时刻 给你一个有 n 个服务器的计算机网络,服务器编号为 0 到 n - 1 。同时给你一个二维整数数组 edges ,其中 edges[i] [ui, vi] 表示服务器 ui 和 vi 之间有一条信息线路,在 一秒 内它们之间可以传输 任意 数目的信息。再…

django框架预备知识

内容: 1.web预备知识 2.django介绍 3.web框架的本质及分类 4.django安装与基本设置 1.web预备知识 HTTP协议:https://www.cnblogs.com/wyb666/p/9383077.html 关于web的本质:http://www.cnblogs.com/wyb666/p/9034042.html 如何自定义web框架…

现实世界 机器学习_公司沟通分析简介现实世界的机器学习方法

现实世界 机器学习In my previous posts I covered analytical subjects from a scientific point of view, rather than an applied real world problem. For this reason, this article aims at approaching an analytical idea from a managerial point of view, rather tha…

拷贝构造函数和赋值函数

1、拷贝构造函数:用一个已经有的对象构造一个新的对象。 CA(const CA & c )函数的名称必须和类名称相一致,它的唯一的一个参数是本类型的一个引用变量,该参数是const 类型,不可变。 拷贝构造函数什么时…

Chrome keyboard shortcuts

2019独角兽企业重金招聘Python工程师标准>>> Chrome keyboard shortcuts https://support.google.com/chrome/answer/157179?hlen 转载于:https://my.oschina.net/qwfys200/blog/1927456

数据中心细节_当细节很重要时数据不平衡

数据中心细节定义不平衡数据 (Definition Imbalanced Data) When we speak of imbalanced data, what we mean is that at least one class is underrepresented. For example, when considering the problem of building a classifier, let’s call it the Idealisstic-Voter.…

辛普森悖论_所谓的辛普森悖论

辛普森悖论We all know the Simpsons family from Disneyland, but have you heard about the Simpson’s Paradox from statistic theory? This article will illustrate the definition of Simpson’s Paradox with an example, and show you how can it harm your statisti…

查看NVIDIA使用率工具目录

2019独角兽企业重金招聘Python工程师标准>>> C:\Program Files\NVIDIA Corporation\Display.NvContainer\NVDisplay.Container.exe 转载于:https://my.oschina.net/u/2430809/blog/1927560

余弦相似度和欧氏距离_欧氏距离和余弦相似度

余弦相似度和欧氏距离Photo by Markus Winkler on UnsplashMarkus Winkler在Unsplash上拍摄的照片 This is a quick and straight to the point introduction to Euclidean distance and cosine similarity with a focus on NLP.这是对欧氏距离和余弦相似度的快速而直接的介绍&…

七、 面向对象(二)

匿名类对象 创建的类的对象是匿名的。当我们只需要一次调用类的对象时,我们就可以考虑使用匿名的方式创建类的对象。特点是创建的匿名类的对象只能够调用一次! package day007;//圆的面积 class circle {double radius;public double getArea() {// TODO…