Python has become the go-to programming language for many data scientists and machine learning researchers. One essential data processing tool behind this choice is the pandas library. Indeed, the pandas library is so versatile that it can handle almost all of the initial data manipulation needed to get data ready for statistical analysis or for building machine learning models.
However, that same versatility can make it overwhelming to get comfortable with. If you're struggling to get started, this article is for you. Rather than covering so much that the key points get lost, the goal of this article is to provide an overview of the key operations you'll want in your daily data processing tasks. Each key operation is accompanied by highlights of the essential parameters to consider.
Certainly, the first step is to get pandas installed, which you can do with pip or conda by following the instructions here. In terms of coding environment, I recommend Visual Studio Code, JupyterLab, or Google Colab, all of which require little effort to set up. When the installation is done, you should be able to import pandas into your project in your preferred coding environment.
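For example, either of the following commands installs pandas (pick the one matching your package manager), after which the import below should work:

# Install pandas with pip or with conda
pip install pandas
conda install pandas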
import pandas as pd
If you run the above code without encountering any errors, you’re good to go.
1. Read External Data
In most cases, we read data from external sources. If our data are in a spreadsheet-like format, the following functions should serve the purpose.
# Read a comma-separated file
df = pd.read_csv("the_data.csv")

# Read an Excel spreadsheet
df = pd.read_excel("the_data.xlsx")
- The header should be handled correctly. By default, reading assumes the first line of the data contains the column names. If there are no headers, you have to say so (e.g., header=None).
- If you're reading a tab-delimited file, you can still use read_csv by specifying the tab as the delimiter (e.g., sep="\t").
- When you read a large file, it's a good idea to start by reading a small portion of the data. In that case, you can set the number of rows to be read (e.g., nrows=1000).
- If your data involve dates, you can consider setting arguments to get the dates right, such as parse_dates and infer_datetime_format (see the sketch after this list).
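These options can also be combined in a single call. A minimal sketch, assuming a hypothetical headerless, tab-delimited file whose first column holds dates:

# Read only the first 1,000 rows of a tab-delimited file, parsing dates
df = pd.read_csv(
    "the_data.tsv",             # hypothetical file name
    sep="\t",
    header=None,
    nrows=1000,
    parse_dates=[0],            # assume the first column contains dates
    infer_datetime_format=True,
)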
2. Create Series
During the process of cleaning up your data, you may need to create Series yourself. In most cases, you’ll simply pass an iterable to create a Series object.
# Create a Series from an iterable
integers_s = pd.Series(range(10))

# Create a Series from a dictionary object
squares = {x: x*x for x in range(1, 5)}
squares_s = pd.Series(squares)
- You can assign a name to the Series object by setting the name argument. This name will become the column name if the Series becomes part of a DataFrame object.
- You can also assign an index to the Series (e.g., by setting the index argument) if you find it more useful than the default 0-based index (see the sketch after this list). Note that the index's length should match your data's length.
- If you create a Series object from a dict, the keys will become the index.
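As a minimal sketch of both arguments (the values and labels are made up):

# A named Series with custom index labels
scores = pd.Series([88, 92, 75], index=["alice", "bob", "carol"], name="score")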
3. Construct DataFrame
Oftentimes, you need to create DataFrame objects using Python built-in objects, such as lists and dictionaries. The following code snippet highlights two common use scenarios.
# Create a DataFrame from a dictionary of lists as values
data_dict = {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}
data_df0 = pd.DataFrame(data_dict)

# Create a DataFrame from a list of lists
data_list = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
data_df1 = pd.DataFrame(data_list, columns=tuple('abc'))
The first one uses a dict object. Its keys will become column names while its values will become values for the corresponding columns.
The second one uses a list object. Unlike the previous method, the constructed DataFrame will use the data row-wise, which means that each inner list will become a row of the created DataFrame object.
4. Overview of DataFrame
When we have the DataFrame to work with, we may want to take a look at the dataset at the 30,000-foot level. There are several common methods you can use, as shown below.
# Find out how many rows and columns the DataFrame has
df.shape

# Take a quick peek at the beginning and the end of the data
df.head()
df.tail()

# Get a random sample
df.sample(5)

# Get the information of the dataset
df.info()

# Get the descriptive stats of the numeric values
df.describe()
- It's important to check both the head and the tail, especially if you work with a large amount of data, because you want to make sure that all of your data have been read completely.
- The info() function will give you an overview of the columns in terms of data types and item counts.
- It's also interesting to get a random sample of your data to check your data's integrity (i.e., the sample() function).
5. Rename Columns
You may notice that some columns of your data don't make much sense, or that their names are too long to work with, and you want to rename these columns.
# Rename columns using a mapping
df.rename({'old_col0': 'col0', 'old_col1': 'col1'}, axis=1)

# Rename columns by specifying the columns argument directly
df.rename(columns={'old_col0': 'col0', 'old_col1': 'col1'})
- You need to specify axis=1 to rename columns if you simply provide a mapping object (e.g., a dict).
- Alternatively, you can specify the mapping object with the columns argument explicitly.
- The rename function will create a new DataFrame by default. If you want to rename the DataFrame in place, you need to specify inplace=True.
6. Sort Data
To make your data more structured, you need to sort the DataFrame object.
# Sort data
df.sort_values(by=['col0', 'col1'])
- By default, the sort_values function sorts your rows (axis=0). In most cases, we use columns as the sorting keys.
- The sort_values function will create a sorted DataFrame object by default. To change that, use inplace=True.
- By default, the sorting is ascending for all sorting keys. If you want descending order, specify ascending=False. If you want mixed orders (e.g., some keys ascending and some descending), you can pass a list of boolean values matching the number of keys, like by=['col0', 'col1', 'col2'], ascending=[True, False, True] (see the sketch after this list).
- The original index values will travel with their old data rows. In many cases, you'll want to re-index. Instead of calling the reset_index function directly, you can set ignore_index to True, which will reset the index for you after the sorting is completed.
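A minimal sketch combining these options (column names follow the example above):

# Sort by col0 ascending and col1 descending, then reset to a fresh 0-based index
df.sort_values(by=['col0', 'col1'], ascending=[True, False], ignore_index=True)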
7. Deal With Duplicates
Real-life datasets commonly contain duplicate records, whether from human mistakes or database glitches. We want to remove these duplicates because they can cause unexpected problems later on.
# To examine whether there are duplicates using all columns
df.duplicated().any()

# To examine whether there are duplicates using particular columns
df.duplicated(['col0', 'col1']).any()
The above functions return a boolean value telling you whether any duplicate records exist in your dataset. To find out the exact number of duplicate records, you can apply the sum() function to the Series object of boolean values returned by the duplicated() function (Python treats True as a value of 1), as shown below. One additional thing to note is that when the argument keep is set to False, all duplicates are marked as True. Suppose there are two identical records: with keep=False, both records are marked as True (i.e., flagged as duplicated), whereas with keep="first" or keep="last", the first or the last occurrence, respectively, is left unmarked and the other one is marked as True.
# To find out the number of duplicates
df.duplicated().sum()
df.duplicated(keep=False).sum()
To actually view the duplicated records, you need to select data from the original dataset using the generated duplicated Series object, as shown below.
# Get the duplicate records
duplicated_indices = df.duplicated(['col0', 'col1'], keep=False)
duplicates = df.loc[duplicated_indices, :].sort_values(by=['col0', 'col1'], ignore_index=True)
- We get all duplicate records by setting the keep argument to False.
- To better view the duplicate records, you may want to sort the generated DataFrame using the same set of keys.
Once you have a good idea about the duplicate records in your dataset, you can drop them as shown below.
# Drop the duplicate records
df.drop_duplicates(['col0', 'col1'], keep="first", inplace=True, ignore_index=True)
- By default, the kept record will be the first of the duplicates.
- You'll need to specify inplace=True if you want to update the DataFrame in place. BTW: many other functions have this option, and its discussion will be skipped most of the time, if not always.
- As with the sort_values() function, you may want to reset the index afterwards by specifying the ignore_index argument (a new feature in pandas 1.0).
8. Handle Missing Data
Missing data are common in real-life datasets, whether because a measure wasn't available or because human data entry mistakes produced meaningless values that are treated as missing. To get an overall idea of how many missing values your dataset has, you've already seen that the info() function tells us how many non-null values each column has. We can also get information about data missingness in a more structured fashion, as shown below.
# Find out how many missing values for each column
df.isnull().sum()

# Find out how many missing values for the entire dataset
df.isnull().sum().sum()
- The isnull() function creates a DataFrame of the same shape as your original DataFrame, with each value indicating whether the original value is missing (True) or not (False). As a related note, you can use the notnull() function if you want to generate a DataFrame indicating the non-null values.
- As mentioned previously, a True value in Python is arithmetically equal to 1. The sum() function will compute the sum of these boolean values for each column (by default, it calculates the sum column-wise), which reflects the number of missing values.
Once you have some idea of the missingness in your dataset, you'll usually want to deal with it. Possible solutions include dropping the records with any missing values or filling them with applicable values.
# Drop the rows with any missing values
df.dropna(axis=0, how="any")

# Drop the rows that don't have at least 2 non-null values
df.dropna(thresh=2)

# Drop the columns with all values missing
df.dropna(axis=1, how="all")
- By default, the dropna() function works row-wise (i.e., axis=0), dropping rows. If you specify how="any", rows with any missing values will be dropped.
- When you set the thresh argument, it requires that the row (or the column when axis=1) have at least that many non-missing values in order to be kept.
- As with many other functions, when you set axis=1, you're performing the operation column-wise. In this case, the function call above will remove the columns whose values are all missing.
Besides the operation of dropping data rows or columns with missing values, it’s also possible to fill the missing values with some values, as shown below.
# Fill missing values with 0 (or any other applicable value)
df.fillna(value=0)

# Fill the missing values with a customized mapping for columns
df.fillna(value={"col0": 0, "col1": 999})

# Fill missing values with the next valid observation
df.fillna(method="bfill")

# Fill missing values with the last valid observation
df.fillna(method="ffill")
To fill the missing values with specified values, you either set the value argument to a single fixed value for all columns, or pass a dict object that directs the filling on a per-column basis.
- Alternatively, you can fill the missing values by using the existing observations surrounding the missing holes, with either backward fill or forward fill.
9. Descriptive Statistics by Group
When you conduct machine learning research or data analysis, it's often necessary to perform particular operations with some grouping variables. In this case, we need to use the groupby() function. The following code snippet shows you some common scenarios.
# Get the count by group, a 2 by 2 example
df.groupby(['col0', 'col1']).size()

# Get the mean of all applicable columns by group
df.groupby(['col0']).mean()

# Get the mean for a particular column
df.groupby(['col0'])['col1'].mean()

# Request multiple descriptive stats
df.groupby(['col0', 'col1']).agg({
    'col2': ['min', 'max', 'mean'],
    'col3': ['nunique', 'mean']
})
- By default, the groupby() function returns a GroupBy object. If you want to convert the result to a DataFrame, you can call reset_index() on it. Alternatively, you can specify as_index=False in the groupby() call to create a DataFrame directly (see the sketch after this list).
- The size() function is useful if you want to know the frequency of each group.
- The agg() function allows you to generate multiple descriptive statistics. You can simply pass a set of function names, which will apply to all columns, or pass a dict object with the functions to apply to specific columns.
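A minimal sketch of the two ways to get a plain DataFrame back (column names follow the examples above):

# Call reset_index() on the aggregated result
df.groupby(['col0'])['col1'].mean().reset_index()

# Or ask groupby() for a DataFrame directly
df.groupby(['col0'], as_index=False)['col1'].mean()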
10. Wide to Long Format Transformation
Depending on how the data are collected, the original dataset may be in the "wide" format, where each row represents a data record with multiple measures (e.g., different time points for a subject in a research study). If we want to convert the "wide" format to the "long" format (e.g., each time point becomes a data row, and thus a subject has multiple rows), we can use the melt() function, as shown below.
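A minimal sketch of what this can look like, assuming a small wide dataset like the df_wide used in the later sections (the subjects and measurement values are made up):

# Hypothetical wide data: one row per subject, one column per measurement
df_wide = pd.DataFrame({
    'subject': [100, 101, 102],
    'before_meds': [140, 150, 135],
    'after_meds': [120, 130, 125],
})

# Unpivot: each measurement becomes its own row
df_long = df_wide.melt(
    id_vars='subject',
    value_vars=['before_meds', 'after_meds'],
    var_name='treatment',
    value_name='measurement',
)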
- The melt() function is essentially "unpivoting" a data table (we'll talk about pivoting next). You specify id_vars as the columns that are used as identifiers in the original dataset.
- The value_vars argument is set using the columns that contain the values. By default, those column names will become the values of the var_name column in the melted dataset.
11. Long to Wide Format Transformation
The opposite operation to the melt() function is called pivoting, which we can realize with the pivot() function. Suppose that the "long" format DataFrame created above is called df_long. The following call shows you how we can convert the long format back to the wide format, basically reversing the process from the previous section.
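Continuing the sketch above, with the same assumed column names:

# Pivot back: one row per subject, one column per treatment
df_wide_again = df_long.pivot(
    index='subject',
    columns='treatment',
    values='measurement',
).reset_index()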
Besides the pivot() function, a closely related function is the pivot_table() function, which is more general than the pivot() function in that it allows duplicate index or columns (see here for a more detailed discussion).
12. Select Data
When we work with a complex dataset, we often need to select a subset of the dataset for particular operations based on some criteria. If you want to select certain columns, the following code shows you how to do it. The selected data will include all the rows.
# Select a column
df_wide['subject']

# Select multiple columns
df_wide[['subject', 'before_meds']]
If you want to select certain rows with all columns, do the following.
# Select rows with a specific condition
df_wide[df_wide['subject'] == 100]
What if you want to select certain rows and columns? In that case, we should consider using the iloc or loc methods. The major difference between these methods is that the iloc method uses the 0-based positional index, while the loc method uses labels.
- Pairs of iloc and loc calls can produce exactly the same output (see the sketch after this list).
- When you use slice objects with iloc, the stop index isn't included, just as with regular Python slice objects. However, slice objects do include the stop index in the loc method.
- When you use a boolean array with iloc, you need to use the actual values (the values attribute, which returns the underlying numpy array). If you don't do that, you'll probably encounter the following error: NotImplementedError: iLocation based boolean indexing on an integer type is not available.
- The use of labels in the loc method happens to select the same rows as positional indexing here, because the index labels have the same values as the positions. In other words, iloc always uses the 0-based position, regardless of the numeric values of the index labels.
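A minimal sketch of these points, using the hypothetical df_wide from the sketches above:

# These two calls select the same rows and columns
df_wide.iloc[0:2, 0:2]                        # positions: rows 0-1 (stop excluded)
df_wide.loc[0:1, ['subject', 'before_meds']]  # labels: rows 0 and 1 (stop included)

# Boolean indexing: loc accepts a boolean Series, iloc needs the underlying numpy array
mask = df_wide['subject'] == 100
df_wide.loc[mask, :]
df_wide.iloc[mask.values, :]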
13. New Columns Using Existing Data (map and apply)
Existing columns don't always present the data in the format we want. Thus, we often need to generate new columns using existing data. Two functions are particularly useful in this case: map() and apply(). There are too many possible ways to use them for creating new columns to cover here. For instance, the apply() function can take a more complex mapping function and can create multiple columns. I'll just show you the two most common use cases with the following rules of thumb. Let's keep our goal simple: just create one column in either use case.
- If your data conversion involves just one column, simply use the map() function on that column (in essence, it's a Series object).
- If your data conversion involves multiple columns, use the apply() function (see the sketch after this list).
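A minimal sketch of both cases, again using the hypothetical df_wide (the conversions themselves are made up for illustration):

# map(): a conversion that involves a single column
df_wide['subject_label'] = df_wide['subject'].map(lambda x: f"S{x}")

# apply(): a conversion that involves multiple columns (axis=1 gives row-wise access)
df_wide['change'] = df_wide.apply(lambda row: row['before_meds'] - row['after_meds'], axis=1)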
- In both cases, I used lambda functions. However, you can use regular functions. It's also possible to provide a dict object to the map() function, which will map the old values to the new values based on its key-value pairs, with the keys being the old values and the values being the new values.
- For the apply() function, when we create a new column this way, we need to specify axis=1, because we're accessing the data row-wise.
- For the apply() function, the example shown is intended for demonstration purposes, because I could've used the original columns to do a simpler arithmetic subtraction like this: df_wide['change'] = df_wide['before_meds'] - df_wide['after_meds'].
14. Concatenation and Merging
When we have multiple datasets, it's necessary to put them together from time to time. There are two common scenarios. The first is when you have datasets of similar shape, sharing either the same index or the same columns; in that case, you can consider concatenating them directly. The following code shows you some possible concatenations.
# When the data have the same columns, concatenate them vertically
dfs_a = [df0a, df1a, df2a]
pd.concat(dfs_a, axis=0)

# When the data have the same index, concatenate them horizontally
dfs_b = [df0b, df1b, df2b]
pd.concat(dfs_b, axis=1)
- By default, the concatenation performs an "outer" join, which means that if there are any non-overlapping index entries or columns, all of them will be kept. In other words, it's like creating a union of two sets.
- Another thing to remember: if you need to concatenate multiple DataFrame objects, it's recommended to collect them in a list and perform the concatenation just once, rather than concatenating sequentially and generating intermediate DataFrame objects.
- If you want to reset the index for the concatenated DataFrame, you can set the ignore_index=True argument.
The other scenario is to merge datasets that have one or two overlapping identifiers. For instance, one DataFrame has id number, name and gender, and the other has id number and transaction records. You can merge them using the id number column. The following code shows you how to merge them.
# Merge DataFrames that have the same merging keys
df_a0 = pd.DataFrame(dict(), columns=['id', 'name', 'gender'])
df_b0 = pd.DataFrame(dict(), columns=['id', 'name', 'transaction'])
merged0 = df_a0.merge(df_b0, how="inner", on=["id", "name"])

# Merge DataFrames that have different merging keys
df_a1 = pd.DataFrame(dict(), columns=['id_a', 'name', 'gender'])
df_b1 = pd.DataFrame(dict(), columns=['id_b', 'transaction'])
merged1 = df_a1.merge(df_b1, how="outer", left_on="id_a", right_on="id_b")
- When both DataFrame objects share the same key or keys, you can simply specify them (either one or multiple is fine) using the on argument.
- When they have different names, you can specify which one to use for the left DataFrame and which one for the right DataFrame (left_on and right_on).
- By default, the merging uses the inner join method. When you want another join method (e.g., left, right, outer), set the proper value for the how argument.
15. Drop Columns
Although you can keep all the columns in the DataFrame by renaming them to avoid any conflict, sometimes you'd like to drop some columns to keep the dataset clean. In this case, you should use the drop() function.
# Drop the unneeded columns
df.drop(['col0', 'col1'], axis=1)
- By default, the drop() function uses labels to refer to columns or index entries, and thus you may want to make sure that the labels are contained in the DataFrame object.
- To drop rows by index, you use axis=0. If you drop columns, which I find to be more common, you use axis=1.
- Again, this operation creates a new DataFrame object, and if you prefer changing the original DataFrame, you specify inplace=True.
16. Write to External Files
When you want to communicate data with your collaborators or teammates, you need to write your DataFrame objects to external files. In most cases, the comma-delimited files should serve the purposes.
# Write to a csv file, which will keep the index
df.to_csv("filename.csv")

# Write to a csv file without the index
df.to_csv("filename.csv", index=False)

# Write to a csv file without the header
df.to_csv("filename.csv", header=False)
- By default, the generated file will keep the index. You need to specify index=False to remove the index from the output.
- By default, the generated file will keep the header (i.e., the column names). You need to specify header=False to remove the headers.
Conclusion
In this article, we reviewed the basic operations that you'll find useful for getting started with the pandas library. As indicated by the article's title, these techniques aren't intended to handle data in a fancy way. Instead, they're all basic techniques that allow you to process your data the way you want. Later on, you can probably find fancier ways to get some of these operations done.
Translated from: https://towardsdatascience.com/nothing-fancy-but-16-essential-operations-to-get-you-started-with-pandas-5b0c2f649068