I want to split the data and get values by rows and columns
【Title】: I want to split the data and get value by rows and columns 【Posted】: 2019-11-04 20:51:02 【Question】: I want to split the dataset by rows, keeping all columns, in an 80:20 ratio, where 80% is training data and 20% is test data. I can extract the 80% part, but I cannot get the 20% part.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
city_attributes = pd.read_csv('./input/city_attributes.csv')
humidity = pd.read_csv('./input/humidity.csv')
pressure = pd.read_csv('./input/pressure.csv')
temperature = pd.read_csv('./input/temperature.csv')
weather_description = pd.read_csv('./input/weather_description.csv')
wind_direction = pd.read_csv('./input/wind_direction.csv')
wind_speed = pd.read_csv('./input/wind_speed.csv')
# we can reshape these using pd.melt
humidity = pd.melt(humidity, id_vars = ['datetime'], value_name = 'humidity', var_name = 'City')
pressure = pd.melt(pressure, id_vars = ['datetime'], value_name = 'pressure', var_name = 'City')
temperature = pd.melt(temperature, id_vars = ['datetime'], value_name = 'temperature', var_name = 'City')
weather_description = pd.melt(weather_description, id_vars = ['datetime'], value_name = 'weather_description', var_name = 'City')
wind_direction = pd.melt(wind_direction, id_vars = ['datetime'], value_name = 'wind_direction', var_name = 'City')
wind_speed = pd.melt(wind_speed, id_vars = ['datetime'], value_name = 'wind_speed', var_name = 'City')
# combine all of the dataframes created above
weather = pd.concat([humidity, pressure, temperature, wind_direction, wind_speed, weather_description], axis = 1)
weather = weather.loc[:,~weather.columns.duplicated()] # indexing: every row, only the columns that aren't duplicates
# now we can merge this with the city attributes
weather = pd.merge(city_attributes,weather, on = 'City')
weather = weather.dropna()
first = pd.DataFrame()
rest = pd.DataFrame()
total_size = weather.shape[0]
train_size = 1277055
test_size = 319264
if len(weather) > train_size:
    first = weather[:1277055]
    rest = weather[319264:]
print(rest)
[Image: test data output]
[Image: train data output]
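【Comments】:
What error or unexpected result do you get? You imported train_test_split but never use it. That function should do exactly what you need.
With train_test_split the data can be split by columns but not by rows; I have tested it.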
【Answer 1】:
Currently your code
train_size = 1277055
test_size = 319264
if len(weather) > train_size:
    first = weather[:1277055]
    rest = weather[319264:]
defines rest as every row from position 319264 onward, so it overlaps most of first, while first is correctly the first 1277055 rows. Perhaps you want
train_size = 1277055
test_size = 319264
if len(weather) >= (train_size + test_size):
    first = weather.iloc[:train_size, :]
    # the test rows start at position train_size, right after the last training row
    rest = weather.iloc[train_size:(train_size + test_size), :]  # same as weather.iloc[1277055:1596319, :]
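A toy version of the same slicing may make the difference easier to see. This is only a sketch: the 10-row frame, the `value` column, and the 8/2 sizes are made up for illustration. It compares a test slice that starts at `test_size`, as in the question, with one that starts at `train_size`.

```python
import pandas as pd

# Hypothetical 10-row stand-in for the weather frame.
weather = pd.DataFrame({"value": range(10)})
train_size = 8
test_size = 2

# Training rows: positions 0..7.
first = weather.iloc[:train_size]

# Starting the slice at test_size (as in the question) keeps positions 2..9.
overlapping_rest = weather.iloc[test_size:]

# Starting the slice at train_size keeps only positions 8..9.
rest = weather.iloc[train_size:train_size + test_size]

# Count how many rows each candidate test set shares with the training set.
shared_bad = first.index.intersection(overlapping_rest.index)
shared_good = first.index.intersection(rest.index)

print(len(overlapping_rest), len(shared_bad))
print(len(rest), len(shared_good))
```

The start of the test slice is a position in the frame, so it has to be the position where the training slice ended, not the number of test rows.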
Or use sklearn's train_test_split:
train_size = 1277055
test_size = 319264
train_idx, test_idx = train_test_split(weather.index, train_size = train_size , test_size = test_size )
# train_idx and test_idx hold index labels (not positions), so select with .loc
df_train = weather.loc[train_idx, :]
df_test = weather.loc[test_idx, :]
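As a small check that `train_test_split` partitions rows rather than columns, the sketch below passes a DataFrame directly and uses a fractional `test_size`. The 10-row frame and the `random_state` value are assumptions chosen for illustration; they are not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the merged weather frame: 10 rows, 4 columns.
weather = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))

# test_size=0.2 reserves 20% of the rows for testing; the rest go to training.
# random_state only makes the shuffle reproducible.
df_train, df_test = train_test_split(weather, test_size=0.2, random_state=0)

print(df_train.shape, df_test.shape)
```

Both results keep all four columns. Because rows are shuffled by default, passing `shuffle=False` may be preferable when the time order of the data should be preserved.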
Example usage:
In [1]: import numpy as np
...: import pandas as pd
...: train_size = 1277055
...: test_size = 319264
...: weather = pd.DataFrame(np.random.randint(0,100,size=(train_size+test_size, 4)), columns=list('ABCD'))
...: print(weather.head())
A B C D
0 13 91 68 35
1 52 30 52 59
2 16 22 73 24
3 62 86 27 96
4 88 54 23 4
In [2]: if len(weather) >= (train_size + test_size):
   ...:     print('subsetting')
   ...:     first = weather.iloc[:train_size, :]
   ...:     rest = weather.iloc[train_size:(train_size + test_size), :]
   ...:
   ...: print(first.shape)
   ...: print(rest.shape)
   ...:
subsetting
(1277055, 4)
(319264, 4)
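Hard-coded sizes only work when the frame really has that many rows. One possible alternative is to derive the cut point from the frame's length. The helper name `split_rows` and its `train_fraction` parameter below are hypothetical, and the 10-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def split_rows(df, train_fraction=0.8):
    """Return (train, test) as consecutive, non-overlapping row blocks."""
    # The cut point is a position derived from the actual number of rows.
    cut = int(len(df) * train_fraction)
    return df.iloc[:cut], df.iloc[cut:]


# Hypothetical stand-in for the weather frame.
weather = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))
first, rest = split_rows(weather)

print(first.shape, rest.shape)
```

With this approach the two parts always add up to the full frame, whatever `weather.shape` turns out to be after `dropna()`.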
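【Comments】:
For the first version you mentioned, I get the following output: Empty DataFrame Columns: [] Index: []
Can you tell me which of the two proposed versions you used? Also, did you insert the code into your original code, replacing the if len(weather)... block?
I used this one, and yes, I inserted the code into the original code. Maybe some correction is needed here: first = pd.DataFrame() rest = pd.DataFrame() if len(weather) > (train_size + test_size):
OK. As a side note, keep in mind that `first = pd.DataFrame()` and `rest = pd.DataFrame()` are not needed, because subsetting the weather df returns new df objects. What is the total size of your weather df?
Are you sure the weather df has at least (train_size + test_size) rows? What is the output of weather.shape?
【Answer 2】: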
To split an array at position x, use
left = array[:x]
right = array[x:]
with the same x in both slices, because x is a position, not a count.
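The same rule can be sketched with a plain Python list. The 10-element list and the 80% cut are assumptions chosen to mirror the 80:20 split in the question.

```python
# Hypothetical data: ten items standing in for ten rows.
rows = list(range(10))

# x is the position where the first part ends and the second part begins.
x = int(len(rows) * 0.8)

left = rows[:x]   # items at positions 0 .. x-1
right = rows[x:]  # items at positions x .. end

print(left)
print(right)
```

Because both slices use the same x, every item lands in exactly one of the two parts.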
【Comments】: