Pandas
Subset a df according to a string present in column names
df.loc[:, df.columns.str.startswith('alp')]
df.loc[:, df.columns.str.contains('alp')]
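str.contains also accepts a case flag, so a case-insensitive version of the same selection could look like this ('alp' is just the placeholder pattern from above):
df.loc[:, df.columns.str.contains('alp', case=False)]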
Select a df according to column positions
Here from position 7 (the eighth column, since iloc is zero-based) to the end
df.iloc[:, 7:]
Rename columns
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
Rename columns or index
df_new = df.rename(columns={'A': 'Col_1'}, index={'ONE': 'Row_1'})
This renames index labels; it didn't rename the index name itself for me, but this does:
df_merged_selected.index.rename('Row ID', inplace=True)
Rename columns by position
df.rename(columns={df.columns[1]: "your value"}, inplace=True)
Add prefix (or suffix) to colnames
DataFrame.add_prefix(prefix)
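DataFrame.add_suffix(suffix) is the counterpart; for example, with arbitrary prefix/suffix strings:
df.add_prefix('item_')
df.add_suffix('_raw')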
Check df datatype
df.dtypes
Convert to a specific type
df.year.astype(int)
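astype returns a new Series rather than converting in place, so assign the result back:
df['year'] = df['year'].astype(int)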
From continuous to categorical
pd.cut(df.Age,bins=[0,2,17,65,99],labels=['Toddler/Baby','Child','Adult','Elderly'])
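pd.cut returns a Series of categories; to keep the result, assign it to a column (AgeGroup is just an example name):
df['AgeGroup'] = pd.cut(df.Age, bins=[0,2,17,65,99], labels=['Toddler/Baby','Child','Adult','Elderly'])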
Merge two df based on index
pd.merge(df1, df2, left_index=True, right_index=True)
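A roughly equivalent index-on-index join, assuming the two frames have no overlapping column names (join defaults to a left join, hence how='inner' to match merge's default):
df1.join(df2, how='inner')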
Merge two df based on columns
pd.merge(student_df, staff_df, how='left', left_on='Name', right_on='Name')
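When the key column carries the same name in both frames, on= alone is enough:
pd.merge(student_df, staff_df, how='left', on='Name')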
Replace a specific string in values
The pattern here is a regex, so pass regex=True explicitly (the default changed in pandas 2.0)
df['Column2'] = df.Column2.str.replace('b,?', '', regex=True)
Replace a string with NaN
metadata_lat_lon_df = metadata_lat_lon_df.replace('nd', np.nan)
Here we observe full-NaN rows and full-NaN columns in the latest df. We drop them:
selected_samples.dropna(how='all', inplace=True)
selected_samples.dropna(how='all', axis=1, inplace=True)
Display all info of a df
data_train.info(verbose=True, show_counts=True)
Drop columns whose names match a regex
df = df[df.columns.drop(list(df.filter(regex='Test')))]
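An alternative using a boolean column mask, in the same spirit as the startswith/contains selections above:
df = df.loc[:, ~df.columns.str.contains('Test')]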
Drop columns according to a list
colsToDrop = ['BARCODE', 'PLATESET', 'WELL',
              'SUBSTANCE_NAME', 'Full_Species', 'Genus',
              'Sp_alone', 'Species', 'Famille']
df_merged_selected = df_merged.drop(colsToDrop, axis=1)
df_merged_selected
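If some of the listed columns may be missing from the frame, errors='ignore' drops whatever is there instead of raising a KeyError:
df_merged_selected = df_merged.drop(colsToDrop, axis=1, errors='ignore')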
If NaN, replace with the value from another column in the same row
https://stackoverflow.com/a/29177664
df.Temp_Rating.fillna(df.Farheit, inplace=True)
del df['Farheit']
df.columns = 'File heat Observations'.split()
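Note: with pandas' copy-on-write mode (opt-in in 2.x), the inplace fillna on an attribute-accessed column may no longer write back to the frame; a plain assignment avoids that (column names from the same example):
df['Temp_Rating'] = df['Temp_Rating'].fillna(df['Farheit'])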
Extract digits from a string
https://stackoverflow.com/a/37683738
df.A.str.extract(r'(\d+)')
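To get a Series of numbers instead of a one-column DataFrame of strings, expand=False plus a cast works (A_num is a hypothetical column name):
df['A_num'] = df.A.str.extract(r'(\d+)', expand=False).astype(float)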
Create a new column's values conditionally using np.select
https://stackoverflow.com/a/19913845
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)
Drop duplicates according to multiple columns
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
df.drop_duplicates(subset=['a', 'b'], keep='first', inplace=True, ignore_index=False)
Append DataFrames created in a for loop
https://stackoverflow.com/a/28670223
import glob
import pandas as pd

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    # store DataFrame in list
    appended_data.append(data)
# see pd.concat documentation for more info
appended_data = pd.concat(appended_data)
Write the DataFrame to an Excel sheet
appended_data.to_excel('appended.xlsx')
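To skip writing the row index to the file, pass index=False:
appended_data.to_excel('appended.xlsx', index=False)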