the problem
I have a set of dataframes where each row should have a unique ID value, but imported data sometimes has multiple rows with the same ID. I want a groupby operation (or something similar) that checks each ID group for special characters indicating "no data," checks that each column of the group has exactly one non-special value (which may be repeated across multiple rows), and then collapses each ID group into a single row. Each column of the collapsed row holds either the group's unique non-special value, or a special value in columns where no row in the group had data.
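For example, the full dataframe at the bottom of this post contains three rows with ID 'd':

  id src col1 col2 col3
   d   a    5    n    ?
   d   a    5    x    8
   d   a    x    x    x

These should collapse to the single row (d, a, 5, n, 8): each of src, col1, col2, and col3 has exactly one non-special value somewhere in the group.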
I want the code to throw an error if a group column contains multiple different non-special values. So if the ID column label is 'id' and the special characters are 'x?', then col2 below should raise an exception:
  id src col1 col2 col3
0  f   x    5    m    1
1  f   a    5    n    ?
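To make that failing case concrete, here it is as code (bad is just my name for the frame):

import pandas as pd

bad = pd.DataFrame({'id':   ['f', 'f'],
                    'src':  ['x', 'a'],
                    'col1': [5, 5],
                    'col2': ['m', 'n'],
                    'col3': [1, '?']})
# merge_rows(bad, specials='x?', id_column='id') should raise ValueError:
# group 'f' has two different non-special values in col2 ('m' and 'n')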
I've put together the following solution, but I suspect there's a much simpler method. My main question is whether I can do this with more native pandas code instead of the multiple helper functions I've used below.
my code
I'm currently accomplishing this with a function called merge_rows that uses three helper functions:

- get_nonspecial_values gets all non-special values in a column
- check_column_consistency checks whether a column has at most one non-special value
- get_inconsistent_ids gets a (hopefully empty) set of ID values whose rows conflict with each other, like ID value 'f' above
I think the last two lines in get_inconsistent_ids and merge_rows could probably be done with a built-in pandas operation, which would make a bunch of the code I've written unnecessary, but I can't figure out what that operation would be (I'm also not sure those are the lines to focus on when revising this code). There's a rough sketch of the kind of operation I have in mind at the end of this post.
import pandas as pd

def get_nonspecial_values(series, specials):
    '''
    takes a series representing a column and returns all
    values in the column that are not in specials
    '''
    counts = series.value_counts()
    return [v for v in counts.index if str(v) not in specials]
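# example: get_nonspecial_values(pd.Series([5, 'x', 5, '?']), 'x?') -> [5]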
def check_column_consistency(series, specials):
    '''
    takes a series representing a column and returns True
    if there is at most one value that is not in specials
    '''
    nonspecial = get_nonspecial_values(series, specials)
    if len(nonspecial) not in (1, 0):
        return False
    else:
        return True
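# example: check_column_consistency(pd.Series(['n', 'x', 'n']), 'x?') -> True
# example: check_column_consistency(pd.Series(['m', 'n']), 'x?') -> False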
def get_inconsistent_ids(df, specials, id_column):
    '''
    takes a dataframe, special values, and a column label
    returns all ID values that represent a set of rows in which
    at least one column has multiple different nonspecial values
    '''
    where_nonunique_ids = df[id_column].duplicated(keep=False)
    if not any(where_nonunique_ids):
        return []
    else:
        rows_with_nonunique_ids = df.loc[where_nonunique_ids]
        check_group = lambda x: all(x.apply(check_column_consistency, args=(specials,)))
        where_ids_consistent = rows_with_nonunique_ids.groupby(id_column).apply(check_group)
        return [v for v in where_ids_consistent.index if not where_ids_consistent[v]]
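# example: for the two 'f' rows shown earlier,
# get_inconsistent_ids(bad, 'x?', 'id') -> ['f'], since col2 holds both 'm' and 'n'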
def merge_rows(df, specials='', id_column=''):
    '''
    performs the merge operation described above
    '''
    inconsistent_ids = get_inconsistent_ids(df, specials=specials, id_column=id_column)
    if len(inconsistent_ids) > 0:
        raise ValueError('found the following ID values with inconsistent rows', inconsistent_ids)
    rows_with_nonunique_ids = df.loc[df[id_column].duplicated(keep=False)]
    unique_column_value = lambda x: (get_nonspecial_values(x, specials) or ['x'])[0]
    get_group_values = lambda x: x.apply(unique_column_value)
    values_by_id = rows_with_nonunique_ids.groupby(id_column).apply(get_group_values)
    return pd.DataFrame(data=values_by_id.values, columns=df.columns)
df = pd.DataFrame({'id':   ['a', 'b', 'c', 'c', 'd', 'd', 'd'],
                   'src':  ['x', 'a', 'a', 'a', 'a', 'a', 'a'],
                   'col1': [5, 5, '?', '?', 5, 5, 'x'],
                   'col2': ['x', 'n', 'n', 'x', 'n', 'x', 'x'],
                   'col3': [1, 3, '?', 9, '?', 8, 'x']})
print(merge_rows(df, specials='x?', id_column='id'))
The code above gives me the following output:

  id src col1 col2 col3
0  c   a    x    n    9
1  d   a    5    n    8
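For reference, the kind of built-in route I imagine would involve masking the special characters to NaN and letting groupby handle the rest. Below is a rough, untested sketch of that idea; merge_rows_native is just a placeholder name, and it doesn't behave identically to merge_rows (it also keeps rows whose IDs were already unique, and it returns NaN rather than a special character in columns where a group had no data).

import pandas as pd

def merge_rows_native(df, specials='x?', id_column='id'):
    # sketch: treat the special characters as missing data
    masked = df.mask(df.astype(str).isin(list(specials)))
    grouped = masked.groupby(df[id_column])
    # a group column is consistent if it has at most one distinct non-NaN value
    if (grouped.nunique() > 1).any().any():
        raise ValueError('found ID values with inconsistent rows')
    # first() takes the first non-NaN value per column in each group
    return grouped.first().reset_index(drop=True)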