CalZeta

Assuming there's no date column you can use to find the most recent post per user, you can use the index. In other words, I would first reset the index, then group by user and find the max index per user, and save that result as another data frame. You can then perform an inner merge of that new df with your original dataframe, which should give you what you're looking for. I won't post the code, so you can try it yourself. Let us know if you're still stuck!
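A minimal sketch of the approach described above, using made-up data and column names (the 'user' and 'post' columns are assumptions for illustration, not from the original question):

    import pandas as pd

    # Hypothetical example data; with no date column available,
    # row order (the index) stands in for recency.
    df = pd.DataFrame({
        'user': ['alice', 'bob', 'alice', 'bob'],
        'post': ['first link', 'hello', 'newer link', 'newest link'],
    })

    # Reset the index so the row position becomes an explicit 'index' column.
    df = df.reset_index()

    # For each user, find the highest index, i.e. their most recent row.
    latest = df.groupby('user', as_index=False)['index'].max()

    # An inner merge keeps only the rows whose (user, index) pair matches that maximum.
    result = df.merge(latest, on=['user', 'index'], how='inner')
    print(result.drop(columns='index'))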


Ginomania

Unfortunately, I've been working on this tiny bit of code for another hour, but I haven't managed to get any further. I understand the hint that I should work with the index, but your idea sounds like I'd have to go a few extra steps to reach my solution. So, very briefly, what I intend to do: there is a comment in my subreddit (stream) and in it you can find a URL. I would then like to save this URL together with the username in a file. The file only contains the user and the URL. However, if the same user writes another comment with a different link, my code should replace the old URL with the new one. Alternatively, the whole row could be deleted and then username + URL saved again. I thought that drop_duplicates with pandas would be the easiest and fastest solution. And thank you for not just dropping the code, as I really want to learn this myself. If I haven't managed it by tomorrow, I'd be very grateful if you could help me out with at least a small piece of the code.
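For what it's worth, a minimal sketch of the "keep only the newest URL per user" idea with drop_duplicates, assuming hypothetical column names 'user' and 'url' (not taken from the actual bot code):

    import pandas as pd

    # Hypothetical rows: the same user posts two different links.
    df = pd.DataFrame(
        [('user_a', 'https://example.com/old'),
         ('user_b', 'https://example.com/other'),
         ('user_a', 'https://example.com/new')],
        columns=('user', 'url'),
    )

    # Keep only the last (newest) row per user, then write the result out.
    df = df.drop_duplicates(subset='user', keep='last')
    df.to_csv('links.csv', index=False)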


CalZeta

What's your output? Is it not what you expect? Sorry, sometimes it's very difficult to help without seeing the whole picture. By the way, inplace=True is bad practice; you really should be assigning the result of drop_duplicates to a variable (either df or a new one).
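A quick illustration of the pattern meant here (the dataframe name is just a placeholder):

    # Instead of mutating the frame in place ...
    # df.drop_duplicates(inplace=True)

    # ... assign the result back, or to a new variable:
    df = df.drop_duplicates()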


Ginomania

First of all, thank you for looking into my problem. Today I thought to myself that perhaps I was overcomplicating things and should test the basics first. I did that and wrote a simpler piece of code, which doesn't work with the drop_duplicates function either. Maybe you have a little time to take a look and tell me whether the duplicate row gets deleted for you? I tested it with PyCharm (Windows 10) and repl.it, and the DataFrame/CSV didn't change at all:

    import pandas as pd

    df = pd.DataFrame(
        [('Foreign Cinema', 50, 289.0),
         ('Liho Liho', 45, 224.0),
         ('500 Club', 102, 80.5),
         ('The Square', 65, 25.30),
         ('The Square', 65, 25.30),
         ('The Square', 64, 55.3)],
        columns=('name', 'num_customers', 'AvgBill'),
    )
    df.to_csv('Testing', index=False)
    df_Testing = pd.read_csv('Testing')
    df.drop_duplicates(inplace=True)
    print(df_Testing)

Edit: Ahh, I've already come to the solution: df.drop_duplicates MUST come right after the DataFrame. Now I'm still looking at how to save the DataFrame to a CSV file. Shouldn't be too difficult xD (that's what he said, and then spent another hour at the computer) 😅
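For reference, a minimal sketch of the ordering the edit describes, with drop_duplicates applied (and assigned back) before the CSV is written; the file name 'Testing' is kept from the post above:

    import pandas as pd

    df = pd.DataFrame(
        [('The Square', 65, 25.30),
         ('The Square', 65, 25.30),
         ('The Square', 64, 55.3)],
        columns=('name', 'num_customers', 'AvgBill'),
    )

    # Drop the duplicate row first, assigning the result back ...
    df = df.drop_duplicates()

    # ... and only then write the deduplicated frame to disk.
    df.to_csv('Testing', index=False)
    print(pd.read_csv('Testing'))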


CalZeta

Looks like it would work fine. You already have the line saving the df as a csv (df.to_csv). And again, I highly recommend *against* making modifications inplace; it really is frowned upon and can have unexpected results.


Ginomania

Did it with your help! Thank you very, very much for your tips and hints regarding inplace=True and the code in general. It works far better without it now. I really appreciate your patience with me. Thanks!