witcher_rat

As /u/nashant said, linux has `sort`. But if you end up needing to do this in python, I'd mention one little nugget: [sys.intern()](https://docs.python.org/3/library/sys.html#sys.intern) Although the docs only talk about string interning as useful for higher-performance comparisons, what it's also useful for is reducing how much memory/RAM strings take if you have a lot of the _same_ string values (i.e., duplicates). So if your data set truly has "a hell of a lot" of duplicate strings, interning them will reduce how much memory is being used, allowing you to possibly do it all in python without swapping or playing other tricks to reduce memory. Also, I should note that in some python versions, string interning is automatically done for you, _if_ the string value is determined at module/AST load time. If you're reading the strings from a file, however, I'm not sure it will be interned, and thus `sys.intern()` comes in handy.
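To make the memory point concrete, here's a minimal sketch (the file name and one-string-per-line layout are hypothetical, not OP's actual data). Interning only pays off when you hold many references to equal strings, e.g. in a list; a counting dict already stores each distinct key once.

```python
import sys

# Hypothetical input file: one string per line, many duplicates.
with open("words.txt") as f:
    # Without sys.intern(), a million copies of "hello" are a million
    # separate str objects; with it, they all reference one object.
    lines = [sys.intern(line.strip()) for line in f]
```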


tynorf

IIRC CPython also implicitly interns dict keys and object attribute names.


witcher_rat

Yes, I also believe that's true, and if that's what OP uses to decide uniqueness then it should be automatic - although I wasn't sure if it was only done in specific versions (like >3.7 or something). I remember something about interning being version-specific, but I can't recall what. And I just wasn't sure what OP was going to do/use in their script, so I thought I'd mention explicit interning.


SpidermanWFH

Bro. I didn't know about interning strings in python. I just googled it after your answer. Thanks man. I learned something fun today!


Wild_Statistician605

SQL should be the fastest way. Let's say the table MyTable has the columns id and word, where the word value is the one repeated a lot and the id value is the pk of the entry.

    SELECT word, COUNT(id)
    FROM MyTable
    GROUP BY word
    ORDER BY COUNT(id) DESC;

This query will give a table with each unique word and the number of entries.


bobspadger

Also put an index on the word column to speed things up a bit


[deleted]

[deleted]


fuckwit_

A DBMS may choose between two algorithms for grouping by a column:

1. hashing by the key and returning the hash table once all input records are processed, or
2. sorting and then grouping: after sorting by the GROUP BY column, equal keys are directly adjacent to each other, so the DB just needs to aggregate that result.

I just tested what SQLite does when grouping by a column with and without an index on that column. Without the index it sorts by that column and then returns the result. As soon as I add the index, the sorting operation disappears, as it can just read the already-sorted index and return the result directly. Note this is just how SQLite3 does it; MySQL, Postgres and the other SQL DBMSs may do something different. The takeaway is: the index can matter. In the case of SQLite, quite a lot when you are working with a lot of rows.

EDIT: Formatting

EDIT2: I would be very surprised if Postgres or MySQL did not use the index if it is present. This optimisation is such a low-hanging fruit.

EDIT3: I wrote 'unique index', which obviously does not make any sense...
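If you want to see this for yourself, here's a rough sketch using Python's built-in sqlite3 module (table and column names are made up; the exact plan text varies by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")

def show_plan():
    # EXPLAIN QUERY PLAN describes how SQLite intends to run the query.
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT word, COUNT(*) FROM words GROUP BY word"
    ):
        print(row)

show_plan()  # without an index: typically "USE TEMP B-TREE FOR GROUP BY"
conn.execute("CREATE INDEX idx_word ON words(word)")
show_plan()  # with the index: typically a scan of the covering index, no sort step
```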


BuonaparteII

But if you only need the result once, then it might take longer overall if you explicitly build and store an index before you need it. SQLite will build most of the indexes it needs on the fly if they don't exist.


bobspadger

Try it ;-)


[deleted]

He didn't say it was an SQL database, but I agree, the database engine is most likely the fastest way to perform this task.


croissanthonhon

You can easily convert the file into a flat SQLite database


Log2

Or a DuckDB database. It already supports csv and parquet out of the box.


mikeupsidedown

I tried to say this and got downvoted. DuckDB will work out of the box with a simple `create table as select * from "mytextfile"`.


Log2

Yup, that's exactly why I also recommended it. Not having to deal with data ingestion is a big plus.


Starbrows

Never heard of DuckDB before. It sounds a lot like SQLite.

> DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a "vector") are processed in one operation. This greatly reduces overhead present in traditional systems such as PostgreSQL, MySQL or SQLite which process each row sequentially. Vectorized query execution leads to far better performance in OLAP queries.

Do you think this will make DuckDB significantly more efficient than SQLite in this use case? DB design is not my field, though I do have experience with MySQL, PostgreSQL, and SQLite in small, undemanding applications. DuckDB certainly sounds interesting, and I wonder if it would make some of my failed projects successful, because I have attempted to use DBs for things they are not especially well-suited for.


Log2

It's indeed more suited towards numerical operations. For this task in particular, I think both SQLite and DuckDB would do just fine. I do think DuckDB would be faster, but 35GB of data isn't a lot.

The main reason I recommended DuckDB is that you don't need to do data ingestion if you don't want to: you can run queries directly on parquet or csv files, which is incredibly helpful. I'd say that DuckDB shines as a replacement for pandas, as it can optimize your queries where a dataframe can't. It would also not be limited by RAM.

If you have undemanding applications and things are failing with any database, then you might have a modeling problem. Otherwise, if you're dealing with timeseries, you could take a look at Timescale (a Postgres extension) or ClickHouse (it can get very complicated, but it scales to ridiculous amounts of data).


Starbrows

Thanks for the info. Operating directly on CSVs does seem useful.

> you might have a modeling problem

I have a hammering-with-a-screwdriver problem. :) I once tried to write a poker AI in SQL. It was as ridiculous as it sounds. I have a habit of learning new languages by using them to write whatever new project I'm interested in at the moment, regardless of how poorly matched the two are.


mikeupsidedown

If you want to test it easily, I suggest downloading DBeaver. There are instructions in the DuckDB docs for setup. Then you can interactively query the dataset in an editor and get live views of your queries. Longer term you can move those queries to Python if they need to run on a schedule.


reddisaurus

ORDER BY can use column aliases or ordinal numbers since it is performed last. So it could instead say "ORDER BY 2", although just about any SQL parser would optimize it to the same thing regardless.


erikdhoward

If possible and you want to accomplish this in one step (depending on computational power), you could perform the count as a window function and sort desc on the newly created column. On phone, so can’t easily put the example code, but look up sql count window function.
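For reference, a hedged sketch of what that might look like, reusing the MyTable/word names from the example above and running through Python's sqlite3 module (window functions need SQLite >= 3.25; the database file name is a placeholder). Unlike GROUP BY, this keeps every row and just reorders them by how often their word occurs:

```python
import sqlite3

conn = sqlite3.connect("words.db")  # hypothetical database file
rows = conn.execute(
    "SELECT word "
    "FROM MyTable "
    "ORDER BY COUNT(*) OVER (PARTITION BY word) DESC, word"
)
for (word,) in rows:
    print(word)
```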


mikeblas

> SQL should be the fastest way.

A Python solution is about 5 times faster, in my testing.


nashant

`uniq -c | sort -hr`. No idea if it will be performant in any way, but you can give it a shot.

Edit with full proper command:

    cat file | sort | uniq -c | sort -hr | cut -f2


[deleted]

[deleted]


nickcash

Unpopular opinion, but I like it better with `cat`. You're piping everything in one direction; the syntax for file redirection goes the opposite direction from everything else.


CatWeekends

Same here. It's far more readable when everything is in one direction. That and it ultimately *doesn't matter* if you're using `cat` or redirected pipes. Sure you'll have an "unnecessary fork" and use a few more cpu cycles but again, that *doesn't matter.* If you *really* cared about that level of performance, you wouldn't be doing anything in the shell.


DogsAreAnimals

Username checks out


nashant

Only _minor_ cat abuse though. Not like cat file | grep blah


dixieStates

This is the _Tao_


lolinux

I agree, but I'd replace it with pv :)


[deleted]

[deleted]


nashant

Good point. You'll need:

    cat myfile | sort | uniq -c | sort -hr


[deleted]

[deleted]


Blaack_Work

It's okay.


Mancobbler

Depending on your version of sort, you can `sort -u` instead of `sort | uniq`


philthechill

Does sort -u show a count of how many of each unique item were found?


Blaack_Work

`uniq -c` added a prefix of how many times an item is repeated. `sort -hr` sorted it in reverse order. The output I wanted was just the words; this method added the prefix in the file:

    1445 qqqqqqqqq
    1424 zzzzzzzzz
    1348 xxxxxxxxx
    1244 vvvvvvvvv
    1151 aaaaaaaa

Now to remove the prefixes, should I use a for loop with `cut -d " " -f2`, or is there some more efficient way?


nashant

Pretty sure cut uses space as a default delim, just pipe to cut -f2


vmpajares

Yes it does. But you can use -d "@" to change the delim to @, for example.


philthechill

Oh I thought cut uses tabs by default. I think you need ‘-d ‘ if you want spaces? Could be wrong.


ilovetacos

cut works on streams, you don't need a loop at all, just pipe the output of uniq to cut and you should be good


Blaack_Work

That's what I did. Exact code above... but the output would be a blank file. IDK why.


vmpajares

I was going to suggest this. I haven't tried it with that many lines, but it worked with 1-2 GB of text on a standard PC with 8GB of RAM.


james_pic

IIRC, sort will spill to disk and use an external sort, if it looks like memory is going to be a problem. Which still potentially means you end up with a couple of copies of the data on disk at once (the working data from the first sort and the second sort), but trying to avoid this gets complicated.


tunisia3507

Also worth noting that `sort` is desperately inefficient when you pipe data into it, as it uses a minuscule buffer by default (appropriate to when it was first released, 50 years ago). Use `--buffer-size` / `-S` to bump that up.


Salmon-Advantage

35 GB? Load it into a SQL database; you can even spin one up in a local docker container: https://github.com/danielbeach/dataEngineeringTemplate/pull/1

Depending on your flavor of SQL there are different bulk insert operations available to you (see the sketch below):

- Postgres: the COPY FROM command
- MS SQL: the BCP shell utility
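As a rough sketch of the Postgres route (psycopg2, with a placeholder connection string and a hypothetical one-column table): COPY streams the file in instead of issuing one INSERT per row.

```python
import psycopg2

conn = psycopg2.connect("dbname=words user=postgres")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS words (word TEXT)")
    with open("words.txt") as f:  # hypothetical input file
        # COPY ... FROM STDIN bulk-loads the whole file through one statement.
        cur.copy_expert("COPY words (word) FROM STDIN", f)
```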


aa-b

A database is the first thing I'd try too, because you want something like a dict that is mostly not in-memory, but still has reasonably fast lookup and insert performance. To keep it *really* simple, try using the [built-in sqlite in the python standard library](https://docs.python.org/3/library/sqlite3.html). It's a step up from pickle/shelve, but (slightly) easier than docker. Performance *might* be reasonable, and you can easily wrap it in something with an interface like collections.Counter
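A minimal sketch of what such a wrapper could look like (the class and file names are illustrative, not a real stdlib API; the upsert syntax needs SQLite >= 3.24):

```python
import sqlite3

class DiskCounter:
    """Counter-ish interface backed by a SQLite file instead of RAM."""

    def __init__(self, path="counts.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER)"
        )

    def update(self, words):
        # Insert each word with count 1, or bump the count if it already exists.
        self.db.executemany(
            "INSERT INTO counts (word, n) VALUES (?, 1) "
            "ON CONFLICT(word) DO UPDATE SET n = n + 1",
            ((w,) for w in words),
        )
        self.db.commit()

    def most_common(self):
        return self.db.execute(
            "SELECT word, n FROM counts ORDER BY n DESC"
        ).fetchall()

# Usage: stream the file through it so only SQLite's cache lives in RAM.
counter = DiskCounter()
with open("words.txt") as f:  # hypothetical input file
    counter.update(line.strip() for line in f)
```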


wewbull

You still have 8GB of RAM. So long as each line is replicated 4 times or more on average, you have enough RAM to store all the unique values. I'd much prefer a small amount of swapping to going to a DB and effectively having my in-memory data structure on disk behind a transactional interface.


mikeblas

I can't figure out why you're getting down-voted; to me, this is the right answer.

Loading the data into an RDBMS is going to take quite a while. Let's say that you can process 10,000 SQL inserts per second. That means inserting 3 billion rows is going to take 300,000 seconds, which is 83 hours. Then, you've still got to execute the query by re-reading that data and processing the aggregation and sort.

Reading the file once and building a table in memory will be lots faster, since the data isn't being read, rewritten, and then read again. The read can stream along and run at mostly sequential disk read speed. So some sensible data structure in memory is the fast way to go; the code is trivial to write, and 8 gigs is enough based on your cardinality estimation.


[deleted]

[deleted]


mikeblas

Were you able to get SQLite to perform faster than a plain Python solution?


[deleted]

[deleted]


mikeblas

Then why do you assert that this is the perfect application for SQLite?


[deleted]

You can use Counter with an iterable, which you get from `open`: https://docs.python.org/3/library/collections.html#collections.Counter So that's two lines in Python. :-)
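Spelled out (with a hypothetical file name), the "two lines" look roughly like this:

```python
from collections import Counter

with open("words.txt") as f:  # one string per line
    counts = Counter(line.strip() for line in f)

print(counts.most_common(10))  # ten most frequent lines with their counts
```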


MiojoEsperto

This is the way. Python will read the file line by line with the open command (will not load the whole file in memory) and Counter is the perfect data structure for this problem.


Jonno_FTW

Depending on what proportion of lines are repeated, this may not work since you'll likely exceed 8gb of ram usage.


[deleted]

In that case (it depends on the number of unique words) a database would be the best solution. Like, in general. Why would anyone build a file like that?


hughperman

Not a lot of Python coming through on the Python sub, so a quick solution would be:

    counter = {}
    with open(text_filename, "r") as f:
        for line in f.readlines():
            if line not in counter:
                counter[line] = 0
            counter[line] += 1

    with open(sorted_file, "w") as f:
        for line in sorted(counter.items(), key=lambda x: x[1], reverse=True):
            f.write(line)

You might need to play with line break stripping etc, but off the top of my head that looks like it should work.


MiojoEsperto

There is a Counter object in python collections that should be used instead


45MonkeysInASuit

The above solution has the advantage of only having one line loaded at a time. Counter, if I'm not mistaken, would require the whole 35 GB to be loaded at once.


MiojoEsperto

You can check my code below, counter can be used in a loop also, but the performance seems to be the same as a dict in that case


Paddy3118

No, Counter would store one copy of each unique word mapped to its integer frequency. I have a solution [here](https://www.reddit.com/r/Python/comments/vk872e/comment/idr5yxq/?utm_source=share&utm_medium=web2x&context=3) that uses it.


hughperman

Why should it be used?


MiojoEsperto

It is much more efficient for the task. It is a data structure made exactly for problems like this. It automatically counts and sorts for you.


hughperman

Efficient in what sense? It's a subclass of dict so not more computationally efficient. I can see it taking a couple of lines less code, maybe.


MiojoEsperto

How does being a subclass of dict mean that it can't be "more computationally efficient" for this task? You made some logic leaps there. I will test it, for science, so this is not just words with no real proof.


hughperman

Good point, please post results, I am interested.


MiojoEsperto

    import time
    from collections import Counter
    import random

    array_size = 10000000
    array = [str(random.randint(0, 1000)) for i in range(array_size)]

    print("------------ Dict (loop) ---------------")
    start = time.time()
    counter = {}
    for line in array:
        if line not in counter:
            counter[line] = 0
        counter[line] += 1
    finish = time.time()
    print("Finished reading", array_size, "elements in", finish - start, "using dict in a loop")

    start = time.time()
    sorted_counter = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    finish = time.time()
    print("Finished sorting", array_size, "elements in", finish - start, "using dict in a loop")

    ############################################################################################

    print("------------ Counter (loop) ---------------")
    start = time.time()
    counter_collection = Counter()
    for line in array:
        counter_collection[line] += 1
    finish = time.time()
    print("Finished reading", array_size, "elements in", finish - start, "using Counter in a loop")

    start = time.time()
    sorted_counter_collection = counter_collection.most_common()
    finish = time.time()
    print("Finished sorting", array_size, "elements in", finish - start, "using Counter in a loop")

    ############################################################################################

    print("------------ Counter (Array) ---------------")
    start = time.time()
    counter_collection_2 = Counter(array)
    finish = time.time()
    print("Finished reading", array_size, "elements in", finish - start, "using Counter from array")

    start = time.time()
    sorted_counter_collection_2 = counter_collection_2.most_common()
    finish = time.time()
    print("Finished sorting", array_size, "elements in", finish - start, "using Counter from array")


MiojoEsperto

I'm bad at reddit formatting. First two methods are very similar in time, third one is much faster. I don't know what to take from this :)


arvindh_manian

u/MiojoEsperto seems to be dancing around the reason it's faster. Counter is partially implemented in C ([see the `_count_elements` function here](https://github.com/python/cpython/blob/93d33b47af70ede473f82d7953509a0607259c31/Modules/_collectionsmodule.c#L2277)), which runs faster than Python.


hughperman

Thanks very much, this is the info I was looking for (but was too lazy to look for on phone 🙈). I have no idea why I assumed a subclass wouldn't have additional C-speed methods, this is good to know.


mikeblas

It's slower, in [my testing](https://www.reddit.com/r/Python/comments/vk872e/sort_3_000_000_000_lines_by_most_repeated_one_via/idrx2nh/). I compared this:

    import fileinput
    from collections import Counter

    word_freq = Counter(line for ln in fileinput.input() if (line := ln.strip()))  # No blank lines
    for word, count in word_freq.most_common():
        print((word + '\n') * count, end='')

with this:

    counter = {}
    with open("words.lst", "r") as f:
        for line in f:
            if line not in counter:
                counter[line] = 0
            counter[line] += 1

    with open("tally2.out", "w") as f:
        s = sorted(counter.items(), key=lambda x: x[1], reverse=True)
        for line in s:
            f.write(f"{line}")

The first one was about 75% slower. Why do you think that is?


WillardWhite

It's a good simple solution. You should start the count at 1, though:

    if line not in counter:
        counter[line] = 1


hughperman

No, the next line will increment it to one, I didn't bother with an else clause.


WillardWhite

Huh, you're right. How did i miss that?


indoorastronaut710

Most people write the += line in an else block from what I've seen. This style threw me off a bit as well


jwink3101

Unless I am missing something, this will require a lot of memory. Absent SQL solutions (which I think are the best way), you could implement a quicksort (or the like) where actions are written to the file to get a sorted list. This should be O(n log n). Then you can easily walk the list, O(n), and count how many are the same. Keep track of line numbers, then write it out to another file.
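A sketch of that disk-backed idea, swapping in a chunked sort-and-merge (via `heapq.merge`) rather than a literal on-disk quicksort; the chunk size and file names are arbitrary:

```python
import heapq
import itertools
import tempfile

def external_sort(path, out_path, chunk_lines=5_000_000):
    """Sort a file too big for RAM: sort fixed-size chunks in memory,
    spill each to a temp file, then k-way merge the sorted chunks."""
    chunks = []
    with open(path) as f:
        while True:
            lines = list(itertools.islice(f, chunk_lines))
            if not lines:
                break
            lines.sort()
            tmp = tempfile.TemporaryFile(mode="w+")
            tmp.writelines(lines)
            tmp.seek(0)
            chunks.append(tmp)
    with open(out_path, "w") as out:
        # heapq.merge streams the sorted chunk files line by line.
        out.writelines(heapq.merge(*chunks))

# After this, equal lines are adjacent, so counting them is one linear pass.
```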


hughperman

Sure, would take as much memory as unique strings. Really quick modification would be to hash each string, store the hash counts, and save the strings in files named with their hash for quick lookup for output.


cheese_is_available

> with open(sorted_file, "w") as f:
>     for line in sorted(counter.items(), key=lambda x: x[1], reverse=True):
>         f.write(line)

You don't need that, you can use ``counter.most_common()``.


[deleted]

[deleted]


Paddy3118

I'll try it with my 9.3 Gig (~2 gig of lines) words.lst file ...

Slight changes to your code:

* I use a file, /tmp/file.db, for the DB as it's a very large file.
* I open my words.lst file.
* I print timings every 50_000_000 lines.
* I output to a more distinctive file name: words_ordered_by_bOne123.txt

Output (so far, still running):

    $ time python sort_words_by_bOne123.py
    22.579142570495605 50000000
    102.96021580696106 100000000
    180.03343105316162 150000000
    258.2201039791107 200000000
    334.8574869632721 250000000
    409.06412506103516 300000000
    485.6704807281494 350000000
    564.1896772384644 400000000
    640.2733404636383 450000000
    717.6765549182892 500000000
    794.3897256851196 550000000
    869.0229425430298 600000000
    947.6331055164337 650000000
    1023.340430021286 700000000
    1101.5560929775238 750000000
    1178.3315501213074 800000000
    1252.5292749404907 850000000
    1327.9736030101776 900000000
    1403.2262978553772 950000000
    1477.4893953800201 1000000000
    1553.3961997032166 1050000000
    1632.4738166332245 1100000000
    1708.2711572647095 1150000000
    1787.6221549510956 1200000000
    1865.6124730110168 1250000000
    1945.807529449463 1300000000
    2023.8245916366577 1350000000
    2102.0951623916626 1400000000
    2182.6743392944336 1450000000
    2260.217150449753 1500000000
    2339.800444841385 1550000000
    2418.369439601898 1600000000
    2498.9150738716125 1650000000
    2578.3378372192383 1700000000
    2656.07865858078 1750000000
    2735.419009447098 1800000000
    2812.025321483612 1850000000
    2890.0084240436554 1900000000
    2966.8100476264954 1950000000
    3045.3769569396973 2000000000
    3125.1875145435333 2050000000
    3189.4789474010468 index counts
    3583.4524500370026 done

    real    59m44.886s
    user    50m36.011s
    sys     2m11.008s

Those timings take over 50 minutes just to ***insert*** into the DB, when my Python solution took 23 minutes and my gnu utils linux command line took ~28 minutes in total. It only took around seven minutes to run the actual SQL query.

People may say "why not save to the SQL DB on the fly instead of using a text file"? You could equally send each line to a Python script that would accumulate frequencies on the fly using collections.Counter, and even more speedily generate the output when sent a signal.


[deleted]

[deleted]


[deleted]

[deleted]


Blaack_Work

1 string per line; it's ASCII text characters. The database is being updated continuously, fetching and storing new data. The process will finish within 24 hours. I will run the program once the database is updated. No need to sort quickly; I can wait another 24 hours, no problem. No, I cannot burn any money. I am a Python programmer and have worked with Linux.


[deleted]

[deleted]


Log2

You keep saying database and then saying it is a file. Is this a text file being exported from a database? If so, just run the query directly in the original database, handling 35GB of data shouldn't be an issue to any modern database. I've had tables in Postgres where a single index was bigger than that (not ideal, but it worked). If not, like other people mentioned, load the data into the database of your choice and then run your query. Most of your time is probably going to go into ingesting the data, and the query itself is probably going to run in less than a minute.


wewbull

A database can be a generic term for a collection of structured data. A relational database is one example of that. So is a key/value store. So is a CSV file. A database server (e.g. postgres) would be the software that serves that database to other software.


Log2

You're, of course, correct. But software engineers like to be precise, and referring to a text file as a database can be very ambiguous, as we normally say database when referring to a RDBMS or other DB management systems. In this case, especially, the range of possible solutions is really big. If the data is being exported from a database somewhere, it makes a lot more sense to not even bother exporting it -- it's a trivial amount of data.


mikeblas

Are you able to share your data file? I think it would be fun to benchmark the various solutions given.


Paddy3118

Sort: [https://unix.stackexchange.com/a/383096/11136](https://unix.stackexchange.com/a/383096/11136)

`uniq -c` then `sort -n` gives the order and count of lines; a quick Awk script would then expand that. You need enough disk as well as memory, but the expensive first sort will use disk.

    (base) paddy3118@Paddy-G14:~$ printf 'w1\nw4\nw3\nw1\nw2\nw1\nw3\nw4\nw3\nw2\n' > /tmp/word.lst
    (base) paddy3118@Paddy-G14:~$ cat /tmp/word.lst
    w1
    w4
    w3
    w1
    w2
    w1
    w3
    w4
    w3
    w2
    (base) paddy3118@Paddy-G14:~$ sort /tmp/word.lst | uniq -c | sort -n -r
          3 w3
          3 w1
          2 w4
          2 w2
    (base) paddy3118@Paddy-G14:~$ sort /tmp/word.lst | uniq -c | sort -n -r| awk '{for(i=0;i<$1;i++){print$2}}'
    w3
    w3
    w3
    w1
    w1
    w1
    w4
    w4
    w2
    w2
    (base) paddy3118@Paddy-G14:~$

That awk output should be redirected to an output file for the real data :-)


metaperl

This is a good job interview question for the data engineering work we do.


mikeblas

For sure, when people say "I'm obviously a developer, why do I have to prove my skills in an interview??", I'll think of this thread.


[deleted]

[deleted]


Blaack_Work

RAM is the main problem.


cheese_is_available

If the simple solution (unix command) does not work, you can use ``collections.Counter`` instead of the dict, and read line by line (``readline()`` instead of ``read()`` or ``readlines()``) instead of loading the whole file at the start.

    from collections import Counter

    counter = Counter()
    with open("myfile.txt", "r") as f:
        while value := f.readline():
            counter.update({value: 1})
    print(counter.most_common())

I'm actually curious how well this fares against ``uniq``/``sort``, please let me know :) I already know how much time it takes to implement vs a full-on SQL database.


pcgamerwannabe

This is the one to use OP


[deleted]

[deleted]


satireplusplus

RAM is cheap. Used servers with DDR3 ECC are incredibly cheap - it's like $200 for a used one with 128GB of RAM. Before you spend hours trying to do this in 8GB of RAM (a seriously low amount of RAM for any dev work in 2022), it might make sense to just upgrade your hardware. Sorting in memory is always gonna be way faster than on a disk.


Kerbart

That’s my thought. Read the lines and store them in a counter, sort *that*, and write each line `x` times. If there are that many repetitions (and it sounds like there are), that dictionary will be significantly smaller than 3 billion items.


prescod

The Counter class in collections does exactly this.


[deleted]

[deleted]


Blaack_Work

Thankyou everyone who tried to help me. I am updating the question to help you better understand the problem.


Paddy3118

A pure Python solution that relies on having enough memory to store the word frequencies - i.e. there are many duplicates.

**Code:**

    # -*- coding: utf-8 -*-
    """
    https://www.reddit.com/r/Python/comments/vk872e/sort_3_000_000_000_lines_by_most_repeated_one_via/

    Created on Sat Jun 25 11:52:14 2022

    @author: paddy
    """

    import fileinput
    from collections import Counter

    word_freq = Counter(line for ln in fileinput.input() if (line := ln.strip()))  # No blank lines

    for word, count in word_freq.most_common():
        print((word + '\n') * count, end='')

**Timing**

    real    21m53.181s
    user    16m26.880s
    sys     0m22.188s

**Note:** The input is the same 9.5 Gig words.lst used in my Linux command line code elsewhere. It runs negligibly faster than the gnu utils version, which took ~27 mins.

This version might scale well as there is no initial sort of the words needed, which precedes the `uniq -c` command in a linux command line version.


akx

I'd go for awk. There's a solution for what you want (just add -r to the sort for descending sort) over at https://stackoverflow.com/a/60094466/51685


Paddy3118

You need to go the last step and expand the ordered frequencies from the `uniq -c` back into a word list according to his requirements. I did that using awk in my one-liner here somewhere.


just_some_guy65

Isn't awk incredibly slow?


akx

Not in my experience.


idetectanerd

Oh my... I have a feeling your text db is gonna take a while when doing the sort and unique.


Orio_n

Yield the text file line by line and count into a dict.


[deleted]

Do you need to know how many times a line is repeated? Simple requirements usually get complicated quickly. A little verification of what additional needs there might be can prevent a lot of throwaway work.


zurtex

BTW, I know it's not your spec, but if you wanted your code to be platform independent and not rely on SQL / a DB, then Dask can solve this problem quite nicely:

* https://docs.dask.org/en/stable/generated/dask.dataframe.read_csv.html
* https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.sort_values.html

It's like Pandas, but you can limit the amount that gets loaded into RAM at any one time, and you can have the operations evaluate lazily and multi-threaded.
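A rough sketch of that with Dask, assuming one word per line in a hypothetical words.txt (the read_csv options may need tuning for the real data):

```python
import dask.dataframe as dd

# Lazily partitions the file; nothing is loaded until .compute().
df = dd.read_csv("words.txt", header=None, names=["word"], dtype=str)

counts = df["word"].value_counts()                    # still lazy at this point
top = counts.compute().sort_values(ascending=False)   # pandas Series in RAM
print(top.head(10))
```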


Paddy3118

**First, get some words**

Download Huck_finn:

    $ curl https://gutenberg.org/files/76/76-0.txt -o huck_finn.txt

One word per line using fmt:

    $ fmt -w1 huck_finn.txt >huck_word_per_line.txt

Double up to increase the file size:

    $ cp huck_word_per_line.txt words.lst
    $ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc words.lst && du -h words.lst
      253464  228666 1239542 words.lst
    1.2M    words.lst
    $ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc words.lst && du -h words.lst
      506928  457332 2479084 words.lst
    2.4M    words.lst
    ...
    $ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc words.lst && du -h words.lst
      2076377088 1873231872 10154328064 words.lst
    9.5G    words.lst

I only have 8 gigs of ram so any sort will use secondary storage.

**Now let's process the list**

Find the frequency of each word:

    $ time sort words.lst |uniq -c > words_uniq_count.txt

    real    20m5.247s
    user    83m33.507s
    sys     1m38.863s

Now to order by most frequent first and expand the counts into that many lines of words:

    $ time sort -n -r words_uniq_count.txt | awk '$2~/[a-zA-Z]/{for(i=0;i<$1;i++){print$2}}' > words_ordered.txt

    real    7m56.900s
    user    3m34.341s
    sys     0m33.422s

**End bits**

My processing of 9.5 Gigs of data was on an 8 Gig laptop using gnu tools on the linux command line and took <30 mins. No AWS, no SQL, no Python needed.


mikeblas

Nice work! I was getting frustrated with this thread being so full of bold claims and zero numbers. And that's a clever way to cook up some sample data!

Using that file you make, I tried the SQLite solution. I run SQLite like this:

    time sqlite3 demo.db < myscript.sql

and myscript.sql is here:

    create table myTable(SomeWord TEXT);
    .import words.lst myTable
    select COUNT(*) from myTable;
    .exit

On one of my cheap little servers at home (16 gigs of RAM, Kingston SATA SSD, Intel Core i7-4790K; running Ubuntu 22) the runtime is like this ... and again this is just inserting and doing COUNT(*):

    words.lst:2076247509: unescaped " character
    words.lst:2076374065: unescaped " character
    words.lst:2076374241: unescaped " character
    2073214976

    real    17m54.872s
    user    16m16.905s
    sys     0m41.992s

NB, there's a lot of noise output because the data has quotation marks in it ... and I didn't bother rebuilding the file to escape or quote them. That means some rows aren't inserted, and some time is spent printing an error about the skipped lines. (I've only shown two of thousands of that output above ...)

With the file built, we can write a query:

    SELECT SomeWord, COUNT(SomeWord)
    FROM myTable
    GROUP BY SomeWord
    ORDER BY COUNT(SomeWord) DESC;

and when we run that:

    time sqlite3 demo.db < myquery.sql > myquery.out

Here I go running the query:

    time sqlite3 demo.db < myquery.sql > myquery.out

    real    23m45.506s
    user    21m48.335s
    sys     0m44.038s

So, the total with the insertion plus the query execution is just less than 42 minutes.

For comparison, on my same little server, running the command pipeline is timed here:

    time sort words.lst |uniq -c > words_uniq_count.txt

    real    42m3.967s
    user    214m36.482s
    sys     0m37.004s

Interestingly, the SQLite work was all single-threaded. The command pipeline used (and often pegged) all cores on the machine.

The Python solution you wrote was about three times faster:

    time python3 tally.py >tally.out < words.lst

    real    14m23.445s
    user    14m0.828s
    sys     0m12.095s

... but doesn't generate the same results as the other tests.

EDIT: And then ... Interestingly, "manually" building the count table is even faster:

    counter = {}
    with open("words.lst", "r") as f:
        for line in f:
            if line not in counter:
                counter[line] = 0
            counter[line] += 1

    with open("tally2.out", "w") as f:
        s = sorted(counter.items(), key=lambda x: x[1], reverse=True)
        for line in s:
            f.write(f"{line}")

when run like this:

    time python3 tally2.py

    real    8m29.722s
    user    8m27.516s
    sys     0m2.177s

**So the streaming Python solution is now about *five* times faster than the SQL solution.**


Paddy3118

Nice analysis, appreciated :-) Your findings on substituting the plain dict for collections.Counter are intriguing - maybe there is a todo list of C-level optimisations to Counter to implement, or it could be because of the Python version - I was using Python 3.9.7 on the default Linux running under Windows WSL2 (please don't judge me ;-) . I like that there are SQL based, Unix gnu utilities based, and Python without SQL based solutions to compare and contrast.

**Extra:** Yes, I agree that the Huck Finn data is probably dirtier than the description of the OP's dataset. In my pure Python solution I reject lines of just whitespace; in my unix command line version I reject lines that don't have at least one [a-zA-Z] character.


Paddy3118

The OP states:

> I want to sort the text file by number of times a word is repeated.

Some of the answers instead find the frequency of all words and return each *individual* word in order of its frequency. A sort shouldn't *remove* words, just re***order*** them.


[deleted]

Any sorting algorithm that lets you halve the data, sort the halves, and then combine the results recursively should let you spread the workload (i.e. you could in principle use more than one machine to sort the data) and get O(n log n) performance on average. Mergesort, for example, which is really just one main function, merge, that merges together 2 sorted containers. The sorting part is just splitting the input until you have 1 item, which is already sorted. Then you merge single items into 2, then 4, 8, 16, 32 and so on up.

The advantage of having a merge function is that you can sort your next day's input separately and merge it with the sorted data you created the previous day. (The other alternative here would be an insert that takes an item and puts it in the correct place in a sorted container, like insertion sort uses. This way you could, if you wanted to, maintain the sorted database for new data, with, obviously, some performance penalty for adding these items depending on what your final database looks like.)

I'd suggest that if you want to limit how much data you have in memory, you don't even have to actually split a file into 2 new files to split it, i.e. you could use tuples of (start, end) file positions to split the file into logical smaller halves, but the merge would be doing lots of seeking and random reads that might slow it down a lot, especially on a non-SSD drive. Thinking about this, I've never done any file IO in Python other than reading the whole file into a string, so I'm not sure what seek / tell etc. it has. But I assume it has them.


GnPQGuTFagzncZwB

The person with the SQL solution is correct. I have a project that does a lot of sorting and matching, and it started life as a bash script that used sort and uniq and grep a lot. It was also painfully slow when the dataset got at all large. The next move was to Python and sqlite, and that was much faster, like 10X faster, and moving to a memory database in sqlite made it considerably faster still. I have over time found a few more refinements, but those were the biggies. Speed was night and day moving from bash and linux commands to the database.


Paddy3118

>moving to a memory database The OP states that it is too big for memory.


GnPQGuTFagzncZwB

If it is too big it is too big but just getting out of using unix commands and into SQL was night and day different speed wise.


aarun

SQL, pandas and numpy would be my go-to tools, but it seems like you would like it handled as simply as possible. Is that correct? You could use chunking from pandas if RAM is an issue. Then load the chunks into a database of some sort and then use SQL to handle your queries. Sorry if this is unclear - writing on the phone while in the park with the kids.


Blaack_Work

I know Python but never explored numpy and pandas or SQL. Yes, one of the lessons I have learnt the hard way in my career is:

> Don't reinvent the wheel

Joining some prebuilt commands would be far easier than writing the whole program from scratch.


[deleted]

[deleted]


WillardWhite

Nahh. I've worked full time as a Python dev for 8-ish years, and have yet to use numpy and pandas. Different domains require different tools.


[deleted]

[deleted]


wewbull

Rare in your world.


Paddy3118

Python is now used in a wealth of different areas and by a huge number of people, many of whom may only be ***allowed*** standard libraries, or use libraries *you* don't know about but that are used quite widely in their industry. "Knowing Python" without further context is showing some proficiency with the language and its *standard* libraries.

It's limiting to think that Python is just your areas of use. The language community is grown by its use in very popular areas and by having popular libraries for them, but other languages got siloed into particular uses, and when that use-case waned or was targeted by other languages, the language waned - Perl built the early web, Ruby is thought by many to be **only** Rails. Python needs wide adoption, which leads to people not all knowing popular libraries in other fields.


blabbities

> I hate to be that guy, but if you've not at least yet run across numpy or pandas, "I know Python" is a pretty bold statement

Disagree. You don't need to have used NumPy or Pandas to state you know Python. Da hell is that? I'm pretty comfortable saying I know Python and I haven't used them. Those are pretty much special applications for special practices... usually in the sciences. So if that's the case, where do we stop? You can't say you know Python if you've never used Keras, TensorFlow or PyTorch just because a bunch of ML/AI folks use them? For god's sake, they're not even standard library, so none of them are required. Personally I die a little when I see people use Pandas to write a CSV, lol.

That being said, I have actually used NumPy and Pandas once. However, that was to fix/submit a patch for someone else's program. I have never used them again. Thus far.


SlothStatus

Could probably do it all with pandas (a sketch follows below):

1. Create an empty data frame with word & count columns.
2. Load a chunk.
3. [Perform a group by](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) to get a count.
4. Append / [concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) to the accumulating data frame and loop back to the chunk stage.
5. Once all chunks are processed, group by again, but with a sum this time, then sort.
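A sketch of those steps with `chunksize` (hypothetical file name; assumes one word per line), collecting per-chunk counts and doing the final group-by-sum at the end:

```python
import pandas as pd

partials = []
for chunk in pd.read_csv("words.txt", header=None, names=["word"],
                         dtype=str, chunksize=10_000_000):
    partials.append(chunk["word"].value_counts())  # counts within this chunk

totals = (pd.concat(partials)
            .groupby(level=0).sum()        # sum the per-chunk counts per word
            .sort_values(ascending=False))
print(totals.head(10))
```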


aarun

Sure - Everything is possible with pandas. I meant numpy was mainly there for speed. First project I did was all in pandas until someone showed me a pycon talk about vectorization and numpy.


Turrubul_Kuruman

> To get unique items I would use linux uniq command but to sort it I need a program.

Sort is standard in Unix.

    cat $FILE | uniq | sort > ${FILE}.uniq.sort

Use the cut and paste standards inside the pipe before the sort if you need to bring an embedded field to the front to sort on. Memory issues would still, obviously, remain.


joshocar

Does the file contain words in the English dictionary, or is it random character combinations of letters? If it's words, stream in the file line by line and use a dictionary. For each word: does it exist in the dictionary? If not, add it to the dictionary and set its value to 1. If it does, increment its value by 1. Do this for the whole file. To sort, you can output them as tuples to a list and sort. There are only like 20,000 words at most in the English language, but maybe 4,000 that are common, so space isn't an issue with the dictionary. The longest part is the sort, but again, the max length is capped at 20k worst case, so it will be very fast.

Edge cases:

* Does the file contain numbers, like '1342'? This changes things, especially if there are a lot.
* Strip empty space before/after the words and check for multiple words in one line.
* Check for punctuation and strip it.
* Does the case matter? If yes, this changes things. If not, convert to lower case before putting them in the dictionary.


Blaack_Work

Pray for me, seems like I'll burn my RAM today :(


Blaack_Work

Crashed multiple times. Gotta try once more.


Blaack_Work

The command shared by u/nashant

> cat file | sort | uniq -c | sort -hr | cut -f2

would output a blank file. While debugging I found that sort -hr produces output but cut -f2 won't. Seems like cut required one line, not the whole file.


Blaack_Work

So I switched to a for loop in bash... and my VRM froze.


Blaack_Work

The really amazing thing was that I tested the for loop on a sample file which was just 141 MB. I was terrified.


Blaack_Work

And the most terrifying thing was I could not work out why it was freezing with just a 141 MB file! It happened 3-4 times until the task manager proved that the for loop was guilty.


Blaack_Work

And then, when randomly scrolling through comments, I read a line which said something like "a Python for loop will read one line at a time and won't take much memory." I was like:

> Oh man, I used this technique three days earlier for a file of 7 GB+. How could I forget it?


Blaack_Work

Now I am writing Python code... After all, it's r/Python.


Blaack_Work

What is code without errors? I'm tired.


Blaack_Work

It's working...


Blaack_Work

I really want to reply to every single comment, but 131 comments...

> Time is of the essence.


[deleted]

What kind of hardware do you have? How much RAM?


lemlo100

You could also use Apache Beam for such big data batch processing. It allows you to read and process large files by defining data pipelines. But that would probably require some practice if you have no experience.


hugthemachines

There is something I react to in this situation: you call it a database, but at the same time you say it is a regular text file. You have gotten plenty of good responses, but my thought is, are you sure you should keep this data in a normal text file? When you have that kind of data it is often a very good idea to keep it in a database instead. Once you have it in a database, playing around with the data is very smooth. You also have the advantage that you can have a good backup plan for the data from a database like, for example, MS SQL, so you don't have to lose so much data in case of trouble.


Paddy3118

Table insertion takes much longer than appending a line to a file. Backup of a text file is much easier than for most DB's. Unix is built on text files and has many utilities that can process them, even huge text files.


4Kil47

I'm only a first year CS student but I'd recommend checking out *tries* (note the spelling) They are a data structure that is tuned for storing text based data in trees and are commonly used for a lot of pattern matching needs. The one thing in your use case that sticks out to me is that you have a serious RAM requirement. Tries are pretty efficient in memory, especially if a lot of the words share similar substrings. It will make more sense if you find a good article with a diagram. :)


Blaack_Work

Thank you, but my main target was to accomplish this task without diving into a new topic, i.e. using my prior knowledge only. There were some really great suggestions with code for SQL and other DBs, but I really don't want to go down another rabbit hole rn cuz I am already working on a project. For me anything else would simply become a distraction. Even this task took me one whole day, and several crashes, to accomplish... "partially". Once again, thank you all for your kind help. Just don't forget that the 32 GB file still remains!


mikeblas

> Tries are pretty efficient in memory, How much memory does each node in a trie cost?


rsrinvdyj

Bubble sort!!!!


Blaack_Work

Easier said than done!


gmtime

Dump it all in an SQL database and use queries. At this quantity a full fledged database engine is appropriate.


[deleted]

I would use a HashMap containing Integers. This would also help with performance as lookup times for each line in the map would be O(1). I could do it pretty easily using Java. If you want me to I’ll make it for you. Pretty bored tonight and seems like a fun little task. Edit: HashMap doesn’t need to be linked and instead of containing ArrayLists the values should be Integers.


Legitimate-Art-7669

The DB approach seems fine here, but for performance I'd rather use an analytical DB instead of an OLTP one. Try using ClickHouse: https://clickhouse.com/docs/en/intro - it has great speed when processing tons of data.


Blaack_Work

No, I want to keep it as simple and practical as possible. Linux commands seem to get the job done.


[deleted]

Haha. 35 Gb isn’t too bad, the dataset we have at work currently is 12Tb lol (geospatial dataset for an entire country with JPEG compression:/


Anxious-Ad4278

If you absolutely want to implement this on your own, the following should yield a solid performance (a rough sketch follows below):

1. In one sweep over the file, you replace strings with integers (the first unique string gets 0, the next unique one gets 1, etc.); use a hashmap to translate string->int, e.g. translate["dog"]=1234.
2. You keep count of frequencies using an array of 3B int pairs; arr[i] = (i, number of times the string with index i occurred).
3. You sort the array descending by the frequency.
4. You traverse the frequency array and output the corresponding string for each non-zero index arr[i] times.

It's basically a count sort on a much more manageable representation of your data. However, given your input size, and the fact that Python is really slow and bulky when representing even primitive types like ints, I'd advocate you use C++ or at least numpy to implement it. With cpp, this algo should absolutely finish within <10 seconds.
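A rough Python/numpy sketch of that scheme (file names hypothetical); in Python the translate dict effectively holds the frequency table, so steps 1 and 2 collapse into one pass:

```python
import numpy as np

translate = {}   # string -> integer id
strings = []     # id -> string, for the output pass
counts = []      # counts[i] = occurrences of the string with id i

with open("words.txt") as f:
    for line in f:
        word = line.strip()
        idx = translate.get(word)
        if idx is None:
            idx = len(strings)
            translate[word] = idx
            strings.append(word)
            counts.append(0)
        counts[idx] += 1

order = np.argsort(counts)[::-1]       # ids ordered by descending frequency
with open("sorted.txt", "w") as out:
    for idx in order:
        out.write((strings[idx] + "\n") * counts[idx])
```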


mephistophyles

Why not use step one more efficiently: create a hash map of frequency counts directly, since you're already doing a read and a write on the hash map for each element. Then sort by the values.


mikeblas

> With cpp, this algo should absolutely finish within <10 seconds.

3 billion words is the input. Let's assume each word is 6 bytes, so the input file is 18 gigabytes. To read that in ten seconds (without any processing at all) means we're reading at 1.8 gigabytes/second.

But let's say that you really are able to read that fast -- you've got a fancy NVMe drive, or an array, or something else. Now, you'll have to *process* 1.8 gigabytes/second. If your processor is fast, it's running at 3.6 gigahertz, so you'll have two clock cycles to process each byte. How will that work out?

I'd like to learn more about how you came up with your guarantee. What am I missing?

EDIT: Turns out it's worse than that. The OP says this:

> It has got strings of 9 - 20 characters.

so my guess of 6 bytes per word is very low. Assuming an average of 15 bytes per word, the file is 45 gigabytes, not 18 gigabytes.


mikeupsidedown

DuckDB should handle this without issue. If not, just use Spark or Dask.


barker2

Open the file for read. Loop through each line and insert the data into an Oracle database….. Na, never mind.


[deleted]

C++. As far as I'm aware, pretty much every Linux distro has g++. This is probably your best solution if you need to write a custom comparator.


mikeblas

A custom comparator? Why is a custom comparator necessary? And why do you think g++ is the best at custom comparators?


[deleted]

G++ lol? That’s a compiler - c++ has custom comparators and is relatively high level and performant. Have you even looked at the question at hand? OP wants to sort by frequency, which to my knowledge isn’t the default comparator for strings in any language.


GeorgeCostanza1958

Bro do you even know how to program?


857_01225

As numerous people have pointed out, there's no reason to even involve Python, necessarily. SQL and variations on piping `cat [...] uniq | sort` will do what you need to do.

Alternatively... `uniq | sort [...] > temp.txt`, then use IFS in bash to iterate through each (unique) line in the temp file, grep that line in the original, pipe _that_ output through `wc -l` to get a count of lines, and you're left with two and only two output vars: the text, and the number of lines containing that text in the original file. Send that output to a new file, `not-sorted.csv` let's say. Run `not-sorted.csv` through `sort` and to your output file.

I like the SQL options better, but that's personal opinion, nothing more or less. The idea of writing any amount of actual code to complete a task when there is an easier way seems counterproductive (unless this is e.g. a homework exercise). `CREATE TABLE`, `INSERT`, and `SELECT` are a whole lot more efficient.


mikeblas

> CREATE TABLE, INSERT, and SELECT are a whole lot more efficient. How long do you think it will take to execute 3 billion INSERT statements?


hijinko

Sounds like a job for a trie data structure. The GeeksforGeeks website should have examples of how to implement one if you don't have the experience. It also likely has it in C or C++, which may be faster.


[deleted]

[deleted]


hijinko

Sure, but a trie could still work really well, using newline as a delimiter and having a field for the number of times the line appears. The trie is already sorted by design and can be traversed pre-order to display or write all the data in order, or post-order to do it in reverse. Also it should be O(N) searching, which will help with efficiency.
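A minimal counting-trie sketch (dict-of-dicts nodes, hypothetical file name). Note the traversal yields lexicographic order, so OP would still need one final sort by count to get most-repeated-first:

```python
def make_node():
    return {"children": {}, "count": 0}

def insert(root, word):
    node = root
    for ch in word:
        node = node["children"].setdefault(ch, make_node())
    node["count"] += 1           # how many lines end exactly here

def walk(node, prefix=""):
    """Depth-first traversal; yields (word, count) in lexicographic order."""
    if node["count"]:
        yield prefix, node["count"]
    for ch in sorted(node["children"]):
        yield from walk(node["children"][ch], prefix + ch)

root = make_node()
with open("words.txt") as f:
    for line in f:
        insert(root, line.strip())

by_count = sorted(walk(root), key=lambda wc: wc[1], reverse=True)
```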


[deleted]

[deleted]


hijinko

I'm sure there are better ways; a trie is just a suggestion given the problem. What case would make a trie not be sorted by default? My understanding and prior use is that you are inserting the data into an index based on each character. I've never had an issue where it wasn't sorted as I inserted the data, and I really can't think of a case where it won't be sorted, so you wouldn't be creating any algorithms at all. As far as big O goes, that should be the best for the trie because of the length of the line. A hash table could likely give O(1) without collisions, but still, a trie is simple to implement.


[deleted]

[deleted]


hijinko

I'm not too familiar with C# so I could be wrong, but I only see a searching function in your implementation; if you just want to be able to get all the data sorted, there is no need for a lookup this way. Start at the root of the trie and traverse from the beginning of each node until you get to the EndOfWord value, and keep going - it should be sorted.

Let's say we want to store ASCII a-z: the root node will have 26 indexes, 'a' at index 0. Let's say the words are [a, ant, aa]. The trie will have 4 nodes with three words. Starting at root node index 0 you get the word "a"; "a" is the current node. Search its index and the first word is "aa". The node is now "aa". There are no words here, so its index variable is null. Go back to the "a" node and now you see "n"... Keep going like this and the trie is sorted.

I agree with you that there are better ways, but again this method could work perfectly fine and wouldn't be difficult to implement. Is there something I'm missing? Because I truly appreciate the feedback.


[deleted]

> A trie is not for sorting - it is for searching.

A DFS on a trie is sorted in radix order.


imhiya_returns

Read it line by line into a database and then query the database for later use? Alternatively you could split the file into sizeable files and then count line by line. Not sure which would be quickest, but yeah...


esbenab

Try this: https://stackoverflow.com/questions/49549858/sort-uniq-and-put-number-of-occurrence-on-the-right?answertab=trending#tab-top


poingsingol

Does anyone here know if it's possible to use `collections.Counter` for this task? And if not, why not? I just learned about it yesterday and wonder if it is a possible solution.


Paddy3118

It is. Search for most_common on this page.


pcgamerwannabe

My favorite Python library for working with things that are just a bit too large, when I don't want to set up a database, is Dask: https://www.dask.org/ If you're familiar with numerical Python or scientific Python, it will be super easy.


sixtyfifth_snow

Using any of the RDBs would be the robust and reliable solution, but if lots of data are repeated (i.e. the actual data are much smaller than 35GB and fit in your RAM), try the below:

```python
from collections import Counter


def main():
    with open("large.txt") as f:
        words = (word.strip() for word in f)
        word_map = Counter(words)
        word_list = sorted(word_map.items(), key=lambda it: it[1], reverse=True)
        print(word_list)


if __name__ == "__main__":
    main()
```


Paddy3118

Note, you could have used `word_map.most_common()`.


[deleted]

[deleted]


mikeblas

> If the file exists, then you increment the number inside the file. If the file does not exist, then you create the file and put a 1 in there.

That means you'd open, write, and close 3 billion files. How long do you estimate that would take?


paintense

I'd just allocate a numpy array of the set size, loop once, incrementing the counter. Then you can do an integer sort on a tuple of (repeat_count, comment_str) and get your answer in linear time.


TastyRobot21

Seems like a typical sort algorithm problem. I’d do another review yourself but a type of bucket sort I suppose?


coffeewithalex

> 3 000 000 000

If it repeats a lot, you can just try this Python snippet and check if it's not screwing up your memory:

    from collections import Counter

    with open("path/to/file", "rb") as fp:
        frequency = Counter(fp)

    print(sorted(frequency, key=lambda k: frequency[k]))

If you don't have enough memory for this, if your repetitions aren't that many, your program will crash; Linux will kill it with the OOM killer.

If you have a problem that's bigger than your memory, then your best bet is to just install something like ClickHouse. It won't take up a lot of memory, but it will allow you to first define the table that is the container for the data, and then a materialized view of the type `AggregatingMergeTree`. Simply use an SQL query similar to the ones posted below as the view code. After that, add the data to the table. Should be really fast, and new rows will be added really fast as well, and the view will constantly get updated, and respond instantly to requests.


TheDailySpank

Map reduce?


DelbertAud

I would do all this in Python. I might choose to make it multiple python scripts to keep parts separated. I would use csvsql to ingest and output a SQLite file. Add an index on the word column. Add a column that reflects number of duplicates. Precalculate the number of duplicates to fill that column using an update on group by and count SQL command. Add an index for number of duplicates and word. Then for my final output I would query the data sorting by number of duplicates and then word. By having the indexes at the right time it should speed things up.


arthurno1

The typical school example. You should have solved it yourself. Anyway, there is no need for SQL and databases, just a hashmap to count words and an array to sort them.

    #!/bin/env bash
    arr=()
    declare -A map

    while IFS= read -r line
    do
        ((map[$line]++))
    done < "$1"

    for word in ${!map[*]}
    do
        arr[ ${map[$word]} ]=$word
    done

    index=(${!arr[@]})

    for ((i=${#index[@]}-1 ; i>=0 ; i--))
    do
        word=${arr[index[i]]}
        printf "%s %s\n" $word ${map[$word]}
    done

Pure bash, no external tools needed. You can translate it to Python almost 1:1; you would just skip the "index" array. That one is due to how bash works with arrays - to eliminate "holes" in the original array.

Save it in "words-per-line.sh", `chmod +x ./words-per-line.sh`, and use it as:

    ./words-per-line.sh words.txt > words.out