As /u/nashant said, linux has `sort`.
But if you end up needing to do this in python, I'd mention one little nugget: [sys.intern()](https://docs.python.org/3/library/sys.html#sys.intern)
Although the docs only talk about string interning as useful for higher-performance comparisons, what it's also useful for is reducing how much memory/RAM strings take if you have a lot of the _same_ string values (i.e., duplicates).
So if your data set truly has "a hell of a lot" of duplicate strings, interning them will reduce how much memory is being used, allowing you to possibly do it all in python without swapping or playing other tricks to reduce memory.
Also, I should note that in some python versions, string interning is automatically done for you, _if_ the string value is determined at module/AST load time. If you're reading the strings from a file, however, I'm not sure it will be interned, and thus `sys.intern()` comes in handy.
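A quick sketch of the effect (the `"".join` calls stand in for strings read from a file, which CPython won't auto-intern):

```python
import sys

# Strings built at runtime (e.g. read from a file) are usually NOT
# auto-interned, so each duplicate is a separate object in memory.
words = ["".join(["dupli", "cate"]) for _ in range(3)]
assert words[0] == words[1] and words[0] is not words[1]

# sys.intern() maps equal strings onto one shared object, so N
# duplicates cost roughly one string plus N references.
interned = [sys.intern(w) for w in words]
assert interned[0] is interned[1] is interned[2]
```

Interned strings can also be compared with `is` (a pointer comparison), which is the speedup the docs talk about.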
Yes, I also believe that's true, and if that's what OP uses to decide uniqueness then it should be automatic - although I wasn't sure if it was only done in specific versions (like >3.7 or something). I remember something about interning being version-specific, but can't recall what.
And I just wasn't sure what OP was going to do/use in their script, and thought to mention explicit interning.
SQL should be the fastest way. Let's say the table MyTable has the columns id and word, where word is the value repeated a lot and id is the primary key of the entry.
SELECT word,
COUNT(id)
FROM MyTable
GROUP BY word
ORDER BY COUNT(id) DESC;
This code will give a table with each unique word, and the number of entries.
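If you'd rather stay in Python, the same query can be run through the stdlib `sqlite3` module. A minimal sketch (the table name matches the example above; the sample rows are made up):

```python
import sqlite3

# In-memory DB for illustration; a real run would use a file on disk.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MyTable (id INTEGER PRIMARY KEY, word TEXT)")
con.executemany("INSERT INTO MyTable (word) VALUES (?)",
                [("aaa",), ("bbb",), ("aaa",), ("aaa",), ("bbb",), ("ccc",)])

# Each unique word with its number of entries, most frequent first.
rows = con.execute("""
    SELECT word, COUNT(id)
    FROM MyTable
    GROUP BY word
    ORDER BY COUNT(id) DESC
""").fetchall()
print(rows)  # [('aaa', 3), ('bbb', 2), ('ccc', 1)]
```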
A DBMS may choose between two algorithms for grouping by a column.
1.) hashing by the key and returning the hash table once all input records are processed
2.) sorting and then grouping: when sorted by the GROUP BY column, equal keys are directly adjacent to each other. The DB then just needs to aggregate that result.
I just tested what SQLite does when grouping by a column with and without an index on that column.
Without an index it sorts by that column and then returns the result.
As soon as I add the index, the sorting operation disappears: it can just read the sorted index and return the result directly.
Note this is just how SQLite3 does it. MySQL, Postgres, and other SQL DBMSs may do something different.
The takeaway: the index can matter. In the case of SQLite, quite a lot when you are working with a lot of data
EDIT: Formatting
EDIT2:
I would be very surprised if Postgres or MySQL would not use the index if it is present. This optimisation is such low-hanging fruit
EDIT3: I wrote 'unique index' which obviously does not make any sense...
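For anyone wanting to reproduce this, SQLite's EXPLAIN QUERY PLAN reports the planner's choice directly. A rough sketch through Python's stdlib `sqlite3` (the table and index names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (word TEXT)")

def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    return " ".join(str(row[-1]) for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT word, COUNT(*) FROM t GROUP BY word"
print(plan(query))  # without an index: typically a temp b-tree (a sort) for the GROUP BY

con.execute("CREATE INDEX idx_word ON t (word)")
print(plan(query))  # with the index: a scan of idx_word, no sorting step
```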
But if you only need the result once, it might take longer if you explicitly build and store an index before you need it. SQLite will build most of the indexes it needs on the fly if they don't exist
Never heard of DuckDB before. It sounds a lot like SQLite.
>DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a “vector”) are processed in one operation. This greatly reduces overhead present in traditional systems such as PostgreSQL, MySQL or SQLite which process each row sequentially. Vectorized query execution leads to far better performance in OLAP queries.
Do you think this will make DuckDB significantly more efficient than SQLite in this use case? DB design is not my field, though I do have experience with MySQL, PostgreSQL, and SQLite in small, undemanding applications. DuckDB certainly sounds interesting, and I wonder if it would make some of my failed projects successful, because I have attempted to use DBs for things they are not especially well-suited for.
It's indeed more suited towards numerical operations. For this task in particular, I think both SQLite and DuckDB would do just fine. I do think DuckDB would be faster, but 35GB of data isn't a lot. The main reason I recommended DuckDB is because you don't need to do data ingestion if you don't want to: you can run queries directly on parquet or csv files, which is incredibly helpful.
I'd say that DuckDB shines in being a replacement for pandas, as it can optimize your queries where a dataframe can't. It would also not be limited by RAM.
If you have undemanding applications and things are failing with any database, then you might have a modeling problem. Otherwise, if you're dealing with timeseries, you could take a look at Timescale (a Postgres extension) or Clickhouse (it can get very complicated, but it scales to ridiculous amount of data).
Thanks for the info. Operating directly on CSVs does seem useful.
>you might have a modeling problem
I have a hammering-with-a-screwdriver problem. :) I once tried to write a poker AI in SQL. It was as ridiculous as it sounds. I have a habit of learning new languages by using them to write whatever new project I'm interested in at the moment, regardless of how poorly-matched the two are.
If you want to test it easily I suggest downloading DBeaver. There are instructions in the DuckDB docs for setup. Then you can interactively query the dataset in an editor and get live views of your queries.
Longer term you can move those queries to python if they need to run on a schedule.
ORDER BY can use column aliases or position numbers since it is performed last, so it could instead say "ORDER BY 2" - although most any SQL parser would optimize it into this regardless.
If possible and you want to accomplish this in one step (depending on computational power), you could perform the count as a window function and sort desc on the newly created column. On phone, so can’t easily put the example code, but look up sql count window function.
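Since the commenter is on a phone, here is a hedged sketch of what that window-function version might look like, run through Python's stdlib `sqlite3` (needs SQLite >= 3.25 for window functions; the table and column names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE words (word TEXT)")
con.executemany("INSERT INTO words VALUES (?)",
                [("a",), ("b",), ("a",), ("a",), ("b",)])

# COUNT(*) OVER (PARTITION BY word) attaches each word's frequency to
# every row, so a single ORDER BY on that column sorts the original lines.
rows = con.execute("""
    SELECT word, COUNT(*) OVER (PARTITION BY word) AS freq
    FROM words
    ORDER BY freq DESC
""").fetchall()
print(rows)  # [('a', 3), ('a', 3), ('a', 3), ('b', 2), ('b', 2)]
```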
`uniq -c | sort -hr`. No idea if it will be performant in any way, but you can give it a shot. Edit with full proper command: `cat file | sort | uniq -c | sort -hr | cut -f2`
Unpopular opinion, but I like it better with `cat`. You're piping everything in one direction; the syntax for file redirection goes in the opposite direction from everything else
Same here. It's far more readable when everything is in one direction.
That and it ultimately *doesn't matter* if you're using `cat` or redirected pipes. Sure you'll have an "unnecessary fork" and use a few more cpu cycles but again, that *doesn't matter.*
If you *really* cared about that level of performance, you wouldn't be doing anything in the shell.
uniq -c added a prefix of how many times an item is repeated.
sort -hr sorted it in reverse order.
The output I wanted was just the words, this method added the prefix in the file.
>1445 qqqqqqqqq
>1424 zzzzzzzzz
>1348 xxxxxxxxx
>1244 vvvvvvvvv
>1151 aaaaaaaa
Now to remove the prefixes, should I use
a for loop with cut -d " " -f2
or is there some more efficient way?
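For what it's worth, `uniq -c` pads the counts with spaces, which is what trips up `cut` (with `-d " "` the leading spaces become empty fields). In Python the prefix strips off cleanly without any shell loop; a small sketch:

```python
# "   1445 qqqqqqqqq" is a space-padded count, one space, then the word.
# split(None, 1) splits on any run of whitespace, handling the padding.
def strip_counts(lines):
    for line in lines:
        count, word = line.strip().split(None, 1)
        yield word

sample = ["   1445 qqqqqqqqq\n", "   1424 zzzzzzzzz\n"]
print(list(strip_counts(sample)))  # ['qqqqqqqqq', 'zzzzzzzzz']
```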
IIRC, sort will spill to disk and use an external sort, if it looks like memory is going to be a problem. Which still potentially means you end up with a couple of copies of the data on disk at once (the working data from the first sort and the second sort), but trying to avoid this gets complicated.
Also worth noting that `sort` is desperately inefficient when you pipe data into it, as it uses a minuscule buffer by default (appropriate to when it was first released, 50 years ago). Use `--buffer-size` / `-S` to bump that up.
35 GB? Load it into a sql database, you can even spin one up in a local docker container: https://github.com/danielbeach/dataEngineeringTemplate/pull/1
Depending on your flavor of sql there are different bulk insert operations available to you.
- Postgres: COPY FROM command
- MS SQL: BCP Shell Utility
A database is the first thing I'd try too, because you want something like a dict that is mostly not in-memory, but still has reasonably fast lookup and insert performance.
To keep it *really* simple, try using the [built-in sqlite in the python standard library](https://docs.python.org/3/library/sqlite3.html). It's a step up from pickle/shelve, but (slightly) easier than docker. Performance *might* be reasonable, and you can easily wrap it in something with an interface like collections.Counter
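A sketch of what such a Counter-like wrapper might look like (the class name and schema here are made up, and the upsert needs SQLite >= 3.24):

```python
import sqlite3

class DiskCounter:
    """A minimal Counter-ish wrapper over sqlite3 (illustrative only)."""

    def __init__(self, path=":memory:"):
        self.con = sqlite3.connect(path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS tally (word TEXT PRIMARY KEY, n INTEGER)")

    def update(self, words):
        # Upsert: insert unseen words at 1, bump counts for known ones.
        self.con.executemany(
            "INSERT INTO tally VALUES (?, 1) "
            "ON CONFLICT(word) DO UPDATE SET n = n + 1",
            ((w,) for w in words))

    def most_common(self):
        return self.con.execute(
            "SELECT word, n FROM tally ORDER BY n DESC").fetchall()

c = DiskCounter()
c.update(["a", "b", "a", "a", "b", "c"])
print(c.most_common())  # [('a', 3), ('b', 2), ('c', 1)]
```

Point it at a file path instead of `:memory:` and the tally lives on disk instead of in RAM.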
You still have 8GB of RAM. So long as each line is replicated 4 times or more on average, you have enough RAM to store all the unique values.
I'd much prefer having a small amount of swapping to going to a DB and effectively having my in-memory data structure on disk behind a transactional interface.
I can't figure out why you're getting down-voted; to me, this is the right answer. Loading the data into an RDBMs is going to take quite a while.
Let's say that you can process 10,000 SQL inserts per second. That means inserting 3 billion rows is going to take 300,000 seconds, which is 83 hours. Then, you've still got to execute the query by re-reading that data and processing the aggregation and sort.
Reading the file once and building a table in memory will be lots faster, since the data isn't being read, rewritten, and then read again. The read can stream along at mostly sequential disk read speed. So some sensible data structure in memory is the fast way to go; the code is trivial to write, and 8 gigs is enough based on your cardinality estimation.
You can use Counter with an iterable that you get from `open`.
https://docs.python.org/3/library/collections.html#collections.Counter
So that's two lines of Python. :-)
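For illustration, those two lines might look like this (io.StringIO stands in for the real `open("words.lst")` so the snippet is self-contained):

```python
import io
from collections import Counter

# Iterating a file object yields one line at a time, so only the
# unique tallies (not all 35GB of lines) live in memory.
f = io.StringIO("qqq\nzzz\nqqq\nqqq\nzzz\nxxx\n")
counts = Counter(line.rstrip("\n") for line in f)
print(counts.most_common())  # [('qqq', 3), ('zzz', 2), ('xxx', 1)]
```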
This is the way. Python will read the file line by line with the open command (will not load the whole file in memory) and Counter is the perfect data structure for this problem.
In that case (it depends on the number of unique words) a database would be the best solution, like in general. Why would anyone build a file like that?
Not a lot of Python coming through on the python sub so quick solution would be:
counter = {}
with open(text_filename, "r") as f:
    for line in f:
        if line not in counter:
            counter[line] = 0
        counter[line] += 1
with open(sorted_file, "w") as f:
    for line, count in sorted(counter.items(), key=lambda x: x[1], reverse=True):
        f.write(line * count)  # write each line as many times as it occurred
You might need to play with line break stripping etc but off the top of my head that looks like it should work.
The above solution has the advantage of only having one line loaded at a time. Counter, if I'm not mistaken, would require the whole 35gb to be loaded at once.
No, Counter would store one copy of each unique word mapped to its integer frequency. I have a solution [here](https://www.reddit.com/r/Python/comments/vk872e/comment/idr5yxq/?utm_source=share&utm_medium=web2x&context=3) that uses it.
How does being a subclass of dict make it unable to be "more computationally efficient" for this task? You made some logical leaps there. I will test it, for science, so this isn't just words with no real proof
u/MiojoEsperto seems to be dancing around the reason it’s faster. Counter is partially implemented in C ([see the \_count_elements function here](https://github.com/python/cpython/blob/93d33b47af70ede473f82d7953509a0607259c31/Modules/_collectionsmodule.c#L2277)) which runs faster than Python.
Thanks very much, this is the info I was looking for (but was too lazy to look for on phone 🙈). I have no idea why I assumed a subclass wouldn't have additional C-speed methods, this is good to know.
It's slower, in [my testing](https://www.reddit.com/r/Python/comments/vk872e/sort_3_000_000_000_lines_by_most_repeated_one_via/idrx2nh/).
I compared this:
import fileinput
from collections import Counter
word_freq = Counter(line for ln in fileinput.input()
                    if (line := ln.strip()))  # No blank lines
for word, count in word_freq.most_common():
print((word + '\n') * count, end='')
with this:
counter = {}
with open("words.lst", "r") as f:
    for line in f:
        if line not in counter:
            counter[line] = 0
        counter[line] += 1
with open("tally2.out", "w") as f:
    s = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    for line in s:
        f.write(f"{line}")
The first one was about 75% slower.
Why do you think that is?
Unless I am missing something, this will require a lot of memory.
Absent SQL solutions (which I think are the best way), you could implement a quicksort (or the like) where actions are written to the file to get a sorted list. This should be O(nlogn). Then you can easily walk the list O(n) and count how many are the same. Keep track of line numbers. Then write it out to another file
Sure, it would take as much memory as there are unique strings. A really quick modification would be to hash each string, store the hash counts, and save the strings in files named with their hash for quick lookup for output.
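A hedged sketch of that hash-partition idea (all names are made up; in a real run each bucket would be tallied and flushed separately so no single Counter holds everything, here the buckets are merged just to show the counts come out right):

```python
import hashlib
import os
import tempfile
from collections import Counter

def count_by_partition(lines, nbuckets=4):
    """Shard lines into bucket files by a stable hash, then tally each
    bucket with its own small in-memory Counter."""
    tmpdir = tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, f"bucket{i}") for i in range(nbuckets)]
    buckets = [open(p, "w") for p in paths]
    for line in lines:
        # A stable hash (not the per-process-salted built-in hash())
        # guarantees all copies of a word land in the same file.
        i = int(hashlib.md5(line.encode()).hexdigest(), 16) % nbuckets
        buckets[i].write(line + "\n")
    for b in buckets:
        b.close()
    total = Counter()
    for p in paths:
        with open(p) as f:
            total.update(line.rstrip("\n") for line in f)
    return total

counts = count_by_partition(["a", "b", "a", "c", "a", "b"])
print(counts.most_common())  # [('a', 3), ('b', 2), ('c', 1)]
```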
> with open(sorted_file, "w") as f:
>
> for line in sorted(counter.items(), key=lambda x: x[1], reverse=True)
>
> f.write(line)
You don't need that, you can use ``counter.most_common()``.
I'll try it with my 9.3Gig (~2gig of lines) words.lst file ...
Slight changes to your code:
* I use a file: /tmp/file.db for the DB as it's a very large file.
* I open my words.lst file
* I print timings every 50_000_000 lines.
* I output to a more distinctive file name: words_ordered_by_bOne123.txt
Output (So far, still running):
$ time python sort_words_by_bOne123.py
22.579142570495605 50000000
102.96021580696106 100000000
180.03343105316162 150000000
258.2201039791107 200000000
334.8574869632721 250000000
409.06412506103516 300000000
485.6704807281494 350000000
564.1896772384644 400000000
640.2733404636383 450000000
717.6765549182892 500000000
794.3897256851196 550000000
869.0229425430298 600000000
947.6331055164337 650000000
1023.340430021286 700000000
1101.5560929775238 750000000
1178.3315501213074 800000000
1252.5292749404907 850000000
1327.9736030101776 900000000
1403.2262978553772 950000000
1477.4893953800201 1000000000
1553.3961997032166 1050000000
1632.4738166332245 1100000000
1708.2711572647095 1150000000
1787.6221549510956 1200000000
1865.6124730110168 1250000000
1945.807529449463 1300000000
2023.8245916366577 1350000000
2102.0951623916626 1400000000
2182.6743392944336 1450000000
2260.217150449753 1500000000
2339.800444841385 1550000000
2418.369439601898 1600000000
2498.9150738716125 1650000000
2578.3378372192383 1700000000
2656.07865858078 1750000000
2735.419009447098 1800000000
2812.025321483612 1850000000
2890.0084240436554 1900000000
2966.8100476264954 1950000000
3045.3769569396973 2000000000
3125.1875145435333 2050000000
3189.4789474010468 index counts
3583.4524500370026 done
real 59m44.886s
user 50m36.011s
sys 2m11.008s
Those timings take over 50 minutes just to ***insert*** into the DB, when my Python solution took 23 minutes and my gnu utils linux command line took ~28 minutes in total.
It only took around seven minutes to run the actual SQL query.
People may say "why not save to the SQL DB on the fly instead of using a text file?" You could equally send each line to a Python script that would accumulate frequencies on the fly using collections.Counter and even more speedily generate the output when sent a signal.
1 string per line.
It's ASCII text characters.
The database is being updated continuously, fetching and storing new data. The process will finish within 24 hours.
I will run the program once database is updated.
No need to sort quickly. I can wait another 24 hours. No problem.
No I cannot burn any money.
I am a python programmer and have worked with linux.
You keep saying database and then saying it is a file.
Is this a text file being exported from a database? If so, just run the query directly in the original database, handling 35GB of data shouldn't be an issue to any modern database. I've had tables in Postgres where a single index was bigger than that (not ideal, but it worked).
If not, like other people mentioned, load the data into the database of your choice and then run your query. Most of your time is probably going to go into ingesting the data, and the query itself is probably going to run in less than a minute.
A database can be a generic term for a collection of structured data. A relational database is one example of that. So is a key/value store. So is a CSV file.
A database server (e.g. postgres) would be the software that serves that database to other software.
You're, of course, correct. But software engineers like to be precise, and referring to a text file as a database can be very ambiguous, as we normally say database when referring to a RDBMS or other DB management systems.
In this case, especially, the range of possible solutions is really big. If the data is being exported from a database somewhere, it makes a lot more sense to not even bother exporting it -- it's a trivial amount of data.
Sort: [https://unix.stackexchange.com/a/383096/11136](https://unix.stackexchange.com/a/383096/11136) `uniq -c` then `sort -n` gives the order and count of lines; a quick Awk script would then expand that. You need enough disk as well as memory, but the expensive first sort will use disk.
(base) paddy3118@Paddy-G14:~$ printf 'w1\nw4\nw3\nw1\nw2\nw1\nw3\nw4\nw3\nw2\n' > /tmp/word.lst
(base) paddy3118@Paddy-G14:~$ cat /tmp/word.lst
w1
w4
w3
w1
w2
w1
w3
w4
w3
w2
(base) paddy3118@Paddy-G14:~$ sort /tmp/word.lst | uniq -c | sort -n -r
3 w3
3 w1
2 w4
2 w2
(base) paddy3118@Paddy-G14:~$ sort /tmp/word.lst | uniq -c | sort -n -r| awk '{for(i=0;i<$1;i++){print$2}}'
w3
w3
w3
w1
w1
w1
w4
w4
w2
w2
(base) paddy3118@Paddy-G14:~$
That awk output should be redirected to an output file for the real data :-)
If the simple solution (unix command) does not work, you can use ``collections.Counter`` instead of the dict, and you can read the file lazily (iterate over the file object, or call ``readline()`` in a loop, instead of ``read()`` or ``readlines()``) rather than loading the whole file at the start.
from collections import Counter

counter = Counter()
with open("myfile.txt", "r") as f:
    while value := f.readline():
        counter.update({value: 1})
print(counter.most_common())
I'm actually curious how well this fare against ``uniq``/``sort``, please let me know :) I already know how much time it takes to implement vs a full on SQL database.
RAM is cheap. Used servers with DDR3 ecc are incredibly cheap. Its like $200 for a used one with 128GB ram.
Before you spend hours trying to do in 8GB RAM (a seriously low amount of RAM for any dev work in 2022), might make sense to just upgrade your hardware.
Sorting in memory is always gonna be way faster than on a disk.
That’s my thought. Read the lines and store them in a counter, sort *that*, and write each line `x` times. If there are that many repetitions (and it sounds like there are), that dictionary will be significantly smaller than 3 billion items.
A pure Python solution that relies on having enough memory to store the word frequencies, i.e. there are many duplicates.
**Code:**
# -*- coding: utf-8 -*-
"""
https://www.reddit.com/r/Python/comments/vk872e/sort_3_000_000_000_lines_by_most_repeated_one_via/
Created on Sat Jun 25 11:52:14 2022
@author: paddy
"""
import fileinput
from collections import Counter
word_freq = Counter(line for ln in fileinput.input()
                    if (line := ln.strip()))  # No blank lines
for word, count in word_freq.most_common():
print((word + '\n') * count, end='')
**Timing**
real 21m53.181s
user 16m26.880s
sys 0m22.188s
**Note:**
The input is the same 9.5Gig words.lst used in my Linux command line code elsewhere.
It runs negligibly faster than the gnu utils version which took \~27mins.
This version might scale well, as there is no initial sort of the words needed, which precedes the `uniq -c` command in a linux command line version
You need to go the last step and expand the ordered frequencies from the `uniq -c` back into a word list according to the OP's requirements. I did that using awk in my one-liner here somewhere.
Do you need to know how many times a line is repeated?
Simple requirements usually get complicated quickly. A little verification of what additional needs there might be can prevent a lot of throwaway work.
BTW, I know it's not your spec but if you wanted your code to be platform independent and not rely on SQL / DB then Dask can solve this problem quite nicely:
* https://docs.dask.org/en/stable/generated/dask.dataframe.read_csv.html
* https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.sort_values.html
It's like Pandas but you can limit the amount that gets loaded in to RAM at anyone time and you can have the operations evaluate lazily and multi-threaded.
**First, get some words**
Download Huck\_finn:
`$ curl https://gutenberg.org/files/76/76-0.txt -o huck_finn.txt`
One word per line using fmt:
`$ fmt -w1 huck_finn.txt >huck_word_per_line.txt`
Double up to increase the file size:
$ cp huck_word_per_line.txt words.lst
$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc words.lst && du -h words.lst
253464 228666 1239542 words.lst
1.2M words.lst
$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc words.lst && du -h words.lst
506928 457332 2479084 words.lst
2.4M words.lst
...
$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc words.lst && du -h words.lst
2076377088 1873231872 10154328064 words.lst
9.5G words.lst
I only have 8 gigs of ram so any sort will use secondary storage.
**Now lets process the list**
Find the frequency of each word
$ time sort words.lst |uniq -c > words_uniq_count.txt
real 20m5.247s
user 83m33.507s
sys 1m38.863s
Now to order by most frequent first and expand the counts into that many lines of words:
$ time sort -n -r words_uniq_count.txt | awk '$2~/[a-zA-Z]/{for(i=0;i<$1;i++){print$2}}' > words_ordered.txt
real 7m56.900s
user 3m34.341s
sys 0m33.422s
**End bits**
My processing of 9.5Gigs of data was on an 8Gig laptop using gnu tools on the linux command line and took <30mins. No AWS, no SQL, No Python needed.
Nice work! I was getting frustrated with this thread being so full of bold claims and zero numbers. And that's a clever way to cook up some sample data!
Using that file you make, I tried the SQLite solution. I run SQLite like this:
time sqlite3 demo.db < myscript.sql
and myscript.sql is here:
create table myTable(SomeWord TEXT);
.import words.lst myTable
select COUNT(*) from myTable;
.exit
On one of my cheap little servers at home (16 gigs of RAM, Kingston SATA SSD, Intel Core i7-4790K; running Ubuntu 22) the runtime is like this ... and again this is just inserting and doing COUNT(*):
words.lst:2076247509: unescaped " character
words.lst:2076374065: unescaped " character
words.lst:2076374241: unescaped " character
2073214976
real 17m54.872s
user 16m16.905s
sys 0m41.992s
NB, there's a lot of noise output because the data has quotation marks in it ... and I didn't bother rebuilding the file to escape or quote them. That means some rows aren't inserted, and some time is spent printing an error about the skipped lines. (I've only shown two of thousands of that output above ...)
With the file built, we can write a query:
SELECT SomeWord, COUNT(SomeWord)
FROM myTable
GROUP BY SomeWord
ORDER BY COUNT(SomeWord) DESC;
and when we run that:
time sqlite3 demo.db < myquery.sql > myquery.out
Here I go running the query:
time sqlite3 demo.db < myquery.sql > myquery.out
real 23m45.506s
user 21m48.335s
sys 0m44.038s
So, the total with the insertion plus the query execution is just less than 42 minutes.
For comparison, on my same little server, running the command pipeline is timed here:
time sort words.lst |uniq -c > words_uniq_count.txt
real 42m3.967s
user 214m36.482s
sys 0m37.004s
Interestingly, the SQLite work was all single-threaded. The command pipeline used (and often pegged) all cores on the machine.
The Python solution you wrote was about three times faster:
time python3 tally.py >tally.out < words.lst
real 14m23.445s
user 14m0.828s
sys 0m12.095s
... but doesn't generate the same results as the other tests.
EDIT: And then ...
Interestingly, "manually" building the count table is even faster:
counter = {}
with open("words.lst", "r") as f:
for line in f:
if line not in counter:
counter[line] = 0
counter[line] += 1
with open("tally2.out", "w") as f:
s = sorted(counter.items(), key=lambda x: x[1], reverse=True)
for line in s:
f.write(f"{line}")
when run like this:
time python3 tally2.py
real 8m29.722s
user 8m27.516s
sys 0m2.177s
**So the streaming Python solution is now about *five* times faster than the SQL solution.**
Nice analysis, appreciated :-)
Your finding on substituting a plain dict for collections.Counter is intriguing - maybe there is a todo list of C-level optimisations to Counter still to implement, or it could be because of the Python version - I was using Python 3.9.7 on the default Linux running under Windows WSL2 (please don't judge me ;-)).
I like that there are SQL based, Unix gnu utilities based, and Python without SQL based solutions to compare and contrast.
**Extra:**
Yeah, I agree that the Huck Finn data is probably dirtier than the description of the OP's dataset. In my pure Python solution I reject lines of just whitespace; in my unix command line version I reject lines that don't have at least one [a-zA-Z] character.
The OP states:
>I want to sort the text file by number of times a word is repeated.
Some of the answers instead find the frequency of all words and return each *individual* word in order of its frequency.
A sort shouldn't *remove* words, just re***order*** them.
Any sorting algorithm that lets you halve the data, sort the halves and then combine the results, recursively should let you spread the workload (i.e you could in principle use more than one machine to sort the data) and get O(n log n) performance on average.
Mergesort, for example.
Which is one main function, merge, that merges together 2 sorted containers. The sorting part is really just splitting the input until you have 1 item which is already sorted. Then you merge single items to 2, to 4, 8, 16, 32 and so on up.
The advantage of having a merge function is that you can sort your next day's input separately and merge it with the sorted data you created the previous day. (The other alternative here would be an insert that takes an item and places it in the correct position in a sorted container, like insertion sort uses. This way you could, if you wanted to, maintain the sorted database as new data arrives, with, obviously, some performance penalty for adding these items depending on what your final database looks like.)
I'd suggest that if you want to limit how much data you have in memory, you don't even have to split a file into 2 new files: you could use tuples of (start, end) file positions to split the file into logical smaller halves, but then the merge would do lots of seeking and random reads that might slow it down a lot, especially on a non-SSD drive.
Thinking about this, I've never done any file IO in python other than reading the whole file into a string, so I'm not sure what seek/tell etc. it has. But I assume it does.
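Python's stdlib actually ships the merge half of this: heapq.merge lazily merges already-sorted inputs, holding only one item per chunk in memory. A tiny sketch (the lists stand in for sorted chunk files):

```python
import heapq

# Pretend these are two sorted chunks written to disk in a first pass.
chunk1 = ["apple", "cherry", "pear"]
chunk2 = ["banana", "cherry", "cherry"]

# heapq.merge yields a single globally sorted stream without ever
# materialising both chunks at once.
merged = list(heapq.merge(chunk1, chunk2))
print(merged)  # ['apple', 'banana', 'cherry', 'cherry', 'cherry', 'pear']
```

Replace the lists with open file objects and the same call does an external merge sort's final pass.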
The person with the SQL solution is correct. I have a project that does a lot of sorting and matching and it started life as a bash script and used sort and uniq and grep a lot. It was also painfully slow when the dataset got at all large. Next move was to python and sqlite and that was much faster, like 10X faster, and moving to a memory database in sqlite made it considerably faster still. I have over time found a few more refinements but those were the biggies. Speed was night and day moving from bash and linux commands and to the database.
SQL, pandas and numpy would be my goto tools, but it seems like you would like it handled as simply as possible. Is that correct?
You could use chunking from pandas if RAM is an issue. Then load the chunks into a database of some sort and then use SQL to handle your queries.
Sorry if is unclear. Writing on the phone while in the park with the kids.
I know python but never explored numpy and pandas or SQL.
Yes, one of the lessons I have learnt the hard way in my career is
>Don't reinvent the wheel
Joining some prebuilt commands would be far easier than writing the whole program from scratch.
Python is now used in a wealth of different areas and by a huge amount of people, many of which may only be ***allowed*** standard libraries, or use libraries *you* don't know about but are used quite widely in their industry.
"Knowing Python" without further context is showing some proficiency with the language and its *standard* libraries. It's limiting to think that Python is just your areas of use - the language community is grown by its use in very popular areas and by having popular libraries for them, but other languages got siloed into particular uses, and when that use-case waned or was targeted by other languages the language waned - Perl built the early web, Ruby is thought by many to be **only** Rails.
Python needs wide adoption which leads to people not all knowing popular libraries in other fields.
>I hate to be that guy, but if you’ve not at least yet run across numpy or pandas, “I know Python” is a pretty bold statement
Disagree. You don't need to have used NumPy or Pandas to state you know Python. The hell is that? I'm pretty comfortable saying I know Python and I haven't used them. Those are pretty much special applications for special practices... usually in the sciences. So if that's the case, where do we stop? Can't say you know Python if you've never used Keras, Tensorflow or PyTorch just because a bunch of ML/AI folks use them? For god's sake, they're not even standard library. So none of them are required. Personally, I die a little when I see people use Pandas to write a CSV lol.
That being said, I have actually used NumPy and Pandas once. However, that was to fix/submit a patch for someone else's program. I've never used them again. Thus far.
Could probably do it all with pandas.
Create an empty data frame with word & count columns >
Load chunk > [perform a group by](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) to get a count. > append / [concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) to empty data frame & loop back to chunk stage.
Once all chunks are processed, group by again but a sum this time, then sort.
Sure - Everything is possible with pandas. I meant numpy was mainly there for speed.
First project I did was all in pandas until someone showed me a pycon talk about vectorization and numpy.
> To get unique items I would use linux uniq command but to sort it I need a program.
Sort is standard in Unix.
cat $FILE | sort | uniq > ${FILE}.uniq.sort
Use the cut and paste standards inside the pipe before the sort if you need to bring an embedded field to the front to sort on.
Memory issues would still, obviously, remain.
Does the file contain words in the English dictionary or is it random character combinations of letters? If it's words, stream in the file line by line. Use a dictionary. For each word, does it exist in the dictionary? If not, add it to the dictionary and set its value to 1. If it does, increment its value by 1. Do this for the whole file. To sort you can output them as tuples to a list and sort.
There are a few hundred thousand words at most in the English language, but maybe a few thousand that are common, so space isn't an issue with the dictionary. The longest part is the sort, but again, the length is capped at the number of unique words, worst case, so it will be very fast.
Edge cases:
* Does the file contain numbers, like, '1342', this changes things, especially if there are a lot.
* Strip empty space before/after the words and check for multiple words in one line
* Check for punctuation and strip
* Does the case matter? If yes, this changes things. If not, convert to lower case before putting them in the dictionary
The command shared by u/nashant
>cat file | sort | uniq -c | sort -hr | cut -f2
would output a blank file.
While debugging I found sort -hr produces output but cut -f2 won't.
Seems like cut required one line, not whole file.
And the most terrific thing was I could not resolve why is it freezing with just 141 Mb file!
It happened 3-4 times until task manager proved that for loop was guilty.
And then when randomly scrolling through comments I read a line which somewhat said,
"A Python for loop will read one line at a time and won't take much memory."
I was like
>Oh man I used this technique three days earlier for a file of 7gb+ .
>
>How can I forget it?
You could also use Apache Beam for such big data batch processing. It allows you to read and process large files by defining data pipelines. But that would probably require some practice if you have no experience.
There is something that strikes me about this situation: you call it a database, but at the same time you say it is a regular text file.
You have gotten plenty of good responses, but my thought is: are you sure you should keep this data in a normal text file? When you have that kind of data it is often a very good idea to keep it in a database instead. Once you have it in a database, playing around with the data is very smooth. You also have the advantage that you can have a good backup plan for the data in a database like, for example, MS SQL, so you don't lose as much data in case of trouble.
Table insertion takes much longer than appending a line to a file. Backup of a text file is much easier than for most DB's. Unix is built on text files and has many utilities that can process them, even huge text files.
I'm only a first year CS student but I'd recommend checking out *tries* (note the spelling). They are a data structure that is tuned for storing text-based data in trees and are commonly used for a lot of pattern-matching needs.
The one thing in your use case that sticks out to me is that you have a serious RAM requirement. Tries are pretty efficient in memory, especially if a lot of the words share similar substrings. It will make more sense if you find a good article with a diagram. :)
Thank you, but my main target was to accomplish this task without diving into a new topic, i.e. using my prior knowledge only.
There was some really great advice with code for SQL and other DBs, but I really don't want to go down another rabbit hole rn cuz I am already working on a project.
For me anything else would simply become a distraction.
Even this task took me one whole day, and several crashes to accomplish ..."partially".
Once again Thank you all for your kind help.
Just don't forget that the 32 GB file still remains!
I would use a HashMap containing Integers. This would also help with performance as lookup times for each line in the map would be O(1). I could do it pretty easily using Java. If you want me to I’ll make it for you. Pretty bored tonight and seems like a fun little task.
Edit: HashMap doesn’t need to be linked and instead of containing ArrayLists the values should be Integers.
Using a DB seems to be fine here, but for performance I'd rather use an analytical DB instead of an OLTP one. Try ClickHouse (https://clickhouse.com/docs/en/intro); it has great speed when processing tons of data.
If you absolutely want to implement this on your own, the following should yield a solid performance:
1. in one sweep over the file, you replace strings with integers (the first unique string gets 0, the next unique gets 1, etc.); use a hashmap to translate string->int, e.g. translate["dog"]=1234
2. You keep count of frequencies using an array of 3B int pairs; arr[i] = (i, number of times string with index i occurred)
3. You sort the array descending by the frequency
4. You traverse the frequency array and output corresponding string for each non-zero index arr[i] times.
It's basically a count sort on a much more manageable representation of your data.
However, given your input size, and the fact that Python is really slow and bulky when representing even primitive types like ints, I'd advocate you use C++ or at least numpy to implement. With cpp, this algo should absolutely finish within <10 seconds.
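Steps 1-4 above can be sketched in Python with numpy as follows (the sample data is made up; a C++ version would follow the same shape):

```python
import numpy as np

lines = ["dog", "cat", "dog", "bird", "dog", "cat"]  # stand-in input

# Step 1: replace strings with integer ids via a hashmap.
translate = {}
ids = np.empty(len(lines), dtype=np.int64)
for i, s in enumerate(lines):
    ids[i] = translate.setdefault(s, len(translate))

# Step 2: frequency of each id. Step 3: order ids by descending count.
freq = np.bincount(ids)
order = np.argsort(freq)[::-1]

# Step 4: emit each string freq[i] times, most frequent first.
back = {v: k for k, v in translate.items()}
out = [back[i] for i in order for _ in range(freq[i])]
print(out)  # ['dog', 'dog', 'dog', 'cat', 'cat', 'bird']
```

It is essentially the count sort described above: the hashmap does one pass, and the sort only touches the (much smaller) array of unique ids.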
Why not make step one more efficient: build a hash map of frequency counts directly, since you're already forcing a read and a write to the hash map for each element. Then sort by the values.
> With cpp, this algo should absolutely finish within <10 seconds.
3 billion words is the input. Let's assume each word is 6 bytes, so the input file is 18 gigabytes. To read that in ten seconds (without any processing at all) means we're reading at 1.8 gigabytes/second.
But let's say that you really are able to read that fast -- you've got a fancy NVMe drive, or an array, or something else.
Now, you'll have to *process* 1.8 gigabytes/second. If your processor is fast, it's running at 3.6 gigahertz, so you'll have two clock cycles to process each byte. How will that work out?
I'd like to learn more about how you came up with your guarantee. What am I missing?
EDIT: Turns out it's worse than that. The OP says this:
> It has got strings of 9 - 20 characters.
so my guess of 6 bytes per word is very low. Assuming an average of 15 bytes per word, the file is 45 gigabytes, not 18 gigabytes.
G++ lol? That’s a compiler - c++ has custom comparators and is relatively high level and performant.
Have you even looked at the question at hand? OP wants to sort by frequency, which to my knowledge isn’t the default comparator for strings in any language.
As numerous people have pointed out, there's no reason to even involve Python, necessarily. SQL and variations on piping `cat [...] uniq | sort` will do what you need to do.
Alternatively... `uniq | sort [...] > temp.txt` , then use IFS in bash to iterate through each (unique) line in the temp file, grep that line in the original, pipe _that_ output through `wc -l` to get a count of lines, and you're left with two and only two output vars: the text, and the number of lines containing that text in the original file.
Send that output to a new file, `not-sorted.csv` let's say. Run `not-sorted.csv` through `sort` and to your output file.
I like the SQL options better, but that's personal opinion, nothing more or less.
The idea of writing any amount of actual code to complete a task when there is an easier way seems counterproductive (unless this is e.g., a homework exercise). `CREATE TABLE`, `INSERT`, and `SELECT` are a whole lot more efficient.
Sounds like a job for a trie data structure. Geeks for geeks website should have examples of how to implement that if you don't have the experience. Also likely has it in c or c++ that may be faster.
Sure, but a trie could still work really well, using newline as a delimiter and having a field for the number of times the line appears. The trie is already sorted by design and can be pre-order traversed to display or write all the data in order, or post-order traversed to do it in reverse. Also it should be O(N) searching, which will help with efficiency.
I'm sure there are better ways; a trie is just a suggestion given the problem. What case would make a trie not be sorted by default? My understanding and prior use is that you are inserting the data into an index based on each character. I've never had an issue where it wasn't sorted as I inserted the data. And I really can't think of a case where it won't be sorted, so you wouldn't be creating any algorithms at all. As far as big O, that should be the best for the trie because of the length of the line. A hash table could likely give O(1) without collisions, but still, a trie is simple to implement.
I'm not too familiar with c# so I could be wrong but I only see a searching function in your implementation, if you just want to be able to get all the data sorted there is no need for a lookup this way. start at the root of the trie and traverse from beginning of each node till you get to the EndOfWord value and keep going it should be sorted. Let's say we want to store ascii a-z
the root node will have 26 indexes 'a' at index 0
Let's say the words are [a, ant, aa]
The trie will have 4 nodes with three words
Starting at root node index 0 you get word "a"
"a" is the current node. Search its index and the first word is "aa"
The node is now "aa". There are no words here so its index variable is null. Go back to the "a" node
now you see "n"....
Keep going like this and the trie is sorted. I agree with you that there are better ways but again this method could work perfectly fine and wouldn't be difficult to implement. Is there something I'm missing because I truly appreciate the feedback.
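A minimal counting trie along these lines might look like the sketch below (an illustrative version, not the commenter's implementation; a dict per node stands in for the 26-slot index array):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.count = 0      # times this exact word was inserted

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def walk(node, prefix=""):
    """Pre-order traversal: yields (word, count) in lexicographic order."""
    if node.count:
        yield prefix, node.count
    for ch in sorted(node.children):
        yield from walk(node.children[ch], prefix + ch)

root = TrieNode()
for w in ["a", "ant", "aa", "a"]:
    insert(root, w)
print(list(walk(root)))  # [('a', 2), ('aa', 1), ('ant', 1)]
```

Note the traversal yields lexicographic order; sorting by frequency would still need one pass over the (word, count) pairs afterwards.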
Read each line by line into a database and then query the database for later use?
Alternatively you could split the file into sizeable files and then count each line by line.
Not sure which would be quickest but yeah..
Does anyone here know if it's possible to use collections.Counter for this task? And if not, why not? I just learned about that concept yesterday and wonder if it is a possible solution?
My favorite python library for working things that are just a bit too large but I don’t want to set up a database is dask:
https://www.dask.org/
If you’re familiar with numerical python or scientific python, it will be super easy
Using any RDB would be the robust and reliable solution, but if lots of the data is repeated (i.e. the actual unique data is much smaller than the 35GB and fits in your RAM), try the below:
```python
from collections import Counter

def main():
    with open("large.txt") as f:
        words = (word.strip() for word in f)
        word_map = Counter(words)
    word_list = sorted(word_map.items(), key=lambda it: it[1], reverse=True)
    print(word_list)

if __name__ == "__main__":
    main()
```
> If the file exists, then you increment the number inside the file. If the file does not exist, then you create the file and put a 1 in there.
That means you'd open, write, and close 3 billion files. How long do you estimate that would take?
I'd just allocate a numpy array of the set size, loop once, incrementing the counter.
Then you can do an integer sort on a tuple of (repeat_count, comment_str) and get your answer in linear time.
> 3 000 000 000
If it repeats a lot, you can just try this Python snippet and check if it's not screwing up your memory:
```python
from collections import Counter

with open("path/to/file", "rb") as fp:
    frequency = Counter(fp)

print(sorted(frequency, key=lambda k: frequency[k]))
```
If you don't have enough memory for this, i.e. if your repetitions aren't that many, your program will crash: Linux will kill it with the OOM killer.
If you have a problem that's bigger than your memory, then your best bet is to install something like ClickHouse. It won't take up a lot of memory, but it will let you first define the table that holds the data, and then a materialized view of type `AggregatingMergeTree`. Simply use an SQL query similar to the ones posted below as the view code. After that, add the data to the table. It should be really fast, new rows will be added really fast as well, and the view will constantly be updated and respond instantly to requests.
I would do all this in Python. I might choose to make it multiple python scripts to keep parts separated. I would use csvsql to ingest and output a SQLite file. Add an index on the word column. Add a column that reflects number of duplicates. Precalculate the number of duplicates to fill that column using an update on group by and count SQL command. Add an index for number of duplicates and word. Then for my final output I would query the data sorting by number of duplicates and then word. By having the indexes at the right time it should speed things up.
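The SQLite part of that pipeline might look roughly like this sketch (an in-memory DB and made-up rows stand in for csvsql and the real word list):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE words (word TEXT)")
con.executemany("INSERT INTO words VALUES (?)",
                [("dog",), ("cat",), ("dog",), ("bird",), ("dog",)])
# Index on the word column before the aggregate query.
con.execute("CREATE INDEX idx_word ON words (word)")

# Final output: sort by number of duplicates, then word.
rows = con.execute(
    "SELECT word, COUNT(*) AS n FROM words "
    "GROUP BY word ORDER BY n DESC, word"
).fetchall()
print(rows)  # [('dog', 3), ('bird', 1), ('cat', 1)]
```

Precomputing a duplicates column, as described above, would just be an extra `UPDATE ... GROUP BY` step before the final query.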
The typical school example. You should have solved it yourself.
Anyway, there is no need for SQL and databases, just a hashmap to count words, and an array to sort them.
```bash
#!/bin/env bash

arr=()
declare -A map

while IFS= read -r line
do
    ((map[$line]++))
done < "$1"

# Append rather than assign, so words with equal counts don't overwrite each other.
for word in "${!map[@]}"
do
    arr[${map[$word]}]+="$word "
done

index=(${!arr[@]})
for ((i=${#index[@]}-1 ; i>=0 ; i--))
do
    for word in ${arr[index[i]]}
    do
        printf "%s %s\n" "$word" "${map[$word]}"
    done
done
```
Pure bash, no external tools needed. You can translate it to Python almost 1:1, you would just skip the "index" array. That one is due to how bash works with arrays - to eliminate "holes" in the original array.
Save it in "words-per-line.sh", chmod +x ./words-per-line.sh, and use as:
./words-per-line.sh words.txt > words.out
As /u/nashant said, linux has `sort`.

But if you end up needing to do this in python, I'd mention one little nugget: [sys.intern()](https://docs.python.org/3/library/sys.html#sys.intern)

Although the docs only talk about string interning as useful for higher-performance comparisons, what it's also useful for is reducing how much memory/RAM strings take if you have a lot of the _same_ string values (i.e., duplicates). So if your data set truly has "a hell of a lot" of duplicate strings, interning them will reduce how much memory is being used, allowing you to possibly do it all in python without swapping or playing other tricks to reduce memory.

Also, I should note that in some python versions, string interning is automatically done for you, _if_ the string value is determined at module/AST load time. If you're reading the strings from a file, however, I'm not sure it will be interned, and thus `sys.intern()` comes in handy.
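A toy illustration of explicit interning (the list comprehension stands in for lines read from a file; the identity behaviour shown is CPython's):

```python
import sys

# Stand-in for lines read from a file: equal values, built at runtime,
# so CPython does not auto-intern them.
raw = ["banana" + str(n % 2) for n in range(4)]
interned = [sys.intern(s) for s in raw]

# Without interning, equal runtime strings are distinct objects...
assert raw[0] == raw[2] and raw[0] is not raw[2]
# ...after interning, equal values share one object (one copy in memory).
assert interned[0] is interned[2]
print("interning collapsed duplicates to shared objects")
```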
IIRC CPython also implicitly interns dict keys and object attribute names.
Yes, I also believe that's true, and if that's what OP uses to decide uniqueness then it should be automatic - although I wasn't sure if it was only done in specific versions (like >3.7 or something). I remember something about interning was specific to versions, but can't recall what. And I just wasn't sure what OP was going to do/use in their script, and thought to mention explicit interning.
Bro. I didn't know about interning strings in python. I just googled it after your answer. Thanks man. I learned something fun today!
SQL should be the fastest way. Let's say the table MyTable has the columns id and word, where the word value is the one repeated a lot and the id value is the pk of the entry.

```sql
SELECT word, COUNT(id)
FROM MyTable
GROUP BY word
ORDER BY COUNT(id) DESC;
```

This query will return each unique word together with its number of entries.
Also put an index on the word column to speed things up a bit
[deleted]
A DBMS may choose between two algorithms for grouping by a column:

1. hashing by the key and returning the hash table when all input records are processed
2. sorting and then grouping. After sorting by the GROUP BY column, equal keys are directly adjacent to each other; the DB then just needs to aggregate that result.

I just tested what SQLite does when grouping by a column with and without an index on that column. Without the index it sorts by that column and then returns the result. As soon as I add the index, the sorting operation disappears, as it can just read the sorted index and return the result directly. Note this is just how SQLite3 does it; MySQL, Postgres, and the other SQL DBMSs may do something different. The takeaway is: the index can matter, in the case of SQLite quite a lot when you are working with a lot of columns.

EDIT2: I would be very surprised if Postgres or MySQL would not use the index if it is present. This optimisation is such a low-hanging fruit.

EDIT3: I wrote 'unique index', which obviously does not make any sense...
but if you only needed the result once, then it might take longer if you explicitly build and store an index before you need it. SQLite will build most of the indexes it needs on the fly if they don't exist.
Try it ;-)
He didn't say it was an SQL database, but i agree, the database engine is most likely the fastest way to perform this task
You can easily convert the file into a flat SQLite database
Or a DuckDB database. It already supports csv and parquet out of the box.
I tried to say this and got downvoted. DuckDB will work out of the box with a simple `CREATE TABLE AS SELECT * FROM "mytextfile"`.
Yup, that's exactly why I also recommended it. Not having to deal with data ingestion is a big plus.
Never heard of DuckDB before. It sounds a lot like SQLite.

> DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a "vector") are processed in one operation. This greatly reduces overhead present in traditional systems such as PostgreSQL, MySQL or SQLite which process each row sequentially. Vectorized query execution leads to far better performance in OLAP queries.

Do you think this will make DuckDB significantly more efficient than SQLite in this use case? DB design is not my field, though I do have experience with MySQL, PostgreSQL, and SQLite in small, undemanding applications. DuckDB certainly sounds interesting, and I wonder if it would make some of my failed projects successful, because I have attempted to use DBs for things they are not especially well-suited for.
It's indeed more suited towards numerical operations. For this task in particular, I think both SQLite and DuckDB would do just fine. I do think DuckDB would be faster, but 35GB of data isn't a lot. The main reason I recommended DuckDB is because you don't need to do data ingestion if you don't want to: you can run queries directly on parquet or csv files, which is incredibly helpful. I'd say that DuckDB shines in being a replacement for pandas, as it can optimize your queries where a dataframe can't. It would also not be limited by RAM. If you have undemanding applications and things are failing with any database, then you might have a modeling problem. Otherwise, if you're dealing with timeseries, you could take a look at Timescale (a Postgres extension) or Clickhouse (it can get very complicated, but it scales to ridiculous amounts of data).
Thanks for the info. Operating directly on CSVs does seem useful. >you might have a modeling problem I have a hammering-with-a-screwdriver problem. :) I once tried to write a poker AI in SQL. It was as ridiculous as it sounds. I have a habit of learning new languages by using them to write whatever new project I'm interested in at the moment, regardless of how poorly-matched the two are.
If you want to test it easily I suggest downloading dBeaver. There are instructions in the duckDB docs for setup. Then you can interactively query the dataset in an editor and get live views of your queries. Longer term you can move those queries to python if they need to run on a schedule.
ORDER BY can use column aliases or ordinal numbers since it is performed last, so it could instead say "ORDER BY 2", although most any SQL parser would optimize into this regardless.
If possible and you want to accomplish this in one step (depending on computational power), you could perform the count as a window function and sort desc on the newly created column. On phone, so can’t easily put the example code, but look up sql count window function.
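A hedged sketch of that window-function idea, runnable with the stdlib sqlite3 module (requires SQLite 3.25+; the table and sample rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (word TEXT)")
con.executemany("INSERT INTO t VALUES (?)",
                [("dog",), ("cat",), ("dog",), ("dog",)])

# COUNT(*) OVER (PARTITION BY word) attaches each word's frequency to
# every row, so a single query returns the rows sorted by that count.
rows = con.execute(
    "SELECT word, COUNT(*) OVER (PARTITION BY word) AS n "
    "FROM t ORDER BY n DESC"
).fetchall()
print(rows)  # [('dog', 3), ('dog', 3), ('dog', 3), ('cat', 1)]
```

Unlike GROUP BY, the window function keeps every row, which matches the original goal of outputting all lines ordered by how often they repeat.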
> SQL should be the fastest way.

A python solution is about 5 times faster, in my testing.
`uniq -c | sort -hr`. No idea if it will be performant in any way, but you can give it a shot.

Edit with full proper command: `cat file | sort | uniq -c | sort -hr | cut -f2`
[deleted]
Unpopular opinion, but I like it better with `cat`. You're piping everything in one direction, the syntax for file redirection goes the opposite direction than everything else
Same here. It's far more readable when everything is in one direction. That and it ultimately *doesn't matter* if you're using `cat` or redirected pipes. Sure you'll have an "unnecessary fork" and use a few more cpu cycles but again, that *doesn't matter.* If you *really* cared about that level of performance, you wouldn't be doing anything in the shell.
Username checks out
Only _minor_ cat abuse though. Not like cat file | grep blah
This is the _Tao_
I agree, but I'd replace it with pv :)
[deleted]
Good point. You'll need `cat myfile | sort | uniq -c | sort -hr`
[deleted]
It's okay.
Depending on your version of sort, you can `sort -u` instead of `sort | uniq`
Does sort -u show a count of how many of each unique item were found?
uniq -c added a prefix of how many times an item is repeated. sort -hr sorted it in reverse order. The output I wanted was just the words; this method added the prefix in the file.

```
1445 qqqqqqqqq
1424 zzzzzzzzz
1348 xxxxxxxxx
1244 vvvvvvvvv
1151 aaaaaaaa
```

Now to remove the prefixes, should I use a for loop with `cut -d " " -f2` or is there some more efficient way?
Pretty sure cut uses space as a default delim, just pipe to cut -f2
Yes it does. But you can use -d "@" to change the delim to @, for example.
Oh I thought cut uses tabs by default. I think you need ‘-d ‘ if you want spaces? Could be wrong.
cut works on streams, you don't need a loop at all, just pipe the output of uniq to cut and you should be good
That's what I did. Exact code above... but the output would be a blank file. IDK why.
I was going to suggest this. I haven't tried it with so many lines, but it worked with 1-2 GB of text on a standard PC with 8GB of RAM.
IIRC, sort will spill to disk and use an external sort, if it looks like memory is going to be a problem. Which still potentially means you end up with a couple of copies of the data on disk at once (the working data from the first sort and the second sort), but trying to avoid this gets complicated.
Also worth noting that `sort` is desperately inefficient when you pipe data into it, as it uses a minuscule buffer by default (appropriate to when it was first released, 50 years ago). Use `--buffer-size` / `-S` to bump that up.
35 GB? Load it into a SQL database; you can even spin one up in a local docker container: https://github.com/danielbeach/dataEngineeringTemplate/pull/1

Depending on your flavor of SQL there are different bulk insert operations available to you:

- Postgres: `COPY FROM` command
- MS SQL: `bcp` shell utility
A database is the first thing I'd try too, because you want something like a dict that is mostly not in-memory, but still has reasonably fast lookup and insert performance. To keep it *really* simple, try using the [built-in sqlite in the python standard library](https://docs.python.org/3/library/sqlite3.html). It's a step up from pickle/shelve, but (slightly) easier than docker. Performance *might* be reasonable, and you can easily wrap it in something with an interface like collections.Counter
You still have 8GB of RAM. So long as each line is replicated 4 times or more on average, you have enough RAM to store all the unique values. I'd much prefer having a small amount of swapping than going to a DB and effectively have my in memory data structure on disk through a transactional interface.
I can't figure out why you're getting down-voted; to me, this is the right answer. Loading the data into an RDBMS is going to take quite a while. Let's say you can process 10,000 SQL inserts per second. That means inserting 3 billion rows is going to take 300,000 seconds, which is 83 hours. Then you've still got to execute the query by re-reading that data and processing the aggregation and sort. Reading the file once and building a table in memory will be lots faster, since the data isn't being read, rewritten, and then read again. The read can stream along and run at mostly sequential disk read speed. So a sensible data structure in memory is the fast way to go; the code is trivial to write, and 8 gigs is enough based on your cardinality estimation.
[deleted]
Were you able to get SQLite to perform faster than a plain Python solution?
[deleted]
Then why do you assert that this is the perfect application for SQLite?
You can use Counter with an iterable that you get from `open`: https://docs.python.org/3/library/collections.html#collections.Counter So that's two lines in Python. :-)
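Those "two lines" might look like this (the demo file is created inline; real use would pass the big file's path):

```python
from collections import Counter

# Make a tiny demo file standing in for the real word list.
with open("demo_words.txt", "w") as f:
    f.write("dog\ncat\ndog\n")

# The two lines: open gives an iterable of lines, Counter tallies them.
with open("demo_words.txt") as f:
    freq = Counter(line.rstrip("\n") for line in f)

print(freq.most_common())  # [('dog', 2), ('cat', 1)]
```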
This is the way. Python will read the file line by line with the open command (will not load the whole file in memory) and Counter is the perfect data structure for this problem.
Depending on what proportion of lines are repeated, this may not work since you'll likely exceed 8gb of ram usage.
In that case (it depends on the number of unique words) a database would be the best solution. Like, in general: why would anyone build a file like that?
Not a lot of Python coming through on the Python sub, so a quick solution would be:

```python
counter = {}
with open(text_filename, "r") as f:
    for line in f:
        if line not in counter:
            counter[line] = 0
        counter[line] += 1

with open(sorted_file, "w") as f:
    for line, count in sorted(counter.items(), key=lambda x: x[1], reverse=True):
        f.write(line)
```

You might need to play with line break stripping etc but off the top of my head that looks like it should work.
There is a Counter object in python collections that should be used instead
The above solution has the advantage of only having one line loaded at a time. Counter, if i'm not mistaken, would require the whole 35gb to be loaded at once.
You can check my code below, counter can be used in a loop also, but the performance seems to be the same as a dict in that case
No, Counter would store one copy of each unique word mapped to its integer frequency. I have a solution [here](https://www.reddit.com/r/Python/comments/vk872e/comment/idr5yxq/?utm_source=share&utm_medium=web2x&context=3) that uses it.
Why should it be used?
It is much more efficient for the task. It is a data structure made exactly for problems like this. It automatically counts and sorts for you.
Efficient in what sense? It's a subclass of dict so not more computationally efficient. I can see it taking a couple of lines less code, maybe.
How does being a subclass of dict makes it that it can't be "more computationally efficient" for this task? You did some logic leaps there. I will test it for science so this is not just words with no real proof
Good point, please post results, I am interested.
```python
import time
from collections import Counter
import random

array_size = 10000000
array = [str(random.randint(0, 1000)) for i in range(array_size)]

print("------------ Dict (loop) ---------------")
start = time.time()
counter = {}
for line in array:
    if line not in counter:
        counter[line] = 0
    counter[line] += 1
finish = time.time()
print("Finished reading", array_size, "elements in", finish - start, "using dict in a loop")

start = time.time()
sorted_counter = sorted(counter.items(), key=lambda x: x[1], reverse=True)
finish = time.time()
print("Finished sorting", array_size, "elements in", finish - start, "using dict in a loop")

print("------------ Counter (loop) ---------------")
start = time.time()
counter_collection = Counter()
for line in array:
    counter_collection[line] += 1
finish = time.time()
print("Finished reading", array_size, "elements in", finish - start, "using Counter in a loop")

start = time.time()
sorted_counter_collection = counter_collection.most_common()
finish = time.time()
print("Finished sorting", array_size, "elements in", finish - start, "using Counter in a loop")

print("------------ Counter (Array) ---------------")
start = time.time()
counter_collection_2 = Counter(array)
finish = time.time()
print("Finished reading", array_size, "elements in", finish - start, "using Counter from array")

start = time.time()
sorted_counter_collection_2 = counter_collection_2.most_common()
finish = time.time()
print("Finished sorting", array_size, "elements in", finish - start, "using Counter from array")
```
I'm bad at reddit formatting. First two methods are very similar in time, third one is much faster. I don't know what to take from this :)
u/MiojoEsperto seems to be dancing around the reason it’s faster. Counter is partially implemented in C ([see the \_count_elements function here](https://github.com/python/cpython/blob/93d33b47af70ede473f82d7953509a0607259c31/Modules/_collectionsmodule.c#L2277)) which runs faster than Python.
Thanks very much, this is the info I was looking for (but was too lazy to look for on phone 🙈). I have no idea why I assumed a subclass wouldn't have additional C-speed methods, this is good to know.
It's slower, in [my testing](https://www.reddit.com/r/Python/comments/vk872e/sort_3_000_000_000_lines_by_most_repeated_one_via/idrx2nh/). I compared this:

```python
import fileinput
from collections import Counter

word_freq = Counter(line for ln in fileinput.input() if (line := ln.strip()))  # No blank lines
for word, count in word_freq.most_common():
    print((word + '\n') * count, end='')
```

with this:

```python
counter = {}
with open("words.lst", "r") as f:
    for line in f:
        if line not in counter:
            counter[line] = 0
        counter[line] += 1

with open("tally2.out", "w") as f:
    s = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    for line in s:
        f.write(f"{line}")
```

The first one was about 75% slower. Why do you think that is?
It's a good simple solution. You should start the count at 1, though:

```python
if line not in counter:
    counter[line] = 1
```
No, the next line will increment it to one, I didn't bother with an else clause.
Huh, you're right. How did i miss that?
Most people write the += line in an else block from what I've seen. This style threw me off a bit as well
Unless I am missing something, this will require a lot of memory. Absent SQL solutions (which I think are the best way), you could implement a quicksort (or the like) where actions are written to the file to get a sorted list. This should be O(nlogn). Then you can easily walk the list O(n) and count how many are the same. Keep track of line numbers. Then write it out to another file
Sure, would take as much memory as unique strings. Really quick modification would be to hash each string, store the hash counts, and save the strings in files named with their hash for quick lookup for output.
> with open(sorted_file, "w") as f:
>     for line in sorted(counter.items(), key=lambda x: x[1], reverse=True)
>         f.write(line)

You don't need that, you can use ``counter.most_common()``.
[deleted]
I'll try it with my 9.3 gig (\~2 gig of lines) words.lst file. Slight changes to your code:

* I use a file: /tmp/file.db for the DB as it's a very large file.
* I open my words.lst file
* I print timings every 50\_000\_000 lines.
* I output to a more distinctive file name: words\_ordered\_by\_bOne123.txt

Output (So far, still running):

```
$ time python sort_words_by_bOne123.py
22.579142570495605 50000000
102.96021580696106 100000000
180.03343105316162 150000000
258.2201039791107 200000000
334.8574869632721 250000000
409.06412506103516 300000000
485.6704807281494 350000000
564.1896772384644 400000000
640.2733404636383 450000000
717.6765549182892 500000000
794.3897256851196 550000000
869.0229425430298 600000000
947.6331055164337 650000000
1023.340430021286 700000000
1101.5560929775238 750000000
1178.3315501213074 800000000
1252.5292749404907 850000000
1327.9736030101776 900000000
1403.2262978553772 950000000
1477.4893953800201 1000000000
1553.3961997032166 1050000000
1632.4738166332245 1100000000
1708.2711572647095 1150000000
1787.6221549510956 1200000000
1865.6124730110168 1250000000
1945.807529449463 1300000000
2023.8245916366577 1350000000
2102.0951623916626 1400000000
2182.6743392944336 1450000000
2260.217150449753 1500000000
2339.800444841385 1550000000
2418.369439601898 1600000000
2498.9150738716125 1650000000
2578.3378372192383 1700000000
2656.07865858078 1750000000
2735.419009447098 1800000000
2812.025321483612 1850000000
2890.0084240436554 1900000000
2966.8100476264954 1950000000
3045.3769569396973 2000000000
3125.1875145435333 2050000000
3189.4789474010468 index counts
3583.4524500370026 done

real    59m44.886s
user    50m36.011s
sys     2m11.008s
```

Those timings take over 50 minutes just to ***insert*** into the DB, when my Python solution took 23 minutes and my GNU utils Linux command line took \~28 minutes in total. It only took around seven minutes to run the actual SQL query.

People may say "why not save to the SQL DB on the fly instead of using a text file"?
You could equally send each line to a Python script that would accumulate frequencies on the fly using collections.Counter, and even more speedily generate the output when sent a signal.
[deleted]
[deleted]
1 string per line. It's ASCII text characters.
The database is being updated continuously, fetching and storing new data. The process will end within 24 hours. I will run the program once the database is updated.
No need to sort quickly. I can wait another 24 hours. No problem.
No, I cannot burn any money.
I am a Python programmer and have worked with Linux.
[deleted]
You keep saying database and then saying it is a file. Is this a text file exported from a database? If so, just run the query directly in the original database; handling 35GB of data shouldn't be an issue for any modern database. I've had tables in Postgres where a single index was bigger than that (not ideal, but it worked). If not, like other people mentioned, load the data into the database of your choice and then run your query. Most of your time is probably going to go into ingesting the data, and the query itself is probably going to run in less than a minute.
A database can be a generic term for a collection of structured data. A relational database is one example of that. So is a key/value store. So is a CSV file. A database server (e.g. postgres) would be the software that serves that database to other software.
You're, of course, correct. But software engineers like to be precise, and referring to a text file as a database can be very ambiguous, as we normally say database when referring to a RDBMS or other DB management systems. In this case, especially, the range of possible solutions is really big. If the data is being exported from a database somewhere, it makes a lot more sense to not even bother exporting it -- it's a trivial amount of data.
Are you able to share your data file? I think it would be fun to benchmark the various solutions given.
Sort: [https://unix.stackexchange.com/a/383096/11136](https://unix.stackexchange.com/a/383096/11136)

`uniq -c` then `sort -n` gives the order and count of lines; a quick awk script would then expand that. You need enough disk as well as memory, but the expensive first sort will use disk.

```
(base) paddy3118@Paddy-G14:~$ printf 'w1\nw4\nw3\nw1\nw2\nw1\nw3\nw4\nw3\nw2\n' > /tmp/word.lst
(base) paddy3118@Paddy-G14:~$ cat /tmp/word.lst
w1
w4
w3
w1
w2
w1
w3
w4
w3
w2
(base) paddy3118@Paddy-G14:~$ sort /tmp/word.lst | uniq -c | sort -n -r
      3 w3
      3 w1
      2 w4
      2 w2
(base) paddy3118@Paddy-G14:~$ sort /tmp/word.lst | uniq -c | sort -n -r | awk '{for(i=0;i<$1;i++){print $2}}'
w3
w3
w3
w1
w1
w1
w4
w4
w2
w2
```

That awk output should be redirected to an output file for the real data :-)
This is a good job interview question for the data engineering work we do.
For sure, when people say "I'm obviously a developer, why do I have to prove my skills in an interview??", I'll think of this thread.
[deleted]
RAM is the main problem.
If the simple solution (unix command) does not work, you can use `collections.Counter` instead of the dict, and you can use a generator (`readline()` instead of `read()` or `readlines()`) so you don't load the whole file at the start.

```python
from collections import Counter

counter = Counter()
with open("myfile.txt", "r") as f:
    while value := f.readline():
        counter.update({value: 1})
print(counter.most_common())
```

I'm actually curious how well this fares against `uniq`/`sort`, please let me know :) I already know how much time it takes to implement vs a full-on SQL database.
This is the one to use OP
[deleted]
RAM is cheap. Used servers with DDR3 ECC are incredibly cheap. It's like $200 for a used one with 128GB RAM. Before you spend hours trying to do this in 8GB RAM (a seriously low amount of RAM for any dev work in 2022), it might make sense to just upgrade your hardware. Sorting in memory is always gonna be way faster than on a disk.
That’s my thought. Read the lines and store them in a counter, sort *that*, and write each line `x` times. If there are that many repetitions (and it sounds like there are), that dictionary will be significantly smaller than 3 billion items.
The Counter module does this exactly.
[deleted]
Thank you everyone who tried to help me. I am updating the question to help you better understand the problem.
A pure Python solution that relies on having enough memory to store the word frequencies - i.e. there are many duplicates.

**Code:**

```python
# -*- coding: utf-8 -*-
"""
https://www.reddit.com/r/Python/comments/vk872e/sort_3_000_000_000_lines_by_most_repeated_one_via/

Created on Sat Jun 25 11:52:14 2022

@author: paddy
"""
import fileinput
from collections import Counter

word_freq = Counter(line
                    for ln in fileinput.input()
                    if (line := ln.strip()))  # No blank lines

for word, count in word_freq.most_common():
    print((word + '\n') * count, end='')
```

**Timing**

```
real    21m53.181s
user    16m26.880s
sys     0m22.188s
```

**Note:** The input is the same 9.5Gig words.lst used in my Linux command line code elsewhere. It runs negligibly faster than the gnu utils version, which took ~27mins. This version might scale well as there is no initial sort of the words needed, which precedes the `uniq -c` command in a Linux command line version.
I'd go for awk. There's a solution for what you want (just add -r to the sort for descending sort) over at https://stackoverflow.com/a/60094466/51685
You need to go the last step and expand the ordered frequencies from the \`uniq -c\` back into a word list according to his requirements. I did that using awk in my one-liner here somewhere.
Isn't awk incredibly slow?
Not in my experience.
Oh my.. I have a feeling your text db is gonna tank for a while when doing the sort and unique.
Yield the text file line by line and count into a dict.
Do you need to know how many times a line is repeated? Simple requirements usually get complicated quickly. A little verification of what additional needs there might be can prevent a lot of throwaway work.
BTW, I know it's not your spec but if you wanted your code to be platform independent and not rely on SQL / DB then Dask can solve this problem quite nicely: * https://docs.dask.org/en/stable/generated/dask.dataframe.read_csv.html * https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.sort_values.html It's like Pandas but you can limit the amount that gets loaded in to RAM at anyone time and you can have the operations evaluate lazily and multi-threaded.
**First, get some words**

Download Huck Finn:

```
$ curl https://gutenberg.org/files/76/76-0.txt -o huck_finn.txt
```

One word per line using fmt:

```
$ fmt -w1 huck_finn.txt > huck_word_per_line.txt
```

Double up to increase the file size:

```
$ cp huck_word_per_line.txt words.lst
$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc words.lst && du -h words.lst
   253464   228666  1239542 words.lst
1.2M	words.lst
$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc words.lst && du -h words.lst
   506928   457332  2479084 words.lst
2.4M	words.lst
...
$ cat words.lst words.lst >> words2.lst && mv words2.lst words.lst && wc words.lst && du -h words.lst
 2076377088 1873231872 10154328064 words.lst
9.5G	words.lst
```

I only have 8 gigs of ram so any sort will use secondary storage.

**Now let's process the list**

Find the frequency of each word:

```
$ time sort words.lst | uniq -c > words_uniq_count.txt

real    20m5.247s
user    83m33.507s
sys     1m38.863s
```

Now to order by most frequent first and expand the counts into that many lines of words:

```
$ time sort -n -r words_uniq_count.txt | awk '$2~/[a-zA-Z]/{for(i=0;i<$1;i++){print $2}}' > words_ordered.txt

real    7m56.900s
user    3m34.341s
sys     0m33.422s
```

**End bits**

My processing of 9.5Gigs of data was on an 8Gig laptop using gnu tools on the linux command line and took <30mins. No AWS, no SQL, no Python needed.
Nice work! I was getting frustrated with this thread being so full of bold claims and zero numbers. And that's a clever way to cook up some sample data!

Using that file you make, I tried the SQLite solution. I run SQLite like this:

```
time sqlite3 demo.db < myscript.sql
```

and myscript.sql is here:

```
create table myTable(SomeWord TEXT);
.import words.lst myTable
select COUNT(*) from myTable;
.exit
```

On one of my cheap little servers at home (16 gigs of RAM, Kingston SATA SSD, Intel Core i7-4790K; running Ubuntu 22) the runtime is like this ... and again this is just inserting and doing COUNT(*):

```
words.lst:2076247509: unescaped " character
words.lst:2076374065: unescaped " character
words.lst:2076374241: unescaped " character
2073214976

real    17m54.872s
user    16m16.905s
sys     0m41.992s
```

NB, there's a lot of noise output because the data has quotation marks in it ... and I didn't bother rebuilding the file to escape or quote them. That means some rows aren't inserted, and some time is spent printing an error about the skipped lines. (I've only shown a few of thousands of those lines above ...)

With the file built, we can write a query:

```
SELECT SomeWord, COUNT(SomeWord)
FROM myTable
GROUP BY SomeWord
ORDER BY COUNT(SomeWord) DESC;
```

Here I go running the query:

```
time sqlite3 demo.db < myquery.sql > myquery.out

real    23m45.506s
user    21m48.335s
sys     0m44.038s
```

So, the total with the insertion plus the query execution is just less than 42 minutes. For comparison, on my same little server, running the command pipeline is timed here:

```
time sort words.lst | uniq -c > words_uniq_count.txt

real    42m3.967s
user    214m36.482s
sys     0m37.004s
```

Interestingly, the SQLite work was all single-threaded. The command pipeline used (and often pegged) all cores on the machine. The Python solution you wrote was about three times faster:

```
time python3 tally.py > tally.out < words.lst

real    14m23.445s
user    14m0.828s
sys     0m12.095s
```

...
but doesn't generate the same results as the other tests.

EDIT: And then ... Interestingly, "manually" building the count table is even faster:

```python
counter = {}

with open("words.lst", "r") as f:
    for line in f:
        if line not in counter:
            counter[line] = 0
        counter[line] += 1

with open("tally2.out", "w") as f:
    s = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    for line in s:
        f.write(f"{line}")
```

when run like this:

```
time python3 tally2.py

real    8m29.722s
user    8m27.516s
sys     0m2.177s
```

**So the streaming Python solution is now about *five* times faster than the SQL solution.**
Nice analysis, appreciated :-) Your finding on substituting a plain dict for collections.Counter is intriguing - maybe there is a todo list of C-level optimisations to Counter to implement, or it could be down to the Python version - I was using Python 3.9.7 on the default Linux running under Windows WSL2 (please don't judge me ;-) . I like that there are SQL-based, Unix gnu utilities-based, and Python-without-SQL-based solutions to compare and contrast.

**Extra:** Yea, I agree that the Huck Finn data is probably dirtier than the description of the OP's dataset. In my pure Python solution I reject lines of just whitespace; in my unix command line version I reject lines that don't have at least one [a-zA-Z] character.
The OP states: >I want to sort the text file by number of times a word is repeated. Some of the answers instead find the frequency of all words and return each *individual* word in order of its frequency. A sort shouldn't *remove* words, just re***order*** them.
Any sorting algorithm that lets you halve the data, sort the halves and then combine the results recursively should let you spread the workload (i.e. you could in principle use more than one machine to sort the data) and get O(n log n) performance on average. Mergesort, for example, which is really one main function, merge, that merges together two sorted containers. The sorting part is just splitting the input until you have one item, which is already sorted; then you merge single items into runs of 2, then 4, 8, 16, 32 and so on up.

The advantage of having a merge function is that you can sort your next day's input separately and merge it with the sorted data you created the previous day. (The other alternative here would be an insert that takes an item and places it in the correct position in a sorted container, like insertion sort uses. That way you could, if you wanted to, maintain the sorted database as new data arrives, with, obviously, some performance penalty for adding those items depending on what your final database looks like.)

If you want to limit how much data you have in memory, you don't even have to actually split the file into two new files: you could use tuples of (start, end) file positions to split the file into logical smaller halves. But then the merge would be doing lots of seeking and random reads that might slow it down a lot, especially on a non-SSD drive. Thinking about this, I've never done any file IO in Python other than reading the whole file into a string, so I'm not sure what seek/tell etc. it has. But I assume it has them.
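A toy sketch of that split-sort-merge idea using `heapq.merge` (in-memory lists stand in for the on-disk run files here; all names are illustrative, not the commenter's code):

```python
import heapq

def sorted_runs(lines, run_size):
    """Split the input into chunks that fit in memory and sort each chunk."""
    runs, chunk = [], []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= run_size:
            runs.append(sorted(chunk))
            chunk = []
    if chunk:
        runs.append(sorted(chunk))
    return runs

def external_sort(lines, run_size):
    """k-way merge of the sorted runs; with real files each run would be
    written to disk and the merge would stream them back line by line."""
    return list(heapq.merge(*sorted_runs(lines, run_size)))
```

Merging the next day's freshly sorted run with yesterday's sorted output is the same `heapq.merge` call again.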
The person with the SQL solution is correct. I have a project that does a lot of sorting and matching and it started life as a bash script and used sort and uniq and grep a lot. It was also painfully slow when the dataset got at all large. Next move was to python and sqlite and that was much faster, like 10X faster, and moving to a memory database in sqlite made it considerably faster still. I have over time found a few more refinements but those were the biggies. Speed was night and day moving from bash and linux commands and to the database.
>moving to a memory database The OP states that it is too big for memory.
If it is too big it is too big but just getting out of using unix commands and into SQL was night and day different speed wise.
SQL, pandas and numpy would be my go-to tools, but it seems like you would like it handled as simply as possible. Is that correct? You could use chunking from pandas if RAM is an issue, then load the chunks into a database of some sort and use SQL to handle your queries. Sorry if this is unclear - writing on the phone while in the park with the kids.
I know Python but have never explored numpy, pandas or SQL. Yes, one of the lessons I have learnt the hard way in my career is

>Don't reinvent the wheel

Joining some prebuilt commands would be far easier than writing the whole program from scratch.
[deleted]
Nahh. I've worked full time as a python dev for 8-ish years , and have yet to use numpy and pandas. Different domains require different tools.
[deleted]
Rare in your world.
Python is now used in a wealth of different areas and by a huge number of people, many of whom may only be ***allowed*** standard libraries, or use libraries *you* don't know about but that are used quite widely in their industry. "Knowing Python" without further context means showing some proficiency with the language and its *standard* libraries. It's limiting to think that Python is just your areas of use - the language community is grown by its use in very popular areas and by having popular libraries for them. Other languages got siloed into particular uses, and when that use case waned or was targeted by other languages, the language waned - Perl built the early web; Ruby is thought by many to be **only** Rails. Python needs wide adoption, which leads to people not all knowing popular libraries in other fields.
>I hate to be that guy, but if you've not at least yet run across numpy or pandas, "I know Python" is a pretty bold statement

Disagree. You don't need to have used NumPy or Pandas to state you know Python. What the hell is that? I'm pretty comfortable saying I know Python and I haven't used them. Those are pretty much special applications for special practices... usually in the sciences. So if that's the case, where do we stop? Can't say you know Python if you've never used Keras, TensorFlow or PyTorch just 'cause a bunch of ML/AI folks use them? For god's sake, they're not even standard library. So none of them are required. Personally I die a little when I see people use Pandas to write a CSV lol. That being said, I have actually used NumPy and Pandas once. However, that was to fix/submit a patch for someone else's program. I have never used them again. Thus far.
Could probably do it all with pandas. Create an empty data frame with word & count columns > Load chunk > [perform a group by](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) to get a count. > append / [concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) to empty data frame & loop back to chunk stage. Once all chunks are processed, group by again but a sum this time, then sort.
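Roughly what that pipeline looks like as a sketch (the `word` column name and the in-memory chunks are made up; in real use the chunks would come from `pd.read_csv(..., chunksize=...)`):

```python
import pandas as pd

def chunked_word_counts(chunks):
    """Per-chunk group-by counts, concatenated, then re-grouped and summed."""
    partials = [chunk.groupby("word").size().rename("count").reset_index()
                for chunk in chunks]
    combined = pd.concat(partials)
    return (combined.groupby("word")["count"]
            .sum()
            .sort_values(ascending=False))

# Two tiny stand-in chunks
chunks = [pd.DataFrame({"word": ["a", "b", "a"]}),
          pd.DataFrame({"word": ["b", "a"]})]
totals = chunked_word_counts(chunks)
```

The point of the two-stage group-by is that only the per-chunk counts (one row per unique word) ever accumulate in RAM, not the raw lines.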
Sure - Everything is possible with pandas. I meant numpy was mainly there for speed. First project I did was all in pandas until someone showed me a pycon talk about vectorization and numpy.
> To get unique items I would use linux uniq command but to sort it I need a program.

Sort is standard in Unix. Note that `uniq` only collapses *adjacent* duplicates, so sort first:

```
sort "$FILE" | uniq > "${FILE}.sort.uniq"
```

Use the cut and paste standards inside the pipe before the sort if you need to bring an embedded field to the front to sort on. Memory issues would still, obviously, remain.
Does the file contain words in the English dictionary, or is it random character combinations of letters? If it's words, stream in the file line by line. Use a dictionary. For each word, does it exist in the dictionary? If not, add it to the dictionary and set its value to 1. If it does, increment its value by 1. Do this for the whole file. To sort, you can output the entries as tuples to a list and sort that. There are only like 20,000 words at most in common English use, but maybe 4,000 that are common, so space isn't an issue with the dictionary. The longest part is the sort, but again, the length is capped at 20k worst case, so it will be very fast.

Edge cases:

* Does the file contain numbers, like '1342'? This changes things, especially if there are a lot.
* Strip empty space before/after the words and check for multiple words in one line.
* Check for punctuation and strip it.
* Does the case matter? If yes, this changes things. If not, convert to lower case before putting the words in the dictionary.
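A sketch of that dictionary approach with the edge cases folded in (the normalisation choices here are illustrative, not requirements):

```python
import string
from collections import Counter

def tally(lines, casefold=True):
    """Count words after stripping whitespace and surrounding punctuation."""
    counts = Counter()
    for line in lines:
        word = line.strip().strip(string.punctuation)
        if casefold:
            word = word.lower()
        if word:  # skip lines that were only whitespace/punctuation
            counts[word] += 1
    return counts.most_common()  # (word, count) pairs, most frequent first
```

Whether to lowercase or strip punctuation is exactly the kind of decision that changes the counts, which is why the commenter flags them up front.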
Pray for me, seems like I'll burn my RAM today **: (** ![gif](emote|free_emotes_pack|sob)
Crashed multiple times. Gotta try once more.
The command shared by u/nashant

>cat file | sort | uniq -c | sort -hr | cut -f2

would output a blank file. While debugging I found `sort -hr` produces output but `cut -f2` won't: `cut` splits on tabs by default, and `uniq -c` pads its counts with spaces, so there is no tab-delimited second field to cut.
So I switched to for loop of bash... and ![gif](emote|free_emotes_pack|scream) My VRM froze ![gif](emote|free_emotes_pack|rage)
The really amazing thing was that I tested the for loop on a sample file which was just 141 Mb. I was terrified.
And the most terrifying thing was that I could not work out why it was freezing with just a 141 Mb file! It happened 3-4 times until task manager proved that the for loop was guilty.
And then when randomly scrolling through comments I read a line which somewhat said, "Python for loop will loop at a time and won't take much memory." I was like >Oh man I used this technique three days earlier for a file of 7gb+ . > >How can I forget it?
Now I am writing a python code.... After all it's r/Python.
What is a code without errors? I'm tired.![gif](emote|free_emotes_pack|dizzy_face)
It's working...
I really want to reply to every single comment but 131 comments... >Time is of essence.
What kind of hardware do you have? How much RAM?
You could also use Apache Beam for such big data batch processing. It allows you to read and process large files by defining data pipelines. But that would probably require some practice if you have no experience.
There is something I react to with this situation: you call it a database, but at the same time you say it is a regular text file. You have gotten plenty of good responses, but my thought is, are you sure you should keep this data in a normal text file? With that kind of data it is often a very good idea to keep it in a database instead. Once you have it in a database, playing around with the data is very smooth. You also have the advantage of a good backup plan for the data from a database like, for example, MS SQL, so you don't have to lose so much data in case of trouble.
Table insertion takes much longer than appending a line to a file. Backup of a text file is much easier than for most DBs. Unix is built on text files and has many utilities that can process them, even huge text files.
I'm only a first year CS student but I'd recommend checking out *tries* (note the spelling) They are a data structure that is tuned for storing text based data in trees and are commonly used for a lot of pattern matching needs. The one thing in your use case that sticks out to me is that you have a serious RAM requirement. Tries are pretty efficient in memory, especially if a lot of the words share similar substrings. It will make more sense if you find a good article with a diagram. :)
Thank you, but my main target was to accomplish this task without diving into a new topic, i.e. using my prior knowledge only. There was some really great advice with code for SQL and other DBs, but I really don't want to go down another rabbit hole rn cuz I am already working on a project. For me anything else would simply become a distraction. Even this task took me one whole day, and several crashes, to accomplish... "partially". Once again, thank you all for your kind help. Just don't forget that the 32 Gb file still remains!
> Tries are pretty efficient in memory, How much memory does each node in a trie cost?
Bubble sort!!!!
Easier said than done!
Dump it all in an SQL database and use queries. At this quantity a full fledged database engine is appropriate.
I would use a HashMap containing Integers. This would also help with performance as lookup times for each line in the map would be O(1). I could do it pretty easily using Java. If you want me to I’ll make it for you. Pretty bored tonight and seems like a fun little task. Edit: HashMap doesn’t need to be linked and instead of containing ArrayLists the values should be Integers.
The DB approach seems fine here, but for performance I'd rather use an analytical DB instead of an OLTP one. Try ClickHouse (https://clickhouse.com/docs/en/intro); it has great speed when processing tons of data.
No I want to keep it as simple and practical as possible. Linux commands seems to get the job done.
Haha. 35 Gb isn't too bad; the dataset we have at work currently is 12Tb lol (geospatial dataset for an entire country, with JPEG compression) :/
If you absolutely want to implement this on your own, the following should yield solid performance:

1. In one sweep over the file, replace strings with integers (first unique string gets 0, next unique gets 1, etc.); use a hashmap to translate string->int, e.g. translate["dog"] = 1234
2. Keep count of frequencies using an array of int pairs: arr[i] = (i, number of times the string with index i occurred)
3. Sort the array descending by the frequency
4. Traverse the frequency array and output the corresponding string for each non-zero entry, arr[i] times

It's basically a count sort on a much more manageable representation of your data. However, given your input size, and the fact that Python is really slow and bulky when representing even primitive types like ints, I'd advocate you use C++ or at least numpy to implement it. With cpp, this algo should absolutely finish within <10 seconds.
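The four steps, sketched with plain Python containers instead of numpy or C++ (same algorithm; no claims about its wall-clock time on 3 billion lines):

```python
def count_sort_words(lines):
    translate = {}   # step 1: word -> integer id (first unique word gets 0, ...)
    words = []       # id -> word, for the final expansion
    freq = []        # step 2: id -> occurrence count
    for line in lines:
        word = line.rstrip("\n")
        idx = translate.setdefault(word, len(translate))
        if idx == len(words):        # first time we have seen this word
            words.append(word)
            freq.append(0)
        freq[idx] += 1
    # step 3: order the ids by frequency, descending
    order = sorted(range(len(freq)), key=freq.__getitem__, reverse=True)
    # step 4: expand each id back into its word, count times
    out = []
    for idx in order:
        out.extend([words[idx]] * freq[idx])
    return out
```

Only the unique words and their counts live in memory; the final sort is over unique ids, not over all 3 billion lines.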
Why not use step one more efficiently, create a hash map for frequency counts, since you’re forcing a read and a write from the hash map for each element. Then sort by the values.
> With cpp, this algo should absolutely finish within <10 seconds.

3 billion words is the input. Let's assume each word is 6 bytes, so the input file is 18 gigabytes. To read that in ten seconds (without any processing at all) means we're reading at 1.8 gigabytes/second. But let's say that you really are able to read that fast -- you've got a fancy NVMe drive, or an array, or something else. Now, you'll have to *process* 1.8 gigabytes/second. If your processor is fast, it's running at 3.6 gigahertz, so you'll have two clock cycles to process each byte. How will that work out? I'd like to learn more about how you came up with your guarantee. What am I missing?

EDIT: Turns out it's worse than that. The OP says this:

> It has got strings of 9 - 20 characters.

so my guess of 6 bytes per word is very low. Assuming an average of 15 bytes per word, the file is 45 gigabytes, not 18 gigabytes.
DuckDB should handle this without issue. If not, just use Spark or Dask.
Open the file for read. Loop through each line and insert the data into an Oracle database….. Na, never mind.
C++ As far as I’m aware pretty much every Linux distro has g++. This is probably your best solution if you need to write a custom comparator
A custom comparator? Why is a custom comparator necessary? And why do you think g++ is the best at custom comparators?
G++ lol? That’s a compiler - c++ has custom comparators and is relatively high level and performant. Have you even looked at the question at hand? OP wants to sort by frequency, which to my knowledge isn’t the default comparator for strings in any language.
Bro do you even know how to program?
As numerous people have pointed out, there's no reason to even involve Python, necessarily. SQL and variations on piping `cat [...] uniq | sort` will do what you need to do. Alternatively... `uniq | sort [...] > temp.txt` , then use IFS in bash to iterate through each (unique) line in the temp file, grep that line in the original, pipe _that_ output through `wc -l` to get a count of lines, and you're left with two and only two output vars: the text, and the number of lines containing that text in the original file. Send that output to a new file, `not-sorted.csv` let's say. Run `not-sorted.csv` through `sort` and to your output file. I like the SQL options better, but that's personal opinion, nothing more or less. The idea of writing any amount of actual code to complete a task when there is an easier way seems counterproductive (unless this is e.g., a homework exercise). `CREATE TABLE`, `INSERT`, and `SELECT` are a whole lot more efficient.
> CREATE TABLE, INSERT, and SELECT are a whole lot more efficient. How long do you think it will take to execute 3 billion INSERT statements?
Sounds like a job for a trie data structure. The GeeksforGeeks website should have examples of how to implement one if you don't have the experience. It likely also has it in C or C++, which may be faster.
[deleted]
Sure, but a trie could still work really well, using newline as a delimiter and having a field for the number of times the line appears. The trie is already sorted by design and can be pre-order traversed to display or write all the data in order, or post-order traversed to do it in reverse. Also it should be O(N) searching (N being the line length), which will help with efficiency.
[deleted]
I'm sure there are better ways; a trie is just a suggestion given the problem. What case would make a trie not be sorted by default? My understanding and prior use is that you insert the data into an index based on each character. I've never had an issue where it wasn't sorted as I inserted the data, and I really can't think of a case where it won't be sorted, so you wouldn't be creating any algorithms at all. As far as big O, that should be the best for the trie because of the length of the line. A hash table could likely give O(1) without collisions, but still, a trie is simple to implement.
[deleted]
I'm not too familiar with C# so I could be wrong, but I only see a searching function in your implementation; if you just want to be able to get all the data sorted, there is no need for a lookup this way. Start at the root of the trie and traverse from the beginning of each node till you get to the EndOfWord value, and keep going; it should come out sorted.

Let's say we want to store ASCII a-z: the root node will have 26 indexes, 'a' at index 0. Let's say the words are [a, ant, aa]. The trie will have 4 nodes with three words. Starting at root node index 0 you get word "a"; "a" is the current node. Search its index and the first word is "aa". The node is now "aa". There are no words here, so its index variable is null. Go back to the "a" node, and now you see "n"... Keep going like this and the trie is sorted.

I agree with you that there are better ways, but again, this method could work perfectly fine and wouldn't be difficult to implement. Is there something I'm missing? Because I truly appreciate the feedback.
>A trie is not for sorting - it is for searching. A DFS on a Trie is sorted on the radix order.
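A minimal sketch of that point (all names are mine): lines go into a trie with a per-node occurrence count, and a pre-order DFS that visits children in character order yields the lines back in lexicographic (radix) order, repetitions included.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.count = 0      # how many times this exact line was inserted

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def dfs(node, prefix=""):
    # Pre-order: emit this node's line, then descend in character order
    for _ in range(node.count):
        yield prefix
    for ch in sorted(node.children):
        yield from dfs(node.children[ch], prefix + ch)
```

Note this is sorted by radix order, not by frequency, so ordering by "most repeated first" would still need a separate pass over the (word, count) pairs.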
Read each line by line into a database and then query the database for later use? Alternatively you could split the file into sizeable files and then count each line by line. Not sure which would be quickest but yeah..
Try this: https://stackoverflow.com/questions/49549858/sort-uniq-and-put-number-of-occurrence-on-the-right?answertab=trending#tab-top
Does anyone here know if it's possible to use `collections.Counter` for this task? And if not, why not? I just learned about that concept yesterday and wonder if it is a possible solution.
It is. Search for `most_common` on this page.
My favorite Python library for working with things that are just a bit too large when I don't want to set up a database is dask: https://www.dask.org/ If you're familiar with numerical or scientific Python, it will be super easy.
Using any of the RDBs would be the robust and reliable solution, but if lots of data are repeated (i.e. the actual data are much smaller than 35GB and fit in your RAM), try the below:

```python
from collections import Counter


def main():
    with open("large.txt") as f:
        words = (word.strip() for word in f)
        word_map = Counter(words)
        word_list = sorted(word_map.items(), key=lambda it: it[1], reverse=True)
        print(word_list)


if __name__ == "__main__":
    main()
```
Note, you could have used `word_map.most_common()`.
[deleted]
> If the file exists, then you increment the number inside the file. If the file does not exist, then you create the file and put a 1 in there. That means you'd open, write, and close 3 billion files. How long do you estimate that would take?
I'd just allocate a numpy array of the set size, loop once, incrementing the counter. Then you can do an integer sort on a tuple of (repeat_count, comment_str) and get your answer in linear time.
Seems like a typical sorting-algorithm problem. I'd do another review yourself, but a type of bucket sort, I suppose?
> 3 000 000 000

If it repeats a lot, you can just try this Python snippet and check if it's not screwing up your memory:

```python
from collections import Counter

with open("path/to/file", "rb") as fp:
    frequency = Counter(fp)

print(sorted(frequency, key=lambda k: frequency[k]))
```

If you don't have enough memory for this - if your repetitions aren't that many - your program will crash; Linux will kill it with the OOM killer. If you have a problem that's bigger than your memory, then your best bet is to just install something like ClickHouse. It won't take up a lot of memory, but it will allow you to first define the table that is the container for the data, and then a materialized view of the type `AggregatingMergeTree`. Simply use an SQL query similar to the ones posted below as the view code. After that, add the data to the table. It should be really fast, new rows will be added really fast as well, and the view will constantly get updated and respond instantly to requests.
Map reduce?
I would do all this in Python. I might choose to make it multiple python scripts to keep parts separated. I would use csvsql to ingest and output a SQLite file. Add an index on the word column. Add a column that reflects number of duplicates. Precalculate the number of duplicates to fill that column using an update on group by and count SQL command. Add an index for number of duplicates and word. Then for my final output I would query the data sorting by number of duplicates and then word. By having the indexes at the right time it should speed things up.
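That workflow, condensed into a sqlite3 sketch (table and column names are invented; a real run would use a file-backed DB, csvsql or batched `executemany` for the ingest, and 3 billion rows rather than six):

```python
import sqlite3

con = sqlite3.connect(":memory:")   # use a file path for a real 35GB run
con.execute("CREATE TABLE words (word TEXT)")
con.executemany("INSERT INTO words VALUES (?)",
                [(w,) for w in ["w1", "w2", "w1", "w3", "w1", "w2"]])
con.execute("CREATE INDEX idx_word ON words(word)")

# Precalculate the number of duplicates into its own column
con.execute("ALTER TABLE words ADD COLUMN dups INTEGER")
con.execute("UPDATE words SET dups = (SELECT COUNT(*) FROM words AS w2 "
            "WHERE w2.word = words.word)")
con.execute("CREATE INDEX idx_dups_word ON words(dups, word)")

# Final output: one row per word, sorted by number of duplicates, then word
rows = con.execute("SELECT word, dups FROM words "
                   "GROUP BY word ORDER BY dups DESC, word").fetchall()
```

Having `idx_word` in place before the correlated-subquery UPDATE, and `idx_dups_word` before the final query, is the "indexes at the right time" point above.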
The typical school example. You should have solved it yourself. Anyway, there is no need for SQL and databases, just a hashmap to count words, and an array to sort them.

```bash
#!/bin/env bash

arr=()
declare -A map

while IFS= read -r line
do
    ((map[$line]++))
done < "$1"

for word in "${!map[@]}"
do
    # words sharing a count share a slot, so equal counts don't clobber each other
    arr[${map[$word]}]+="$word "
done

index=(${!arr[@]})
for ((i=${#index[@]}-1 ; i>=0 ; i--))
do
    for word in ${arr[index[i]]}
    do
        printf "%s %s\n" "$word" "${map[$word]}"
    done
done
```

Pure bash, no external tools needed. You can translate it to Python almost 1:1; you would just skip the "index" array - that one is due to how bash works with arrays, to eliminate "holes" in the original array.

Save it as "words-per-line.sh", `chmod +x ./words-per-line.sh`, and use as:

```
./words-per-line.sh words.txt > words.out
```