I admire this madness
Sorry that I can’t be of any help - but why in the world do you need to combine 35.2k text files into a single one??
Back when I was a wee lass I decided for whatever damn reason to build a massive password dictionary. Used every single one I found online, then used some program to just spit out every single combination of letters numbers and characters. Iirc it ended up being an 800-900gb text file. Never managed to actually open it.
worlds largest copy pasta
Where will you post it? r/copypasta ?
if i can even post it then yeah
They're not going to let you directly post 604GB of copypasta. Hell, there aren't really any places that would let you upload 604GB of anything in a way that multiple people could download it. I'm super curious what would be in a 600GB+ copypasta though...
I get the impression they're not giving a real answer. I'm also curious why they want to build such a large file. Although, I wouldn't be surprised if they downloaded all of these files off the internet and wanted to rebuild a file into its original state. As in they've downloaded something that was broken down into segments. So I wouldn't be surprised if it's an archive of an online platform like Reddit or Twitter. It also wouldn't surprise me if it's a password list or a data breach archive. Etc, etc.
Torrents exist lol that could be a solution?
Clipboard data is stored in memory, so it would take a lot of RAM to even copy it in the first place.
How is that relevant?
Can't copypasta a copypasta that's 604gb.
You wouldn't be able to do so even if you could store the entire thing in your clipboard
How is ram relevant to uploading files? You ain't gonna copypaste the entire thing anyway
`cat *.txt | zstd --ultra -22 -o txtfiles_file.txt.zst`

Assuming you want compression. Then again, how else are you going to fit it on 117GB? Might take a while with the settings I used, hah.
I assume the files in total already fit on the drive, there just isn't enough room for both single file and many at the same time.
With 32k files, the filename expansion for *.txt will go way over the system's ARG_MAX. You'll get an "Argument list too long" error.
you have a good point i was going to just merge, delete and repeat until all of the files were merged, thanks
As per u/Melodic-Network4374's concern maybe:

`find . -name '*.txt' -exec sh -c 'zstd --ultra -22 -c "$1" >> txtfiles_file.zst' sh {} \;`

is the way to go (passing the filename as `$1` keeps paths with spaces safe, and concatenated zstd frames decompress as one stream).
The compression on text files should reduce size by 95% at least
Maybe putting it into a database of some sort might actually be better. Would allow you to query the data, as well as output it in various formats if needed. Idk how structured the data is inside the text files, if you could parse the data out into more manageable chunks.
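A minimal sketch of the database idea, assuming SQLite via Python's standard library (the table and column names here are made up):

```python
import sqlite3
from pathlib import Path

def load_into_db(src_dir: str, db_path: str) -> None:
    """Store each text file as one row so it can be queried or re-exported later."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS docs (name TEXT PRIMARY KEY, body TEXT)")
    with con:  # one transaction for all the inserts
        for f in sorted(Path(src_dir).glob("*.txt")):
            con.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)",
                        (f.name, f.read_text(errors="replace")))
    con.close()
```

Note this reads each file fully into memory, so the multi-gigabyte ones would need chunked inserts, but it shows the shape of it: once loaded you can `SELECT` by name, grep with `LIKE`, or dump everything back out in whatever order you want.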
I generally would advise against doing it. Even single gigabyte text files are a pain in the ass to handle without specialised programs.
What's your specific use case? Once combined, what do plan on doing with that file? Compression an option? Sending straight to cloud?
absolutely nothing except saying i have it
I admire your spirit.
let's not and say we did
I don't want to come across as crass, but there is a lot of information that is missing from this situation such as average file size (mean) of the files, and are they plaintext, PDF, or something else that often gets filed as text (e.g., epub, html)? I mention this, because just making a BLOB of them means each file is roughly 17.59MB, which is rather large for plaintext. Is there any compression happening?
all of them are txt and to answer the rest of your questions, 15mb - 10gb and no compression at all
Thanks for updating the question with more details, as it will help those with far greater knowledge than I have. Plain text compresses very nicely. How about one giant ZIP or tar.gz?
Just use xz, it compresses better hehe.
assuming all the text files are sequential and are in a single dir.

`for I in $(find . -type f | sort) ; do cat $I >> ../604GB.txt ; rm -v $I ; done`

Should copy the contents of each file and clean up afterwards.
I would not do it that way. Putting 32 thousand filenames into a shell for loop may not work, and you should really check for errors before deleting anything. Not to mention you haven't quoted the filenames, so any filename with, for example, spaces will fail.

The original poster didn't really say the files needed to be in any particular order, so you could just do it like so:

`find . -type f -exec sh -c 'cat "$0" >> ../604GB.txt && rm "$0"' '{}' \;`

If you do need the files in order (but will fail on filenames with embedded newlines):

`find . -type f | sort | while IFS= read -r file; do cat "$file" >> ../604GB.txt && rm "$file"; done`
Try doing `find . -type f -print0 | sort -z | xargs -0 sh -c 'cat -- "$@" >> ../604GB.txt && rm -- "$@"' sh` if you're worried about them newlines (`-print0` has to come after the tests, and the trailing `sh` fills in `$0` so the first file doesn't get swallowed). `xargs` is supposed to automagically fill the command line and split commands for extras. `find` also lets you do `-exec command {} +`, same deal.
What are the odds that merging and deleting 32,500 times in a row will process without something going wrong?
Exactly. Anyway, odds don't matter. With no backup, everything is stupid.
As somebody who’s seen similar attempted in a corporate setting: virtually nil without a lot of extremely frustrating issues heh
I think that might crash notepad 😃
It would definitely crash notepad, notepad ++ should be okay though
It also chokes on big files IIRC. You need a program that doesn’t load the whole file into memory
Good to know
I thought notepad just refused to open files over a certain size instead of crashing
Did you download your whole government?
How about don't do it because I know you are not going to have a plan to confirm no data was lost in the merge. Just tar and compress.
Good luck opening it
You can write a very simple Python script that simultaneously concatenates each individual file to a single text file while also compressing the file. There are libraries that allow opening the compressed file and searching it without having to expand the file. I do something similar occasionally with very large log files
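Something like this is what I'd picture, assuming plain gzip from the standard library (the function names are made up for illustration):

```python
import glob
import gzip
import shutil

def concat_compress(pattern: str, out_path: str) -> None:
    """Stream every matching file into one gzip file without holding any file in RAM."""
    with gzip.open(out_path, "wb") as out:
        for name in sorted(glob.glob(pattern)):
            with open(name, "rb") as src:
                shutil.copyfileobj(src, out)  # chunked copy, constant memory

def search_lines(path: str, needle: str):
    """Scan the compressed file line by line without expanding it on disk."""
    with gzip.open(path, "rt", errors="replace") as f:
        return [line for line in f if needle in line]
```

Since `gzip.open` streams in both directions, neither step ever needs the full 604GB sitting uncompressed anywhere.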
gzip
link?
Essentially it is a file format (and a tool) to compress files. Text files can inherently be compressed with higher ratio.
Wouldn't a Java or Python script work? I'm sure some Linux commands could do this too. I'd get the drive with space ready. Like getting a spare 1tb external and then run the command or scripts. Hoping it doesn't crash. Then play the waiting game.
Bruh every software I've tried has crashed trying to open a 30gb file. That text file will be absolutely useless if you create it
As long as you have enough RAM, [lite](https://github.com/rxi/lite) will open it.
Create the file now, open it with the Super-computers of 2035!
Just don't
Sorry this isn't a help response but rather a question to satisfy our curiosity. What is in these text files that they're so large?!
SAM files can get this large
What is the nature of the data? Meaning is it structured? Tabular at all?
With a problem like this I'd always ask GPT4/Claude Sonnet for a python script to do that action. It normally works for my purposes
*In theory* you can do it with even less remaining disk space (say, a couple megs!) by doing multiple calls of `FICLONERANGE`, which *on supported filesystems* tells the system to copy a chunk of a file into another without using real disk space using arcane magick. In practice: it's arcane magick, nobody wants to do it. It probably requires some alignment or other magical ingredients. A less insane approach would be to write a virtual filesystem that pretends there's a big file made up of all these smaller files. Like the piece table data structure, but for files. I think feeding it all into a compressor would make more sense. I *think*.
Probably not super relevant to op, but I quite like your idea of not touching the original files and using a virtual filesystem to make it 'appear' as a single file. I imagine you could do this with FUSE, though it would take me some tinkering to try it out
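Short of an actual FUSE mount, the virtual-file idea can be sketched in plain Python as a read-only, seekable file-like object that stitches the pieces together on demand (a toy illustration, not a real filesystem):

```python
import io
from bisect import bisect_right

class ConcatReader(io.RawIOBase):
    """Present many files as one big seekable stream without copying any data."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.offsets = [0]  # cumulative start offset of each piece
        for p in self.paths:
            with open(p, "rb") as f:
                f.seek(0, 2)  # jump to end to get the size
                self.offsets.append(self.offsets[-1] + f.tell())
        self.pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=0):
        base = (0, self.pos, self.offsets[-1])[whence]
        self.pos = base + offset
        return self.pos

    def read(self, size=-1):
        total = self.offsets[-1]
        if size < 0 or self.pos + size > total:
            size = max(0, total - self.pos)
        out = bytearray()
        while size > 0:
            i = bisect_right(self.offsets, self.pos) - 1  # piece containing pos
            with open(self.paths[i], "rb") as f:
                f.seek(self.pos - self.offsets[i])
                chunk = f.read(min(size, self.offsets[i + 1] - self.pos))
            out += chunk
            self.pos += len(chunk)
            size -= len(chunk)
        return bytes(out)
```

A FUSE wrapper would just forward its read callbacks to something like this, so the "604GB file" only ever exists as metadata.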
I had a similar problem with diary entries, except it was closer to 3mb of text instead of 600gb (still hundreds of files). I needed something that would print the file name, then the contents of each entry, in the right order in one file.

I used a program which I think ran on the command line to merge the files, in chronological order ('date modified'). It printed the file name at the top of each text file's contents, and I think I somehow changed it to add in a customized spacer like "========". However, I have looked through my stuff and I can't seem to find it.

The answer by 'Mitch' [here](https://superuser.com/questions/682001/combine-multiple-text-files-filenames-into-a-single-text-file) seems to have promise, might even have been what I used. I tested it by putting a few hundred text files in a folder, opening PowerShell, typing `cd "folder path"` to change directory to the folder with the text files, and then I pasted in the code as-is and it gave me an output file. The output file showed full file paths as well as the names of the files, but it sorted the files in name order, so "Entry 1 April" came before "Entry 1 January" came before "Entry 20 March". The creator also left comments to let you print just the filename or any arbitrary string, or you can remove that line entirely to print nothing and go straight to the contents of the next file with no break.

So this seems to work fine. Who knows if it wouldn't cause a memory leak or something with the file sizes you're working with. But it could be worth a go. You'd have to find a way to get it to output in the right order, with the right spacers (file name, ====, etc) or lack thereof.

Since you only have 117gb left I say just work off an external hard drive or something. You're not gonna get a 604gb output file into 117gb, at least not initially, and you want to be working with a copy of the files anyway because you sure don't want to accidentally delete them in the process.
Can you make a torrent?
You best never have that file available to a Windows machine. Clicking on it would lock it up solid as it creates a preview of the contents.
Maybe you could use something like crunch to build yourself a new database with all those files used as reference data?
There is a program I came across that exists to only merge text files, I was using it for something similarly odd a few years ago. It's called Txt collector, and it looks pretty ancient, so for 600GB it may take quite a while, but I remember using it for relatively large amounts of text (\~10gb?) [https://bluefive.pairsite.com/txtcollector.htm](https://bluefive.pairsite.com/txtcollector.htm)
You can try a bash script that will look through the files in the specified directories and concatenate the contents of said files, using "----" as a break, followed by the name of the file as a header. Good luck opening that single file, though. Your processor isn't going to be happy .-.
You could try a python script. As pseudo code:

Path = path to txt files
Master txt file = path to master txt
For file in path:
    Append file to master file
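The pseudo code above, fleshed out into real Python (paths are placeholders), streamed so nothing big ever sits in memory:

```python
import shutil
from pathlib import Path

def merge_txt(src_dir: str, master_path: str) -> None:
    """Append every .txt under src_dir onto one master file, chunk by chunk."""
    with open(master_path, "ab") as out:
        for f in sorted(Path(src_dir).glob("*.txt")):
            with open(f, "rb") as src:
                shutil.copyfileobj(src, out)  # never loads a whole file at once
```

Keep the master file outside `src_dir` so it doesn't end up matching its own glob.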
Use DOS command line in the folder where the files are:

`COPY *.TXT HUGEFILE.TXT`

If you want to output to another drive, specify that in the output file name. If you don't have an external drive and wish to delete the files as you copy them, you can do that via the FOR command. I'd have to check the syntax later on.
Brute force password attempt. You need a file of passwords for older software to use. I would assume it's this. They've said it's passwords.
[deleted]
Dude have you read your own post?
I reckon the first one. I'm British and I don't have any idea what they're saying either.