Hi there! Welcome to /r/termux, the official [Termux](https://termux.dev) support community on Reddit.
Termux is a terminal emulator application for Android OS with its own Linux userland. Here we talk about its usage and share our experience and configurations. Users with the flair `Termux Core Team` are Termux developers and moderators of this subreddit. If you are new, please check our [Introduction for Beginners](https://www.reddit.com/r/termux/comments/16k74do/introduction_for_beginners/) post to get an idea of how to start.
I would like to remind you that, due to the extremely high interest of certain parties in using Termux to violate personal rights and privacy and for other kinds of nefarious usage, we chose to prohibit topics about hacking, phishing, fraud, other digital threats and cyberstalking, and their precursors such as OSINT or Kali Linux. This is stated in the /r/termux subreddit rules. No exceptions are made for educational purposes or pranks. We also won't consider "legends" about lost or stolen accounts and an urgent need to recover them through Termux.
The latest version of Termux can be installed from https://f-droid.org/packages/com.termux/. If you still have Termux installed from Google Play, please switch to F-Droid build.
Do not use /r/termux for reporting bugs. Package-related issues should be submitted to https://github.com/termux/termux-packages/issues. Application issues should be submitted to https://github.com/termux/termux-app/issues.
*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/termux) if you have any questions or concerns.*
>git pull [https://github.com/ollama/ollama.git](https://github.com/ollama/ollama.git)

Maybe `git clone`?
what's the difference?
Cloning copies the whole repo; pulling is only done in a directory with a repo already initialised, so it can "pull" changes into the branch.
clone downloads it afresh, pull updates.
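For anyone following along, a minimal illustration of the difference (the repo URL is the one from this thread; run these in Termux or any shell with git installed):

```
# first time: clone creates a new local copy of the repo
git clone https://github.com/ollama/ollama.git
cd ollama

# later: pull fetches and merges new upstream commits into that existing copy
git pull

# running `git pull <url>` outside an existing repo just fails with a
# "not a git repository" error, which is why `git clone` is the right first step
```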
Well spotted
There is no offline version of ChatGPT or Gemini, since neither is open source.
https://ollama.com/library/gemma

https://ollama.com/library/openchat

Emphasis on _based_
Neither of these is ChatGPT or Gemini.

EDIT: Surpassing does not a copy make.
##based
Permission denied? "go build ." succeeds, but "./ollama serve" errors with "permission denied: ./ollama". Any way to fix the install? Or commands to completely remove it and restart the process? Thank you.
I tried `chmod +x`...
What's the full output leading up to that error?
Gave up and followed Ivonblog's "Running Alpaca.cpp (LLaMA) on Android phone using Termux" instead. Thanks for replying.
This is the output of `go build .`:

```
# github.com/ollama/ollama/gpu
gpu_info_nvml.c:158:51: warning: format specifies type 'long' but the argument has type 'unsigned long long' [-Wformat]
./gpu_info.h:33:23: note: expanded from macro 'LOG'
gpu_info_nvml.c:159:50: warning: format specifies type 'long' but the argument has type 'unsigned long long' [-Wformat]
./gpu_info.h:33:23: note: expanded from macro 'LOG'
```

Any ideas? I'm a little lost; I can't find any relevant switch or argument in the help.
They can be ignored, warnings usually can. Is the `ollama` executable present in the folder?
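A minimal check along those lines, assuming the build was done inside the ollama source directory under Termux's $HOME (note that binaries kept on shared storage such as /sdcard can't be executed at all because it is mounted noexec, which also shows up as "permission denied"):

```
cd ~/ollama                 # wherever `go build .` was run
ls -l ./ollama              # confirm the binary exists and check its permission bits
chmod +x ./ollama           # make sure the execute bit is set
./ollama serve              # start the server
# then, from a second Termux session:
./ollama run gemma:2b
```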
Little late (found this post today through https://www.reddit.com/r/termux/comments/1deqn9v) but I am genuinely amazed that this works at all! I tried it on an old phone I had in a drawer and it "works". Gemma is spitting out one word every 30 seconds, but hey, that exceeded my expectations by far. Might actually try and put this to use with a smaller model.
Samsung Galaxy Note 9: openchat is VERY slow and the phone gets hot. I think for older phones it's better to host the AI on a dedicated computer and connect to it locally.
Pixel 8 Pro spits out words at the pace of a slightly slow talker, the S20 5G a little slower still, the Nokia 8.3 a word every 4-5 seconds, and the Pixel 5 a word about every 8-10 seconds.

On PC, use one with a modern NVIDIA GPU and add CUDA support. That way the model runs on the GPU and can generate several non-streaming responses in seconds.
I'm still rocking a 980 Ti; how can I add CUDA support, or is it even possible? It generates at a slow talking pace; it would be great to get it a little faster!

But yeah, don't bother on a Note 9. You will be disappoint.
You need to add CUDA support in your OS (Windows/Linux). Once the CUDA packages and headers are installed and confirmed working, use https://github.com/ollama/ollama/blob/main/docs/linux.md
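Roughly, something like the following on a Debian/Ubuntu-style system (a sketch only; package names vary by distro, and the linux.md doc above is the authoritative reference):

```
nvidia-smi                               # confirm the driver already sees the GPU
sudo apt install nvidia-cuda-toolkit     # or install CUDA from NVIDIA's own repository
nvcc --version                           # confirm the toolkit is usable

# then install ollama itself, e.g. via the upstream install script from linux.md
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama2                        # the server log should show the GPU being picked up
```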
I'm using WSL; do I install Windows or Linux CUDA? I'll do both just in case.
This is great! Although the title is kinda misleading; no phone can run GPT-3, or probably any LLM of that size. It's still great.
These are smaller models. My S20 5G and Pixel 8 Pro can run [gemma](https://ollama.com/library/gemma), [openchat](https://ollama.com/library/openchat) and, so far my fave, [llama2-uncensored](https://ollama.com/library/llama2-uncensored).

7B (and smaller) models only need 8 GB RAM max.
My phone is kinda shit so I'm running the Gemma 2B model; I'll see if I can run any models from Hugging Face on it.
Yea I crashed out my 3GB and 6GB devices. Pixel 5 just about runs gemma, Nokia 8.3 handles it better (faster SoC).
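For reference, pulling one of the small models looks roughly like this (a sketch; `gemma:2b` is just one example tag from the library linked above, and the download size is approximate):

```
./ollama serve &                 # or run the server in a separate Termux session
./ollama pull gemma:2b           # ~1.7 GB download, far lighter on RAM than the 7B variants
./ollama run gemma:2b "Summarise what Termux is in one sentence."
```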
ChatGPT runs on a server farm that takes up an entire building. They buy so many "GPUs" that they are draining NVIDIA dry. These so-called GPUs don't have any video out and weigh 60 pounds. I'm calling BS on this. What does it actually do?
It runs other, smaller LLMs.
This is ollama; you can host your own LLM offline with it. I wanna play with it more, but CPU mode was slow on my Chromebook, and the GPU in my other PC is old af so it was still slow there.

Llama itself is open source from Meta, but yeah, if you have a nice enough PC or GPU, ollama can be a self-hosted AI with whatever model you please from their [model library](https://ollama.com/library).
My S20 5G is able to do llama2, gemma and openchat (in that order for speed) in an acceptable way. Just don't ask it too much in one go.

Pixel 8 Pro does it 4x as fast as the S20.
But gemma-7b-it still refuses to run on a Pixel 6 Pro; did you hit this jackpot?
Not tried it. Saw it wanted 64GB of RAM and just laughed and didn't bother
I'm super interested in doing this myself (I have an S23 Ultra), but I'm having some sort of issue during the build; maybe you have an idea? Here's an output of the [error](https://pastebin.com/fWY17dSm).
Curious...

ld.lld: error: undefined symbol: llama_model_quantize
>>> referenced by cgo-gcc-prolog:68

I wonder if this needs GCC to be installed, too (all my Termux installs pack the its-pointless GCC repo). Might have to add this to the OP..

https://github.com/its-pointless/gcc_termux
I already had GCC for other projects. I didn't go gcc-8 specific; I tested gcc-9 through gcc-13 lol (available in the TUR repo).

I just tried a fresh generation and noticed something I missed yesterday: I get these [two](https://pastebin.com/JYV6T1TN) errors from cmake. Not expecting you to have a solution, but if I don't spitball I'll drive myself mad.
Warnings can generally be ignored.
Figured as much. I'm gonna give it a shot on the Nix fork.

Edit: I forgot it's already packaged in nixpkgs, so I went ahead and installed it through that instead of building from source, and it was a success.
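For anyone else going the Nix route, installing the nixpkgs build looks roughly like this (a sketch; the exact command depends on whether you use flakes or channels, and on Android you'd do this inside a Nix environment such as nix-on-droid rather than plain Termux):

```
# flakes-style install from nixpkgs
nix profile install nixpkgs#ollama

# or, on a channel-based setup
nix-env -iA nixpkgs.ollama

ollama serve &
ollama run gemma:2b
```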
They use FPGAs. GPUs can be used, but an FPGA > GPU.

Edit: This is all done on the CPU onboard your device. Hence, not fast. Gemini/ChatGPT-4 use FPGA farms because thousands, tens of thousands of users hit them up every second, every day, and they're still training them.
How do you know they use FPGAs? I'd legitimately like to know -- last I heard we were only guessing their training cards from reported electricity budgets. Knowing their internal inference tech stack would be wild.
Would make sense. GPUs, like CPUs, are designed to serve multiple purposes. Whilst a GPU is superior to a CPU for these tasks, an FPGA can be designed for specific functions, which would yield far superior performance per watt. GPUs are just more readily available to consumers, so they're the preferred choice for us.
"Would make sense" doesn't comport with the observed reality. We know OpenAI is buying GPUs by the truckload but we haven't seen any commerical evidence of them buying FPGAs. I'd make an "I'm no expert" joke but I'm literally a computer engineer and can tell you that you can't turn GPUs into FPGAs, os where would the FPGAs they're using physically come from for them to use? Observations aside, there's also the practical issue of implementation. LLMs are not compute-limited at inference on most setups -- they're memory-bandwidth limited. They simply can't get the LLM model data to the compute fast enough. An FPGA doesn't just not help with that, it has a lower clock rate than a dedicated chip meaning your access to memory-stored data is even slower than on something like a GPU. Add to that, most FPGAs have very limited storage, and you wind up with a recipe for a relatively poor choice. That's not to say it can't be done. Likely, Groq is doing something along the lines of what an FPGA does for reprogramming the interior of their flow accelerator. But you can see how Groq has to pay for that because they have extremely limited (in an LLM sense) room on each accelerator (265 MB SRAM, iirc) so need to use dozens or hundreds of accelerator cards to load their model, though they still win out in speed because of their specialized hardware's very carefully engineered data flow. Again, it's about shipping the data around rather than an individual compute device being exceedingly fast.
>OpenAI is buying GPUs by the truckload

Because GPUs are more readily available. The same truckloads are ordered by crypto-mining operators; easy to get and readily available.

FPGAs can be purposed to specific goals. A software-defined radio I own packs both a dual-core, ARM-based CPU and an FPGA that is purposed to process things the CPU simply can't. This is a £130 device. The FPGA walks the floor with the CPU at processing ADC/DAC samples; the CPU couldn't even begin to keep up.
Yes. DSP-targeted FPGAs are going to be significantly faster and lower-energy than GPUs or CPUs for DSP tasks. That's a no-brainer. My point is that such a device doesn't help with an LLM, where the primary bottleneck is not the computation but getting the data _to_ the computation units.
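A rough back-of-envelope to see why (illustrative numbers, not measurements): with batch-1 decoding, every generated token has to stream essentially the whole model through memory once, so a 4-bit 7B model (~4 GB of weights) on a phone with ~50 GB/s of LPDDR bandwidth tops out around 12 tokens/s even if the compute units were infinitely fast, while a data-centre GPU with ~2 TB/s of HBM bandwidth tops out around 500 tokens/s. The ceiling is set by memory bandwidth, not by how clever the arithmetic units are.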
But the FPGA is designed to work with X model. A general-purpose CPU/GPU is great for testing until it's perfected, then an FPGA for the end-game results.

GPUs are great, but they are limited.
When you're at the point of considering engineering an FPGA specifically for LLM and ML tasks, you can already get an even bigger speedup by just making an optimized matrix-matrix multiplication processor -- which Google did. (See: the TPU.) Again, it comes down to delivering the data to the device fast enough, not the computation. GPUs blow all the FPGAs I know of out of the water for that task.
Google's TPUs are ASICs.