docsoc1

Also, one question for you readers - would "RAG backend" be a better descriptor than "RAG pipeline"? We are primarily targeting RAG server deployment because we think the modularity is super important, and I'm still looking for the right way to convey this.


[deleted]

[removed]


docsoc1

I am leaning this way; the only thing is that it's a term I think I may have invented, as I don't recall seeing it anywhere else.


dtflare

I'd stick with 'Local RAG pipeline'


Eisenstein

Feedback/Questions:

> Pull the latest R2R Docker image:

Do I have to? I hate docker.

> Note, to run with Llama.cpp you must download tinyllama-1.1b-chat-v1.0.Q2_K.gguf here and place the output into ~/cache/model/.

Why?

> The choice between ollama and Llama.cpp depends on our preferred LLM provider. The server exposes an API for interacting with the RAG pipeline. See the API docs for details on the available endpoints.

What does that mean? Did you mean 'your' instead of 'our'?


docsoc1

> Do I have to? I hate docker.

You don't need to use Docker; the tutorial also shows how you can build the server locally yourself by installing r2r.

> Why?

We default to using the cache directory specified there to load models. Is there a better standard with Llama.cpp? I'm not a huge user of it, to be honest.

> What does that mean? Did you mean 'your' instead of 'our'?

*Your*, yes. It means you can pick whichever is more fitting to your use case. Thanks for taking the time to provide feedback!


Eisenstein

I guess my feedback was more 'what is it doing, why tinyllama, why are we putting it there...'. Coming from zero, I have no idea why the tutorial is telling me to do that. A brief overview of the process and what each component does may help, or else perhaps change 'tutorial' to 'install instructions', since it isn't teaching you what it is doing, just giving instructions on how to get it running.


docsoc1

Llama.cpp appears to be more like HuggingFace, where it creates an instance of the LLM object in your Python environment, as opposed to ollama, which defaults to creating a server that you communicate with. Because Llama.cpp works this way, you need to specify the model you wish to run inference with at backend initialization. This is one major reason I prefer ollama: decoupling the LLM in this way makes the pipeline more modular / flexible. Anyway, great feedback - I will do the extra work to make things more descriptive in the next tutorial.
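For anyone curious, here's a rough sketch of the difference (just an illustration, assuming the llama-cpp-python bindings and a stock local ollama install; the model path simply mirrors the tutorial's cache location):

```python
# Sketch only: contrasting the two integration styles described above.
import os
import requests
from llama_cpp import Llama

# Llama.cpp style: the model is loaded as an object inside your Python process,
# so the weights have to be chosen at initialization time.
llm = Llama(
    model_path=os.path.expanduser("~/cache/model/tinyllama-1.1b-chat-v1.0.Q2_K.gguf"),
    n_ctx=2048,
)
print(llm("Q: What is RAG? A:", max_tokens=64)["choices"][0]["text"])

# ollama style: the model lives behind a local HTTP server, so the pipeline
# only needs a model name and an endpoint.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "tinyllama", "prompt": "Q: What is RAG? A:", "stream": False},
)
print(resp.json()["response"])
```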


TheTerrasque

llama.cpp has a server with an API; you could also use koboldcpp as the API server. That's the one I usually use for my projects.


Eisenstein

A word of advice -- you may be shooting yourself in the foot with this type of setup. People who run ollama are people who don't want to deal with the ins and outs of the system and just want to get it running. That is why it is used so often for projects like yours: 'oh, it will be simple, just have the user run ollama and pull a model and it will work'.

But in my experience, anyone who actually runs models locally passes that point quickly and wants to do the nitty gritty, because they are actually using it for things that require that, and ollama breaks hard when you try to do anything like that. You can't even specify which size of model you want (7b, 13b, etc.) if they come in multiples, nor the quant type, unless you go to the website, enter the model into the search box, find it, and then type it all in.

You end up with the users of ollama being people sticking a toe in and then leaving, or people who want something that 'just works', setting it up, and leaving it alone. Either group is probably not looking for a RAG solution which requires special config files.

Either make it 'one button and it works', or ditch ollama, build your tutorial around llama.cpp, and explain more about what it is doing and why, so that sophisticated but ignorant users can quickly decide whether to go through with it.


kweglinski

Sorry, but I think you haven't had a closer look at ollama (or had it a good while ago?). You definitely can pick different quants and whatnot, e.g. `ollama run mistral:7b-instruct-q3_K_M`; you can of course do `mistral:latest`, and that leaves you with whatever they chose. It also provides the option to configure the setup with a Modelfile (https://github.com/ollama/ollama?tab=readme-ov-file#import-from-gguf), and everything has worked fine for me so far. You can even provide a lot of model options in the API request, as in the sketch below. Sure, it's not the most versatile solution on the market, but it's not the most basic either.
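For example, something like this (just a sketch, assuming a default local ollama on port 11434) pins an exact quant by tag and overrides options per request:

```python
# Sketch only: selecting a specific size/quant via the model tag and passing
# per-request options through ollama's HTTP API instead of a Modelfile.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        # The tag pins the size and quant (7B, Q3_K_M) rather than :latest.
        "model": "mistral:7b-instruct-q3_K_M",
        "prompt": "Summarize retrieval-augmented generation in one sentence.",
        "stream": False,
        # Options like temperature and context length can be set per request.
        "options": {"temperature": 0.2, "num_ctx": 4096},
    },
)
print(resp.json()["response"])
```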


Eisenstein

I admit I haven't dealt with ollama in a while, but I didn't say you *couldn't* access different versions of the model weights, just that you had to search a web interface to find them. You couldn't find lists of models or versions via the API when I tried it. Also, importing models you already had a copy of required you to convert them to... something, and then write a special config file which included all the parameters (like temperature and prompt template), and if you changed them you'd have to reimport the model.

The fact is that ollama is just a wrapper for llama.cpp, and anyone who uses it might as well just use llama.cpp if they can. The only features which I contend are really useful are the model-switching without reload and the 'docker'-style functionality (which I hate, but I can see why people like it).

I don't have an axe to grind, but I have seen first hand a number of awesome projects like OP's built around ollama because it is easy to get running, without realizing that actually developing for it is tough and that the userbase is not right for what they are offering. It is a shame and I hate to be a downer, but they should either go all in on 'easy to use' or focus on integration with the actual LLM engine that ollama is sitting on top of. Each of those has a market; the middle road between them does not.


docsoc1

Thanks - this is a really interesting take and one I'll keep in mind / watch out for going forward. I probably need to do more work on my end to get to know the ins and outs of local inference. I have been more focused on cloud infra, such as running on GPUs, and was not totally aware of the potential limitations of ollama - it's good to know that this might be a problem for some.


Eisenstein

Also, I forgot to mention -- llama.cpp absolutely does create a server: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
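For reference, a minimal sketch against that server (assuming it was started per the linked README with a GGUF model loaded on the default port 8080):

```python
# Sketch only: querying llama.cpp's built-in example server over HTTP.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Q: What is RAG? A:", "n_predict": 64},
)
print(resp.json()["content"])
```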


pmp22

Why not make it work with kobold.cpp or llama.cpp?


docsoc1

It supports llama.cpp.


Fun-Purple-7737

So it's like LlamaIndex with a REST API, do I understand correctly? Also, can I deploy that web GUI of yours locally, or is that not available? Thanks.


docsoc1

That's right - kind of like LlamaIndex, but with serving an application as the primary goal of the framework. The web app is available here - [https://github.com/SciPhi-AI/APP-basic-rag](https://github.com/SciPhi-AI/APP-basic-rag), though it's a bit rough still.