

PipePipe is better than NewPipe. I use the VLC front end from F-Droid for local music because the built-in Android back end is VLC. For everything else, in browser.


I haven’t looked into the issue of PCIe lanes and the GPU.
I don’t think a narrower PCIe bus should matter much, in theory, if I understand correctly (unlikely). The only time a lot of data is transferred is when the model layers are initially loaded. With Oobabooga, when I load a model, my desktop RAM monitor widget usually does not even have time to refresh and show how much memory was used on the CPU side. What is loaded on the GPU is around 90% static. I have a script that monitors GPU memory so that I can tune the maximum number of layers, and I leave headroom for the context to build up over time, but there are no major changes happening aside from the initial load. You just set the number of layers to offload to the GPU and load the model. However many seconds that takes is an irrelevant startup delay that only happens once when the server starts.
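Roughly, a sketch of that kind of monitor looks like this, assuming an NVIDIA card with nvidia-smi on the PATH (an AMD equivalent with rocm-smi would be similar). It just polls used/total VRAM and drops it into the terminal title bar:

```python
#!/usr/bin/env python3
# Sketch: poll nvidia-smi and show used/total VRAM in the terminal title bar
# so you can watch headroom while dialing in how many layers to offload.
import subprocess
import time

def vram():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]          # first GPU only
    used, total = (int(x) for x in out.split(","))
    return used, total

while True:
    used, total = vram()
    # \033]0;...\007 sets the window title on xterm-compatible terminals
    print(f"\033]0;GPU {used}/{total} MiB\007", end="", flush=True)
    time.sleep(2)
```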
So assuming the kernel modules and hardware support the narrower link, it should work… I think. There are laptops with options for an external GPU over Thunderbolt too, so I don’t think the PCIe bus is too baked in.
I prefer to run an 8×7B mixture-of-experts model because only 2 of the 8 experts are ever active at the same time. I run it as a 4-bit quantized GGUF and it takes 56 GB total to load. Once loaded it is about as fast as a 13B model but has ~90% of the capability of a 70B. The streaming speed is faster than my fastest reading pace.
A 70B model streams at my slowest tenable reading pace.
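For reference, the split-load setup for that kind of model looks roughly like this through llama-cpp-python, which is what Oobabooga wraps. The file name and layer count are placeholders; the same knob shows up as n-gpu-layers in the Oobabooga loader:

```python
# Sketch of CPU/GPU split loading with llama-cpp-python.
# Tune n_gpu_layers until VRAM is nearly full but still leaves headroom
# for the context cache to grow.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=18,   # layers offloaded to the GPU; the rest stay on the CPU
    n_ctx=8192,        # context window; VRAM use grows as the chat fills up
)

out = llm("Q: What is a mixture of experts model?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```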
Both of those sizes (the 8×7B and a 70B) are far more capable than any of the smaller models, even if you screw around with training. Unfortunately, this streaming speed is still pretty slow for most advanced agentic stuff. Maybe if I had 24 to 48 GB of VRAM it would be different; I cannot say. If I were building now, I would be looking at which hardware options have the largest L1 cache and the most cores with the most advanced AVX instructions. Generally, anything with efficiency cores drops the advanced AVX instructions, and because the CPU schedulers in kernels usually cannot handle that asymmetry, consumer hardware ends up with poor AVX support. It is quite likely that the problems Intel has had in recent years have been due to how they tried to block consumer parts from accessing the advanced P-core instructions, which were only disabled in microcode. Using them requires disabling the E-cores or setting up CPU set isolation in Linux or the BSDs.
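A rough approximation of that isolation, without a full cpuset, is pinning the inference process to the P-core hardware threads. This is just a sketch; the thread numbering is an assumption for a 12700-style layout, so check lscpu -e for your own chip:

```python
# Pin this process (and its children) to P-core hardware threads only (Linux).
# Assumption: on a 12700, the 8 hyperthreaded P-cores are threads 0-15 and the
# E-cores come after that; verify with `lscpu -e` before using these numbers.
import os

P_CORE_THREADS = set(range(0, 16))
os.sched_setaffinity(0, P_CORE_THREADS)  # 0 = the current process

print("running on threads:", sorted(os.sched_getaffinity(0)))
```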
You need good Linux support even if you run Windows. Most of the good, advanced AI stuff will be done through WSL if you have not ditched Windows for whatever reason. Use https://linux-hardware.org/ to check device support.
The reason I said to avoid consumer E-cores is that there have been articles popping up lately about all-P-core hardware.
The main constraint for the CPU is the L2 to L1 cache bus width. Researching this deeply may be beneficial.
Splitting the load between multiple GPUs may be an option too. As of a year ago, the cheapest way to get a 16 GB GPU in a machine was, by a considerable margin once everything was added up, a second-hand 12th-gen Intel laptop with a 3080 Ti. It is noisy, gets hot, and I hate it at times, wishing I had gone with a server-style setup for AI, but I have something and that is what matters.
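If you do end up with multiple GPUs, llama.cpp can split the layers across cards; through llama-cpp-python that is the tensor_split argument. A minimal sketch, with the file name and ratios as placeholders:

```python
# Sketch of splitting one model across two GPUs with llama-cpp-python.
# tensor_split gives the proportion of layers per card; values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=-1,            # -1 = offload every layer
    tensor_split=[0.5, 0.5],    # half the layers on each of two GPUs
)
```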
Most of the key stuff is not on GitHub, or GitHub is just a mirror. The heir apparent to Linux is Greg Kroah-Hartman, and he moved to Europe a long time ago.
No mobile devices are safe. They are all proprietary black boxes at the hardware level. If the shit hits the fan, it is back to dumb phones and x86 computers. Digital doomsday preppers are not sounding all that crazy right now, IMO.
I have gotten weird rate-limiting behavior from GitHub because I will not whitelist their stalkerware collector server. They also pushed two-factor to stalk and exploit through the only documented path they wanted people to take. I quit because of it.


Abstract solutions for content recognition with a bot on a server are not a platform-specific issue. The dev is skilled and likely on Matrix too.


It is a bot that identifies CSAM images. The dev is very skilled. The problem is content recognition on a server, so in the abstract it is the same problem.


Search for posts or contact db0. IIRC they worked with LW admin and others to create a filter for this using a very small AI model. It should be on their Git.


Plan 9


Need max AVX instructions. Anything with P/E cores is junk. Only enterprise P-cores have the full AVX instruction set. When P and E cores are mixed, the advanced AVX is disabled in microcode because the CPU scheduler is unable to determine whether a process thread contains an AVX instruction, and there is no asymmetric scheduler that handles this. On early 12th-gen Intel parts, the enterprise P-core microcode could allegedly be run if swapped in manually. This was later “fused off” to prevent it, probably because Linux could easily be adapted to asymmetric scheduling but Windows probably could not. The whole reason W11 had to be made was the E-cores and the way the scheduler and the spin-up of idle cores work, at least according to someone working on the CPU scheduler at Linux Plumbers around 2020. Asymmetric schedulers already exist for ARM on Android.
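A quick way to sanity check what AVX level a given chip actually exposes under Linux is just reading /proc/cpuinfo; minimal sketch:

```python
# Print which AVX-family flags the kernel reports for this CPU (Linux only).
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

print("AVX flags:", sorted(x for x in flags if x.startswith("avx")))
# On consumer P/E hybrids you will typically see avx and avx2 but no avx512* here.
```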
Anyways, I think it was on Gamers Nexus in the last week or two that Intel is doing some all-P-core consumer stuff. I’d look at that. According to Chips and Cheese, the primary CPU bottleneck for tensors is the bus width and clock management of the L2 to L1 cache.
I do alright with my laptop, but I haven’t tried the R1 stuff yet. The 70B Llama 2 models I ran were untenable on CPU only with a 12700. Split with a 16 GB GPU, a 4-bit quantized version runs a little slower than my reading pace.


Not unless an HTTP port is open too. If the only open port is HTTPS, you have to have the certificate. With my AI stuff, it acts like the host is down if I try to connect over HTTP. You have to have the certificate to decrypt anything at all from the host.


Sorta. You have to install your certificate authority into the browser, and it might complain about verification, but it will still connect with encryption.
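Same idea outside the browser: point the client at your own CA file and it verifies and decrypts like normal. A minimal sketch, with the hostname and file names as placeholders:

```python
# Verify a self-hosted HTTPS endpoint against your own CA instead of the
# system trust store. Hostname, port, and file name are placeholders.
import requests

resp = requests.get(
    "https://homeserver.lan:8443/",
    verify="my-root-ca.pem",   # the CA you generated and trust at home
)
print(resp.status_code)
```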


I mean more like a self-signed TLS certificate with your own host manually set in the browser. Then only make the TLS port available, or something like that. If you have access to both (all) devices, you should be able to fully encrypt by brute force and without registering the certificate with anyone. That is what I do with AI at home.
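A bare-bones sketch of the server side of that, just to show the moving parts. Standard library only; cert.pem and key.pem stand in for whatever you self-signed, and the port is arbitrary:

```python
# Minimal HTTPS-only server wrapped with a self-signed cert (stdlib only).
# cert.pem / key.pem are placeholders for whatever you generated yourself.
import http.server
import ssl

httpd = http.server.HTTPServer(("0.0.0.0", 8443),
                               http.server.SimpleHTTPRequestHandler)

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile="cert.pem", keyfile="key.pem")
httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)

httpd.serve_forever()  # plain http:// connections to this port will just fail
```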


I’ve half-assed thought about this but never actually tried to self-host. If you have access to all the devices, why not just use your own self-signed certificates to encrypt everything and require the certificate for all connections? Then there is never a way to log in or connect, right? The only reason for any authentication is to make it possible to dial into your server from any connection, so is that a bug or a feature? Maybe I’m missing something fundamental in this abstract concept that someone will tell me about?


I’ve tried 3 times so far in Python/Gradio/Oobabooga and never managed to get certs to work, or to find a complete visual reference guide that demonstrates a working example like what I am looking for on a home network. (Only really commenting to subscribe and watch this post develop, and to solicit advice.)
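For reference, these are the knobs I have been poking at on the Gradio side, as far as I can tell; the paths are placeholders and the cert/key are assumed to have been self-signed separately (openssl or mkcert), so no guarantee this is the missing piece:

```python
# Gradio launch() arguments for serving over a self-signed certificate.
# cert.pem / key.pem are placeholders generated elsewhere.
import gradio as gr

def echo(msg):
    return msg

demo = gr.Interface(fn=echo, inputs="text", outputs="text")
demo.launch(
    server_name="0.0.0.0",    # listen on the LAN, not just localhost
    ssl_certfile="cert.pem",  # placeholder: your self-signed certificate
    ssl_keyfile="key.pem",    # placeholder: the matching private key
    ssl_verify=False,         # don't reject the self-signed cert locally
)
```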
Anyone have shortcuts for modeling complex over-center, compliant mechanisms and bistable auxetic materials? I’m using FreeCAD and trying to use rough sketches and trial and error to create a bending tube structure that 3D prints vertically but then bends into place like a kid’s pop-tube toy, but only on one side of an otherwise vase-mode print design. I’m really pushing the limits of what FreeCAD can loft before edges go wonky and the Part Design workflow is no longer sufficient. I can make a compliant spring easily, but a bistable bend in a tube is at the edge of my learning curve.
What would production ready look like in an ideal situation?
llama.cpp is at the core of almost all offline, open-weights models. The server it creates is OpenAI API compatible. Oobabooga Textgen WebUI is more GUI oriented but is based on llama.cpp. Oobabooga has the setup for loading models with a split workload between the CPU and GPU, which makes larger GGUF quantized models possible to run; llama.cpp has the feature, Oobabooga implements it. The model loading settings and softmax sampling settings take some trial and error to dial in well. It helps to have a way of monitoring GPU memory usage in real time; I use a script that puts GPU memory usage in my terminal window title bar up until inference time.
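Because the server speaks the OpenAI API, pointing any OpenAI client at it looks roughly like this; the port and model name are placeholders for however you launched the server:

```python
# Talk to a local llama.cpp (or Oobabooga) server over its OpenAI-compatible API.
# Base URL, port, and model name are placeholders; no real API key is needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used-locally")

resp = client.chat.completions.create(
    model="local",  # the local server generally ignores this; placeholder
    messages=[{"role": "user", "content": "Give me one sentence about GGUF."}],
)
print(resp.choices[0].message.content)
```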
Ollama is another common project people use for offline open-weights models, and it also runs on top of llama.cpp. It is a lot easier to get started with in some instances, and several projects use Ollama as the baseline for “Hello World!” type stuff. It has pretty good default model loading and softmax settings without any fuss, but it does this at the expense of only running on the GPU or the CPU, never both in a split workload. That may seem fine at first, but if you never experience running much larger quantized models in the 30B-140B range, you are unlikely to have success or a positive experience overall. The much smaller models in the 4B-14B range are all that are likely to run fast enough on your hardware AND completely load in your GPU memory if you only have 8-24 GB. Most of the newer models are actually Mixture of Experts architectures, which means it is like loading ~7 models initially but only running inference on two of them at any one time. All you need is the system memory, or the DeepSpeed package (which uses the disk for the excess space required), to load these larger models. Larger quantized models are much, much smarter and more capable. You also need llama.cpp if you want to use function calling for agentic behaviors; look into the agentic API and the pull history in this area of llama.cpp before selecting which models to test in depth.
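For comparison, the Ollama “Hello World!” really is about this small, which is part of why so many projects start there. The model name is just an example and has to be pulled first with the ollama CLI:

```python
# Minimal Ollama chat call via its Python client. The model name is an example
# and must already be pulled (e.g. `ollama pull llama3.1:8b`).
import ollama

resp = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(resp["message"]["content"])
```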
Hugging Face is the go-to website for sharing and sourcing models. It is heavily tied in with GitHub, so it is probably just as toxic long term, but I do not know of a real FOSS alternative for it. Hosting models is massive I/O for a server.