quetzalcoatl42
Hey everyone,
I'm working on a "learning by doing" local AI project and could use some collective brainstorming.
I extracted all the teks and guides from the DMT-Nexus Wiki with a Python script to train a local LLM as an expert on botanical and chemical extraction methods. My rig runs Ubuntu 24.04 with an RTX 3090 (24 GB VRAM).
I've run the training a couple of times already, but the resulting model is still far from usable.
The key factor seems to be my .jsonl training material, which I have attached to this post. I built this dataset using a Synthetic Data Generation (SDG) pipeline.
A quick note on the SDG approach:
The raw wiki texts are heavily polluted with forum slang, street jargon, and inconsistent formatting. Feeding that kind of raw data directly into a fine-tuning framework like Axolotl or Unsloth is the classic "garbage in, garbage out" trap. A now-common intermediate step is Synthetic Data Generation: I used a local "cleaner" model as a filter to read the dirty text, strip out the slang, rewrite it in formal scientific English, and output structured Q&A pairs.
The Issue:
Although this intermediate step produced data that looks structurally fine at first glance, the target model isn't learning the actual chemistry mechanics. I suspect the SDG process fundamentally damaged the training material in one of the following ways:
- Over-Sanitization: The cleaner model removed the slang, but it also aggressively dropped critical "tek" variables (exact pH levels, solvent ratios, precise temperatures, and wait times).
- Prompt Homogeneity: The generated questions are nearly identically phrased, causing the fine-tuned model to overfit on the phrasing rather than learning the concepts.
- Formatting: The ChatML structure (system/user/assistant roles) might contain hidden syntax errors, e.g. malformed JSON lines, empty fields, or conversations that don't end on an assistant turn.
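The last two failure modes above are at least easy to measure mechanically. As a minimal sketch (assuming the common `{"messages": [{"role": ..., "content": ...}, ...]}` ChatML-style layout — adjust the key names if your .jsonl differs), something like this flags broken records and, as a crude homogeneity signal, counts how often the same question opener repeats:

```python
import json
from collections import Counter

def validate_chatml_jsonl(path):
    """Scan a ChatML-style .jsonl file for structural problems and
    report the most common question openers (a homogeneity signal)."""
    problems = []
    openers = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {n}: invalid JSON ({e})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                problems.append(f"line {n}: missing 'messages' list")
                continue
            roles = [m.get("role") for m in messages]
            if any(r not in ("system", "user", "assistant") for r in roles):
                problems.append(f"line {n}: unknown role in {roles}")
            if roles[-1] != "assistant":
                problems.append(f"line {n}: does not end on an assistant turn")
            if any(not m.get("content", "").strip() for m in messages):
                problems.append(f"line {n}: empty content field")
            user_turns = [m.get("content", "") for m in messages
                          if m.get("role") == "user"]
            if user_turns:
                # First four words of the first user turn, lowercased,
                # so near-identical question templates cluster together.
                openers.append(" ".join(user_turns[0].split()[:4]).lower())
    return problems, Counter(openers).most_common(5)
```

If one opener accounts for most of the dataset, that supports the prompt-homogeneity suspicion; the `problems` list catches the formatting issues directly. This doesn't test for over-sanitization — that needs a diff against the source texts.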
Please take a look at the data and help me adjust the training material so the fine-tuning actually yields a usable model.
Thanks in advance!