
WIKI AI Training - Help with LLM Fine-Tuning (Unsloth): Suboptimal results after Synthetic Data Generation (SDG)

quetzalcoatl42

Hey everyone,

I'm working on a "learning by doing" local AI project and could use some collective brainstorming.

I extracted all the teks and guides from the DMT-Nexus Wiki with a Python script to train a local LLM as an expert on botanical and chemical extraction methods. My rig runs Ubuntu 24.04 with an RTX 3090 (24GB VRAM).

I've run the training a couple of times already, but the results from the final model are still highly suboptimal.

The key factor seems to be my .jsonl training material, which I have attached to this post. I built this dataset using a Synthetic Data Generation (SDG) pipeline.

A quick note on the SDG approach:
The raw wiki texts are heavily polluted with forum slang, street jargon, and inconsistent formatting. Feeding that kind of raw data directly into a fine-tuning framework like Axolotl or Unsloth is the classic "garbage in, garbage out" trap. A widely used intermediate step is Synthetic Data Generation: I used a local cleaner model as a filter to read the dirty text, strip out the slang, translate it into formal scientific English, and output structured Q&A pairs.

The Issue:
Although this intermediate step generated data that looks structurally okay at first glance, the target model isn't learning the actual chemistry mechanics. I suspect the SDG process corrupted the training material in one of the following ways:
  1. Over-Sanitization: The cleaner model successfully removed the slang but aggressively dropped critical "tek" variables (exact pH levels, solvent ratios, precise temperatures, or wait times).
  2. Prompt Homogeneity: The generated questions are structured too similarly, causing the Unsloth model to overfit on the phrasing rather than learning the concepts.
  3. Formatting: The ChatML structure (System/User/Assistant roles) might have hidden syntax flaws.
I have attached a sample of the generated JSONL training data.
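
A rough way to test the first two suspicions before retraining is sketched below. This is only a minimal sketch: the helper names `lost_numbers` and `opening_counts` are hypothetical, the regex is a crude approximation of "a number, ratio or decimal", and the sample strings are made up for illustration rather than taken from the dataset.

```python
import re
from collections import Counter

# Crude pattern for numeric tokens such as "12.5", "50" or "3:1" (an assumption,
# not a full unit parser).
NUMBER_RE = re.compile(r"\d+(?:[.,:]\d+)*")


def lost_numbers(source_text, generated_text):
    """Return numeric tokens that occur in the source but not in the generated answer."""
    source_numbers = set(NUMBER_RE.findall(source_text))
    generated_numbers = set(NUMBER_RE.findall(generated_text))
    return sorted(source_numbers - generated_numbers)


def opening_counts(questions, n_words=3):
    """Count how often each lowercased n-word opening occurs across the questions."""
    openings = (" ".join(q.lower().split()[:n_words]) for q in questions)
    return Counter(openings)


# Illustrative (made-up) inputs.
dropped = lost_numbers(
    "Adjust to pH 12.5, then add 50 ml in a 3:1 ratio.",
    "Adjust the pH, then add 50 ml.",
)
openings = opening_counts([
    "What is the process for step A?",
    "What is the process for step B?",
    "How long should the mixture rest?",
])
```

If `lost_numbers` keeps returning non-empty lists when a wiki passage is compared with its generated answer, that points to over-sanitization. If one or two openings dominate `opening_counts`, that points to prompt homogeneity.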

Please take a look at the data and help me adjust the training material so the fine-tuning actually yields a usable model.

Thanks in advance!
 

I have attached a sample of the generated JSONL training data.
At a quick glance I've found malformed entries, e.g. line 161:
Code:
{"messages": [{"role": "system", "content": "You are a highly skilled scientific assistant specialized in organic chemistry and botanical extractions. Always provide precise, step-by-step instructions using formal terminology and safety warnings where appropriate."}, {"role": "user", "content": "[Context Tags: DMT]\n\n----------------------------\n\n{{Handbook Transclusion Header}}\n\n{{Procedure|Purification of Spice Fumarate ProductFASA Alteration of Final Purification|\n{{ShowInfo/In Article|Image:Information.png|NOTE|\nThis procedure is generally performed to remove excess fumaric acid based on its low solubility in water and to remove residual solvent by dispersing it and allowing it to evaporate more easily and completely. The resulting product is considered to be the most appropriately purified form for oral administration.\n|0px\n|\n}}\n# Add a small amount of water in slight excess of completely submerging the product.\n# Stir mixture to dissolve the active fumarate component.\n# Decant solution off, leaving solids behind.\n# Repeat process until no more color change is observed in the solution, but perform once more with a minimal amount of water for good measure.\n#* The solids remaining should be mainly composed of the excess fumaric acid from the starting material.\n# Evaporate the solution with or without heat and airflow to achieve purified crystalline fumarate product.\n#* Reduction of the solution by heat and airflow is generally always beneficial, but slow evaporation in the later stages may result in a more uniform and attractive crystalline product.\n}"}, {"role": "assistant", "content": "[{"}]}

That entry trains the model to expect a user prompt full of raw wiki-style markup and to reply with essentially nothing (the assistant content is just "[{"). There are more like that; your sanitization step seems to be flawed. I'd look into that before anything else.
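
Something like the following could flag entries of that kind across the whole file. It is a minimal sketch under a few assumptions: each line is a single-turn record with a `messages` list as in the sample above, the markup regex and the `MIN_ASSISTANT_CHARS` threshold are arbitrary heuristics, and `check_record` / `scan_file` are hypothetical helper names.

```python
import json
import re

# Heuristics (assumptions): leftover MediaWiki markup such as "{{", "}}" or
# list lines starting with "# " / "#* ", and a minimum useful answer length.
WIKI_MARKUP_RE = re.compile(r"\{\{|\}\}|^#\*?\s", re.MULTILINE)
MIN_ASSISTANT_CHARS = 20


def check_record(line):
    """Return a list of problems found in one JSONL line (empty list means it passed)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list):
        return ["missing messages list"]

    # Single-turn assumption: if a role repeats, the last message for it wins.
    by_role = {}
    for message in messages:
        if isinstance(message, dict):
            by_role[message.get("role")] = str(message.get("content", ""))

    problems = []
    if "user" not in by_role:
        problems.append("no user turn")
    elif WIKI_MARKUP_RE.search(by_role["user"]):
        problems.append("wiki markup in user turn")
    if "assistant" not in by_role:
        problems.append("no assistant turn")
    elif len(by_role["assistant"].strip()) < MIN_ASSISTANT_CHARS:
        problems.append("assistant reply too short")
    return problems


def scan_file(path):
    """Yield (line_number, problems) for every failing line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            problems = check_record(line)
            if problems:
                yield number, problems
```

Running `for number, problems in scan_file("train.jsonl"): print(number, problems)` (with the real file name in place of the placeholder) should list line 161 along with any similar entries. Note that an entry like the one above parses as valid JSON, so a plain JSON syntax check alone would not catch it.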