
WIKI AI Training - Help with LLM Fine-Tuning (Unsloth): Suboptimal results after Synthetic Data Generation (SDG)

quetzalcoatl42

Hey everyone,

I'm working on a "learning by doing" local AI project and could use some collective brainstorming.

I extracted all the teks and guides from the DMT-Nexus Wiki with a Python script to train a local LLM as an expert on botanical and chemical extraction methods. My rig runs Ubuntu 24.04 with an RTX 3090 (24GB VRAM).

I've run the training a couple of times already, but the results from the final model are still highly suboptimal.

The key factor seems to be my .jsonl training material, which I have attached to this post. I built this dataset using a Synthetic Data Generation (SDG) pipeline.

A quick note on the SDG approach:
The raw wiki texts are heavily polluted with forum slang, street jargon, and inconsistent formatting. You should never feed that kind of raw data directly into a fine-tuning framework like Axolotl or Unsloth; it's the classic "garbage in, garbage out" trap. A widely used intermediate step is Synthetic Data Generation. I used a local cleaner model as a filter to read the dirty text, strip out the slang, translate it into formal scientific English, and output structured Q&A pairs.

The Issue:
Although this intermediate step generated data that looks structurally okay at first glance, the target model isn't learning the actual chemistry mechanics. I suspect the SDG process introduced a fundamental flaw into the training material in one of the following ways:
  1. Over-Sanitization: The cleaner model successfully removed the slang but aggressively dropped critical "tek" variables (exact pH levels, solvent ratios, precise temperatures, or wait times).
  2. Prompt Homogeneity: The generated questions are too identically structured, causing the Unsloth model to overfit on the phrasing rather than learning the concepts.
  3. Formatting: The ChatML structure (System/User/Assistant roles) might have hidden syntax flaws.
I have attached a sample of the generated JSONL training data.
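The first two suspicions can be checked mechanically before retraining. Below is a minimal sketch, not the pipeline actually used here: `dropped_numbers` and `opening_share` are hypothetical helper names, and the first one assumes you still have the raw source passage paired with each generated answer. It only compares bare numbers, so units and ranges are not validated.

```python
import re
from collections import Counter

# Matches integers and decimals such as 60 or 12.5 (units are ignored).
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def dropped_numbers(source_text, generated_answer):
    """Return numbers that appear in the source passage but not in the answer."""
    kept = set(NUMBER.findall(generated_answer))
    return sorted(set(NUMBER.findall(source_text)) - kept)


def opening_share(questions, n_words=3):
    """Return the most common n-word question opening and its share of all questions."""
    openings = Counter(" ".join(q.lower().split()[:n_words]) for q in questions)
    if not openings:
        return None, 0.0
    opening, count = openings.most_common(1)[0]
    return opening, count / len(questions)
```

A high share for a single opening would support the homogeneity suspicion, and a long list of dropped numbers would support the over-sanitization one. Neither check proves anything on its own; they only tell you where to look.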

Please take a look at the data and help me adjust the training material so the fine-tuning actually yields a usable model.

Thanks in advance!
 


At a quick glance I've found malformed entries, e.g. line 161:
Code:
{"messages": [{"role": "system", "content": "You are a highly skilled scientific assistant specialized in organic chemistry and botanical extractions. Always provide precise, step-by-step instructions using formal terminology and safety warnings where appropriate."}, {"role": "user", "content": "[Context Tags: DMT]\n\n----------------------------\n\n{{Handbook Transclusion Header}}\n\n{{Procedure|Purification of Spice Fumarate ProductFASA Alteration of Final Purification|\n{{ShowInfo/In Article|Image:Information.png|NOTE|\nThis procedure is generally performed to remove excess fumaric acid based on its low solubility in water and to remove residual solvent by dispersing it and allowing it to evaporate more easily and completely. The resulting product is considered to be the most appropriately purified form for oral administration.\n|0px\n|\n}}\n# Add a small amount of water in slight excess of completely submerging the product.\n# Stir mixture to dissolve the active fumarate component.\n# Decant solution off, leaving solids behind.\n# Repeat process until no more color change is observed in the solution, but perform once more with a minimal amount of water for good measure.\n#* The solids remaining should be mainly composed of the excess fumaric acid from the starting material.\n# Evaporate the solution with or without heat and airflow to achieve purified crystalline fumarate product.\n#* Reduction of the solution by heat and airflow is generally always beneficial, but slow evaporation in the later stages may result in a more uniform and attractive crystalline product.\n}"}, {"role": "assistant", "content": "[{"}]}

That trains the model on a user prompt full of Wiki-formatted text and an assistant reply that is effectively empty (just "[{"). There are more entries like that, so your sanitization step seems to be flawed. I'd look into that before anything else.
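Entries like this one parse as valid JSON, so they have to be caught by content checks. Here is a rough sketch of such a scan. The markup patterns and the 20-character minimum reply length are assumptions to tune, and `check_entry` and `scan_jsonl` are made-up names, not part of Unsloth or any other library.

```python
import json
import re

# Heuristic signs of leftover MediaWiki markup (assumed, not exhaustive).
WIKI_MARKUP = re.compile(r"\{\{|\}\}|\[\[|\]\]|^#\*? ", re.MULTILINE)
MIN_REPLY_CHARS = 20  # assumed threshold for a usable assistant answer


def check_entry(entry):
    """Return a list of problems found in one {"messages": [...]} entry."""
    if not isinstance(entry, dict):
        return ["entry is not a JSON object"]
    problems = []
    messages = entry.get("messages", [])
    roles = [m.get("role") for m in messages]
    if "user" not in roles or "assistant" not in roles:
        problems.append("missing user or assistant turn")
    for m in messages:
        role = m.get("role")
        content = m.get("content") or ""
        if role == "assistant" and len(content.strip()) < MIN_REPLY_CHARS:
            problems.append("assistant reply empty or truncated")
        if role in ("user", "assistant") and WIKI_MARKUP.search(content):
            problems.append(f"wiki markup in {role} turn")
    return problems


def scan_jsonl(lines):
    """Yield (line_number, problems) for every line that looks suspicious."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            yield number, [f"invalid JSON: {exc.msg}"]
            continue
        problems = check_entry(entry)
        if problems:
            yield number, problems
```

To use it, open the dataset file, pass the file object to `scan_jsonl`, and print each flagged line number with its problems. The entry quoted above would be flagged twice: once for the wiki markup in the user turn and once for the truncated assistant reply.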
 
It's not sanitized for sure (it contains MediaWiki tags), but it's also not malformed.
If it were malformed, Unsloth would have complained during training.
What I need is help reading the questions and answers and making sure they make sense, so the resulting Qwen/Llama model doesn't hallucinate. I realize that's a lot to ask, but generally, compiling sensible training data is the hardest part.
 
@blig-blug
I think you are probably right, and the actual problem lies more in the initial MediaWiki scraper and all the band-aid Python sanitization scripts I layered on afterwards. The best practice would probably be to start from the beginning and code a decent scraper that really makes sure {ALL tags} are excluded. ^Thx for the input
@dreamer042
If I remember correctly, around 2013 there was a huge catalogue of questions for potential nexians. Is there any chance I could get that data?
 
It's not sanitized for sure (it contains MediaWiki tags), but it's also not malformed.
I mean, just look at the example I provided. Would you really want the assistant's reply to be effectively empty (a stray "[{")? And expect the user prompt to contain Wiki tags?
And it's not the only case in the example you provided, so there are likely many more in the training dataset.

If it were malformed, Unsloth would have complained during training.
I don't mean malformed in the sense of syntactically incorrect JSON; that would indeed have been detected.

I think you are probably right, and the actual problem lies more in the initial MediaWiki scraper and all the band-aid Python sanitization scripts I layered on afterwards. The best practice would probably be to start from the beginning and code a decent scraper that really makes sure {ALL tags} are excluded. ^Thx for the input
Yes, and also make sure the assistant's replies aren't empty or otherwise malformed. I'm not sure how you are generating those, so the problem may go away once tags are properly removed.

What I need is help reading the questions and answers and making sure they make sense
I think it would make more sense (and be more doable) to randomly sample the dataset for that instead of manually checking all questions and answers.
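Something like the following would pull a reproducible random sample for manual review. It's only a sketch: `sample_entries` is a hypothetical helper, and the default sample size of 50 and the fixed seed are arbitrary choices.

```python
import json
import random


def sample_entries(lines, k=50, seed=0):
    """Return up to k randomly chosen (line_number, entry) pairs, in file order."""
    rng = random.Random(seed)  # fixed seed so reviewers see the same sample
    entries = [
        (number, json.loads(line))
        for number, line in enumerate(lines, start=1)
        if line.strip()
    ]
    chosen = rng.sample(entries, min(k, len(entries)))
    return sorted(chosen, key=lambda pair: pair[0])
```

Keeping the original line numbers makes it easy to report a bad entry back, the same way line 161 was reported above.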

For the conversion, maybe take a look at Pandoc; it supports MediaWiki as an input format (see the Pandoc index page), so you could convert, for example, MediaWiki to Markdown. There are also some third-party Pandoc readers for BBCode, and writing or modifying one is quite easy: it's just a Lua script implementing a given set of functions that are called on the syntax tree of the document being converted.
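As a rough illustration, the conversion could be driven from Python by calling the pandoc command-line tool. This sketch assumes pandoc is installed and on the PATH; `pandoc_command` and `mediawiki_to_text` are made-up helper names. I haven't verified how well it copes with this wiki's custom templates such as {{Procedure|...}}, so those may still need separate handling.

```python
import shutil
import subprocess


def pandoc_command(to_format="markdown"):
    """Build the pandoc invocation for MediaWiki input read from stdin."""
    return ["pandoc", "--from", "mediawiki", "--to", to_format]


def mediawiki_to_text(wikitext, to_format="markdown"):
    """Convert MediaWiki markup to another format by piping it through pandoc."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    result = subprocess.run(
        pandoc_command(to_format),
        input=wikitext,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc reports a failure
    )
    return result.stdout
```

Passing "plain" as `to_format` gives plain text instead of Markdown, which may be closer to what the cleaner model should see.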

Good luck with the project and keep us posted :)
 