WIKI AI Training - Help with LLM Fine-Tuning (Unsloth): Suboptimal results after Synthetic Data Generation (SDG)

quetzalcoatl42 · Mar 29, 2026

Hey everyone,

I'm working on a "learning by doing" local AI project and could use some collective brainstorming.

I extracted all the teks and guides from the DMT-Nexus Wiki with a Python script to train a local LLM as an expert on botanical and chemical extraction methods. My rig runs Ubuntu 24.04 with an RTX 3090 (24GB VRAM).

I've run the training a couple of times already, but the results from the final model are still highly suboptimal.

The key factor seems to be my .jsonl training material, which I have attached to this post. I built this dataset using a Synthetic Data Generation (SDG) pipeline.

A quick note on the SDG approach:
The raw wiki texts are heavily polluted with forum slang, street jargon, and inconsistent formatting. You should never feed that kind of raw data directly into a fine-tuning framework like Axolotl or Unsloth; it's the classic "garbage in, garbage out" trap. A widely used intermediate step is Synthetic Data Generation. I used a local cleaner model as a filter to read the dirty text, strip out the slang, translate it into formal scientific English, and output structured Q&A pairs.

The Issue:
Although this intermediate step generated data that looks structurally okay at first glance, the target model isn't learning the underlying chemistry. I suspect the SDG process damaged the training material in one of the following ways:

Over-Sanitization: The cleaner model successfully removed the slang but aggressively dropped critical "tek" variables (exact pH levels, solvent ratios, precise temperatures, or wait times).
Prompt Homogeneity: The generated questions are too identically structured, causing the Unsloth model to overfit on the phrasing rather than learning the concepts.
Formatting: The ChatML structure (System/User/Assistant roles) might have hidden syntax flaws.
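The first two suspicions can be measured rather than guessed at. Below is a minimal sketch, assuming you still have the raw source passage for each generated Q&A pair. The helper names `numeric_retention` and `prefix_share` are hypothetical and not part of Unsloth or any other library. The first function reports what fraction of the numbers in the source text survived into the answer, and the second reports how dominant the most common question opening is.

```python
import re
from collections import Counter

# Matches integers and decimals such as "10" or "25.5".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def numeric_retention(source_text: str, answer_text: str) -> float:
    """Fraction of distinct numbers in the source that also appear in the answer.

    A low value hints that the cleaner model dropped quantities
    (ratios, temperatures, times) while rewriting the passage.
    """
    source_numbers = set(NUMBER_RE.findall(source_text))
    if not source_numbers:
        return 1.0  # nothing to lose
    answer_numbers = set(NUMBER_RE.findall(answer_text))
    return len(source_numbers & answer_numbers) / len(source_numbers)


def prefix_share(questions: list[str], n_words: int = 3) -> tuple[str, float]:
    """Most common opening n-word prefix and the share of questions that use it."""
    prefixes = Counter(
        " ".join(q.lower().split()[:n_words]) for q in questions if q.strip()
    )
    if not prefixes:
        return "", 0.0
    prefix, count = prefixes.most_common(1)[0]
    return prefix, count / sum(prefixes.values())
```

If a large share of the questions begin with the same few words, the generation prompt probably needs more varied question templates. Pairs with low numeric retention are good candidates for manual review.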

I have attached a sample of the generated JSONL training data.

Please take a look at the data and help me adjust the training material so the fine-tuning actually yields a usable model.

Thanks in advance!

blig-blug · Mar 29, 2026

quetzalcoatl42 said:
I have attached a sample of the generated JSONL training data.

On a quick glance I've found malformed entries, e.g. line 161:

Code:

{"messages": [{"role": "system", "content": "You are a highly skilled scientific assistant specialized in organic chemistry and botanical extractions. Always provide precise, step-by-step instructions using formal terminology and safety warnings where appropriate."}, {"role": "user", "content": "[Context Tags: DMT]\n\n----------------------------\n\n{{Handbook Transclusion Header}}\n\n{{Procedure|Purification of Spice Fumarate ProductFASA Alteration of Final Purification|\n{{ShowInfo/In Article|Image:Information.png|NOTE|\nThis procedure is generally performed to remove excess fumaric acid based on its low solubility in water and to remove residual solvent by dispersing it and allowing it to evaporate more easily and completely. The resulting product is considered to be the most appropriately purified form for oral administration.\n|0px\n|\n}}\n# Add a small amount of water in slight excess of completely submerging the product.\n# Stir mixture to dissolve the active fumarate component.\n# Decant solution off, leaving solids behind.\n# Repeat process until no more color change is observed in the solution, but perform once more with a minimal amount of water for good measure.\n#* The solids remaining should be mainly composed of the excess fumaric acid from the starting material.\n# Evaporate the solution with or without heat and airflow to achieve purified crystalline fumarate product.\n#* Reduction of the solution by heat and airflow is generally always beneficial, but slow evaporation in the later stages may result in a more uniform and attractive crystalline product.\n}"}, {"role": "assistant", "content": "[{"}]}

That's training for the user to prompt with a bunch of Wiki-style formatted text and the model to reply nothing. There are more like that; your sanitization step seems to be flawed. I'd look into that before anything else.
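Entries like this can be flagged automatically before training. The following is a minimal sketch, assuming the `{"messages": [{"role": ..., "content": ...}]}` layout shown in the example above. The markup regex and the `MIN_REPLY_CHARS` threshold are illustrative guesses, and `check_record` and `scan_lines` are hypothetical helper names. It reports records whose assistant reply is empty or very short (such as the two-character reply above) and records whose user or assistant text still contains MediaWiki markup.

```python
import json
import re
from typing import Iterable, Iterator

# Heuristic markers for leftover MediaWiki syntax: templates, links,
# and "# " / "#* " list items at the start of a line.
WIKI_MARKUP_RE = re.compile(r"\{\{|\}\}|\[\[|\]\]|^#\*? ", re.MULTILINE)
MIN_REPLY_CHARS = 20  # assumed threshold; tune it for your data


def check_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = looks fine)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing 'messages' list"]

    problems = []
    for msg in messages:
        if not isinstance(msg, dict):
            problems.append("message is not an object")
            continue
        role, content = msg.get("role"), msg.get("content")
        if not isinstance(content, str):
            problems.append(f"{role}: content is not a string")
            continue
        if role == "assistant" and len(content.strip()) < MIN_REPLY_CHARS:
            problems.append("assistant: reply is empty or too short")
        if role in ("user", "assistant") and WIKI_MARKUP_RE.search(content):
            problems.append(f"{role}: contains wiki markup")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


def scan_lines(lines: Iterable[str]) -> Iterator[tuple[int, list[str]]]:
    """Yield (line_number, problems) for every non-blank line that has problems."""
    for line_no, line in enumerate(lines, start=1):
        if line.strip():
            problems = check_record(line)
            if problems:
                yield line_no, problems
```

To run it over a file, open the JSONL with `open(path, encoding="utf-8")` and pass the file object to `scan_lines`. The reported line numbers then match the lines of the file.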

quetzalcoatl42 · Mar 31, 2026

blig-blug said:
On a quick glance I've found malformed entries, e.g. line 161:
That's training for the user to prompt with a bunch of Wiki-style formatted text and the model to reply nothing. There are more like that; your sanitization step seems to be flawed. I'd look into that before anything else.

It's not sanitized for sure (it contains MediaWiki tags), but it's also not malformed.
If it were malformed, Unsloth would have complained during training.
What I need is help reading the questions and answers and making sure they make sense, so the resulting Qwen/Llama model doesn't hallucinate. I realize that's a lot to ask, but compiling sensible training data is generally the hardest part.

quetzalcoatl42 · Mar 31, 2026

@blig-blug
I think you are probably right, and the actual problem lies more with the initial MediaWiki scraper and all the band-aid Python sanitization scripts I layered on top afterwards. The best practice would probably be to start from the beginning and code a decent scraper that really makes sure {ALL tags} are excluded. Thanks for the input.
@dreamer042
If I remember correctly, around 2013 there was a huge catalogue of questions for potential Nexians. Is there any chance I could get that data?

dreamer042 · Mar 31, 2026

I don't have a nice clean list of the questions anymore, but you can always compile them from the answer section on the old forum: Welcome to the DMT-Nexus

I'm very interested in this project. Please keep us updated on your progress.

blig-blug · Mar 31, 2026

quetzalcoatl42 said:
It's not sanitized for sure (it contains MediaWiki tags), but it's also not malformed.

I mean, just look at the example I provided. Would you really want the assistant's reply to be an empty string? And expect the user prompt to contain Wiki tags?
And it's not the only case in the example you provided, so there are likely many more in the training dataset.

quetzalcoatl42 said:
If it were malformed, unsloth would have complained during training.

I don't mean malformed in the sense of syntactically incorrect JSON, that would indeed have been detected.

quetzalcoatl42 said:
I think you are probably right, and the actual problem lies more with the initial MediaWiki scraper and all the band-aid Python sanitization scripts I layered on top afterwards. The best practice would probably be to start from the beginning and code a decent scraper that really makes sure {ALL tags} are excluded. Thanks for the input.

Yes, and also make sure the assistant's replies aren't set to an empty string, or other malformed replies. I'm not sure how you are generating those, so the problem may go away once tags are properly removed.

quetzalcoatl42 said:
What I need is help reading the questions and answers and making sure they make sense

I think it would make more sense (and be more doable) to randomly sample the dataset for that instead of manually checking all questions and answers.
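A reproducible random sample is easy to draw. Here is a minimal sketch using only the standard library. The function name `sample_records` and the defaults `k=50` and `seed=0` are arbitrary choices for illustration. Fixing the seed means everyone reviewing the data looks at the same records, and keeping the line numbers makes it easy to report a bad entry.

```python
import json
import random
from typing import Iterable


def sample_records(lines: Iterable[str], k: int = 50, seed: int = 0) -> list[tuple[int, dict]]:
    """Return up to k randomly chosen (line_number, record) pairs, sorted by line number."""
    records = [
        (line_no, json.loads(line))
        for line_no, line in enumerate(lines, start=1)
        if line.strip()  # skip blank lines
    ]
    rng = random.Random(seed)  # local generator, so the global RNG state is untouched
    chosen = rng.sample(records, min(k, len(records)))
    return sorted(chosen, key=lambda pair: pair[0])
```

For example, `with open("train.jsonl", encoding="utf-8") as fh: review = sample_records(fh, k=50)`, where `train.jsonl` is a placeholder file name. Note that `json.loads` raises on a syntactically broken line, so run this after a basic validity check.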

For the conversion, maybe take a look at Pandoc (Pandoc - index); it supports MediaWiki as an input format. So you could convert, for example, MediaWiki to Markdown. There are also some third-party Pandoc readers for BBCode, and writing or modifying one is quite easy: a reader is just a Lua script implementing a given set of functions that are called on the syntax tree of the document being converted.
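The Pandoc route can be driven from the existing Python scripts. This is a minimal sketch, assuming the `pandoc` executable is installed and on the PATH. `build_pandoc_cmd` and `convert_markup` are hypothetical helper names, while `mediawiki` and `markdown` are Pandoc's own format names. Site-specific templates may still need separate handling after conversion.

```python
import shutil
import subprocess


def build_pandoc_cmd(source_format: str = "mediawiki", target_format: str = "markdown") -> list[str]:
    """Command line that reads markup from stdin and writes the converted text to stdout."""
    return ["pandoc", "--from", source_format, "--to", target_format]


def convert_markup(text: str, source_format: str = "mediawiki", target_format: str = "markdown") -> str:
    """Convert markup by piping it through the pandoc executable."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        build_pandoc_cmd(source_format, target_format),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc exits with an error
    )
    return result.stdout
```

Passing `target_format="plain"` instead would strip the formatting entirely, which may be closer to what a cleaner model should receive.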

Good luck with the project and keep us posted

quetzalcoatl42 · May 20, 2026

I wanted to provide a quick update on the project. It certainly hasn't been forgotten, but we are facing a major bottleneck: AI is only as good as its training data.

I’ve spent numerous late nights refactoring and optimizing my Wiki scraper scripts. However, despite these efforts, the quality of the extracted data is still substandard and not yet viable for production.

Realistically, if we want to pursue this path and get the data to where it needs to be, it is a bigger job than one person can handle. I will need additional support and collaboration to overcome these data-quality hurdles.

blig-blug · May 21, 2026

quetzalcoatl42 said:
the quality of the extracted data is still substandard

In what way?

Also, are you attempting a finetune or training from scratch?

TransistorBass · May 21, 2026

What do we need AI for again?

The more it is pushed on us the less I use it.

blig-blug · May 21, 2026

TransistorBass said:
What do we need AI for again?

The more it is pushed on us the less I use it.

I understand that sentiment, but not all so-called "AI" is the same. This thread is about training or finetuning a local model that you can use and own yourself, not a service. If you are not interested that's fine, but then you can just ignore it. Being dismissive about it is unnecessary.

TransistorBass · May 21, 2026

Fair enough.
My sentiment still stands, but if it's not some energy-hungry data center killing our planet, I'm much happier about it. And I've always gotten better medical answers from web searches with a hint of AI than from my local doctors.
