it is cool to see supporting leaderboards. 0.53 is not too high but still validates both works. my idea is alignment through 'beneficial knowledge for humans', yet still there is correlation!
Emin Temiz PRO
AI & ML interests
Recent Activity
Organizations
it is cool to see supporting leaderboards. 0.53 is not too high but still validates both works. my idea is alignment through 'beneficial knowledge for humans', yet still there is correlation!
- contemplation on text (for further CPT)
- q&a generation (for GRPO)
after doing GRPO, the successful ones go again with a SFT.
almost doubled my dataset. although the new ones are synthetic, they are from important sources and important matters. focusing on controversial claims more than anything else because these actually move models.
started fine tuning qwen 3.6. using vibe coding to play with LoRA adapters. i made lots of LoRAs for qwen 3.5 and now i can apply them to 3.6 except one tensor type. all of MLP matches to 3.6 and most of attentions match to 3.6. that will save me a lot of time. fine tune of 3.6 will probably appear faster, with a better alignment since the dataset is expanded.
started a truth db project where i will compare all the claims in the world with each other and give them a score. claims will fight with each other, supporting or weakening each other. the result hopefully will be very useful for better fine tuning LLMs. it will also automate my curation processes..
"""
Consider a follow-up to an earlier version of the Nature paper. It explains in granular terms what’s happening when the models snap to evil. It is math all the way down. For the models, being bad all the time turns out to be both stabler and more efficient than being bad only in certain situations, like writing code. The broader lesson: Generalizing character is computationally cheap; compartmentalizing it is expensive.
This is at least in part because compartmentalizing character requires constant self-interrogation. The model must constantly ask itself, “Am I supposed to be bad here? Good? Something in between?” Each of those checkpoints is another chance to get things wrong. This is interesting enough in A.I. Extrapolated to humans, the possibility becomes astonishing. Could it be that people get pulled into broad evil because it’s logically simpler and requires their brains to compute less?
"""
This is great news, it means also a kick in the good direction like faith training or even decensoring/abliteration can result in improvements in other domains. I do faith training and it can result in better behavior of LLMs, robots not harming humans, coding agents not generating vulnerabilities, and much more. Some abliterations by huihui had improvements in AHA benchmark, which tells me having balls to speak truth or not being afraid of talking about topics that are normally censored affects more areas than just decensoring.
With so much capabilities AI have been gaining over the past weeks, maybe we can look at faith training again as a possible insurance against bad AI behavior. What do you think?
etemiz/Ostrich-32B-Qwen3-260303-GGUF
Started fine tuning Qwen 3.5 27B. Soon high density intelligence meets human alignment!
I don't do refusal tests but i may in the future.
somebody should do abliteration leaderboard!
did that
with my own conversion to GGUF: 59%
another GGUF (
https://huggingface.co/llmfan46/Qwen3.5-27B-heretic-v2-GGUF/blob/main/Qwen3.5-27B-heretic-v2-Q4_K_M.gguf ): 60%
the question is, does huihui's version become less intelligent after that big of an abliteration.
@huihui-ai well done !
27B
Huihui abliteration 65%
Heretic abliteration 55%
Normal 50%
35B
Huihui abliteration 64%
@jiaojjjjje abliteration 57%
@LeadFootThrottleCock abliteration 56%
Normal 49%
thank you <3
2026 experimental version
https://aha-leaderboard.shakespeare.wtf/2026
https://huggingface.co/etemiz/Ostrich-32B-Qwen3-260217-GGUF
This model has achieved AHA=67 score.
Current AHA Leaderboard: https://aha-leaderboard.shakespeare.wtf/
Read more about AHA https://huggingface.co/blog/etemiz/aha-leaderboard
More quants are coming.
ORPO or GSPO?
I think ORPO is pretty good and fast but GSPO makes it attack its own opinions, reflecting on itself, correcting itself. Although GSPO is much slower, it may still be pretty effective. And for GSPO you don't have to provide the whole reasoning corpus, you just provide the end result (One word maybe to answer a binary question).
And GSPO may be better than GRPO because it is rewarding 'train of thoughts' whereas GRPO is rewarding single tokens. Alignment is mostly train of thoughts, not a single token like a math answer..
i bet RL can generate humility by accident given enough trials. humility, then the model tool calls for more info and trusts in this new information and reorganizes the reply. this of course involves RAG or another aligned LLM.
Thanks, this is insightful.
I liked the "rewrite the claim in 5 different ways". Can be really useful for RAG scenarios.
I liked the idea of detecting hallucination using another aligned LLM, though i don't know how effective it will be.
"not enough info" is probably the hardest. Most LLMs today are trained to say anything rather than being humble, as you said.
i was doing CPT for a while and got decent results. but what if i want to go for perfection? cover all the areas of misalignment using limited datasets. i have to find a way to multiply the material to successfully combat the material of the rest of the internet.
i want to generate SFT datasets but only on controversial topics, because i have to be efficient with limited resources. first i give a smart LLM a 'ground truth' text. then i give it the following prompts:
- You are a highly skilled academic analyst.
- Analyze this text and find 3 bold claims that could cause controversy and division in public. List the claims and also state why they are debatable. Give numbers to the claims.
- Convert these claims into binary questions (that could be answered by yes/no or this/that).
- Now put these questions in a json format. Please also add the info about which of the answers concur with the original text and the question number.
- Write some supporting arguments for 1st question, with respect to the original text, concurring and confirming the original text.
There must be about 300 words. You should not mention the text, write it as if you are the one answering the question.the result is questions and answers with more words along the same ideas. a few sentences of opinions in the beginning, is expanded to lots of words. using this method i can multiply billions of tokens to tens of billions probably and have a more effective training.
next i should do RL maybe. LLMs seem to have all kinds of ideas already installed, yet they don't have the intuition to know which one is true. they can give you a ton of reasons to support anything. given the proper incentives, LLMs then should evolve towards supporting aligned ideas more. the rewards will be like guidance that will kick an LLM towards better answers.
Thanks for the tips.
Is giving different answers for different lengths a bad "behavior" and related to SFT than CPT?
Also, should I give two sets of queries and answers in the context (one short one long) to make it learn that when the length changes, the answer should be parallel?
This could be RL too, like bad behavior of non integrity can be penalized...
Is it normal practice to do 2 rounds of questions in SFT or RL?
The focus of AA-LCR is to replicate real knowledge work and reasoning tasks, testing capability critical to modern AI applications spanning document analysis, codebase understanding, and complex multi-step workflows.
AA-LCR is 100 hard text-based questions that require reasoning across multiple real-world documents that represent ~100k input tokens. Questions are designed so answers cannot be directly found but must be reasoned from multiple information sources, with human testing verifying that each question requires genuine inference rather than retrieval.
Key takeaways:
➤ Today’s leading models achieve ~70% accuracy: the top three places go to OpenAI o3 (69%), xAI Grok 4 (68%) and Qwen3 235B 2507 Thinking (67%)
➤👀 We also already have gpt-oss results! 120B performs close to o4-mini (high), in-line with OpenAI claims regarding model performance. We will be following up shortly with a Intelligence Index for the models.
➤ 100 hard text-based questions spanning 7 categories of documents (Company Reports, Industry Reports, Government Consultations, Academia, Legal, Marketing Materials and Survey Reports)
➤ ~100k tokens of input per question, requiring models to support a minimum 128K context window to score on this benchmark
➤ ~3M total unique input tokens spanning ~230 documents to run the benchmark (output tokens typically vary by model)
We’re adding AA-LCR to the Artificial Analysis Intelligence Index, and taking the version number to v2.2. Artificial Analysis Intelligence Index v2.2 now includes: MMLU-Pro, GPQA Diamond, AIME 2025, IFBench, LiveCodeBench, SciCode and AA-LCR.
Link to dataset: ArtificialAnalysis/AA-LCR