1. Introduction
Language models (LMs) are fundamentally limited by their fixed, predefined vocabularies. This limitation shows up as poor generalization to novel or out-of-vocabulary (OOV) words and as inefficient handling of diverse token combinations, which restricts flexibility across applications. Although dynamic vocabulary methods have been proposed to improve generation, existing implementations suffer from fragmented codebases, a lack of support for modern large language models (LLMs), and limited inference scalability. DVAGen is introduced as a fully open-source, unified framework designed to overcome these challenges, providing modular tools for training, evaluating, and visualizing dynamic vocabulary-augmented language models in real time.
2. Background & Related Work
Traditional tokenization methods such as Byte-Pair Encoding (BPE) and WordPiece rely on a fixed vocabulary and struggle with domain-specific terms and multi-word phrases. Extensions such as Multi-Word Tokenization (MWT) add frequent n-grams but remain static after training. Retrieval-augmented methods, such as RETRO and the Copy-is-All-You-Need (CoG) framework, incorporate external knowledge but often incur substantial latency. DVAGen builds on this landscape, aiming to provide a standardized and efficient implementation of dynamic vocabulary techniques for modern LLMs.
3. The DVAGen Framework
DVAGen is designed as a modular, extensible framework that simplifies the development of dynamic vocabulary-augmented language models.
3.1 Core Architecture & Modular Design
The framework separates its key components (data processing, model integration, training, inference, and evaluation) into distinct modules. This lets researchers and developers customize or replace individual components, such as the retrieval mechanism or the scoring function, without reworking the whole system. It supports integration with existing open-source LLMs.
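The idea of swapping one component without touching the rest can be illustrated with a small sketch. Everything here is hypothetical: the names `Retriever`, `KeywordRetriever`, `Pipeline`, and `overlap_score` are invented for illustration and are not taken from DVAGen's codebase, whose actual interfaces the source does not describe.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple


class Retriever(Protocol):
    """Hypothetical interface: returns candidate phrases for a context."""

    def retrieve(self, context: str, k: int) -> List[str]: ...


class KeywordRetriever:
    """Toy retriever: ranks phrases by how many words they share with the context."""

    def __init__(self, phrases: List[str]) -> None:
        self.phrases = phrases

    def retrieve(self, context: str, k: int) -> List[str]:
        context_words = set(context.lower().split())
        ranked = sorted(
            self.phrases,
            key=lambda p: len(context_words & set(p.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def overlap_score(context: str, phrase: str) -> float:
    """Fraction of the phrase's words that also appear in the context."""
    words = phrase.lower().split()
    context_words = set(context.lower().split())
    return sum(w in context_words for w in words) / len(words)


@dataclass
class Pipeline:
    """Hypothetical container whose parts can be replaced independently."""

    retriever: Retriever
    scorer: Callable[[str, str], float]

    def candidates(self, context: str, k: int = 2) -> List[Tuple[str, float]]:
        phrases = self.retriever.retrieve(context, k)
        return [(p, self.scorer(context, p)) for p in phrases]


pipeline = Pipeline(
    retriever=KeywordRetriever(
        ["attention mechanism", "vanishing gradient", "legal precedent"]
    ),
    scorer=overlap_score,
)
result = pipeline.candidates("the attention mechanism in transformers", k=2)
```

Because `Pipeline` depends only on the `retrieve` method and a scoring callable, a domain-specific retriever or a different scoring function could be substituted without changing the other parts.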
3.2 Training Pipeline
DVAGen provides a complete training pipeline (`train`) that combines dynamic vocabulary learning objectives with standard language modeling. It is designed to work with a variety of base LLMs, enabling joint optimization of the model's parameters and its ability to select from a set of candidate phrases during generation.
3.3 Inference & Visualization Tools
A key novelty is the provision of both command-line interface (CLI) tools (`chat`, `eval`) and a WebUI for interactive use. The WebUI allows generation results to be inspected in real time, showing which dynamic vocabulary items were retrieved and selected, which provides valuable transparency into the model's decision process.
4. Technical Implementation
4.1 Dynamic Vocabulary Mechanism
At its core, DVAGen implements a retrieval-augmented generation process. During inference, for a given context, the system retrieves a set of candidate phrases $C = \{c_1, c_2, ..., c_k\}$ from a dynamic corpus. Each candidate is scored by its relevance to the context and its probability under the base language model. The final generation probability for a token sequence is a weighted combination of the standard LM distribution and the scores of the dynamic candidates. Formally, the probability of generating the next segment can be expressed as a mixture:
$P(\text{segment} | \text{context}) = \lambda P_{LM}(\text{segment} | \text{context}) + (1-\lambda) \sum_{c \in C} \text{sim}(\text{context}, c) \cdot P_{LM}(c | \text{context})$
where $\lambda$ is a balancing parameter and $\text{sim}(\cdot, \cdot)$ is a relevance scoring function.
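To make the arithmetic of the mixture concrete, here is a minimal sketch that evaluates the expression exactly as written. The function name `mixture_probability` and all numeric values are illustrative assumptions, not outputs of DVAGen. The source does not say how the similarity weights are normalized, so the result is only a proper probability if those weights are scaled appropriately.

```python
from typing import Dict, Tuple


def mixture_probability(
    p_lm_segment: float,
    candidates: Dict[str, Tuple[float, float]],
    lam: float,
) -> float:
    """Evaluate lam * P_LM(segment) + (1 - lam) * sum_c sim(context, c) * P_LM(c).

    `candidates` maps each phrase c to the pair
    (sim(context, c), P_LM(c | context)).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    dynamic_term = sum(sim * p_c for sim, p_c in candidates.values())
    return lam * p_lm_segment + (1.0 - lam) * dynamic_term


# Illustrative numbers only (assumed, not measured).
example_candidates = {
    "neuromorphic computing": (0.5, 0.4),
    "spiking neural networks": (0.25, 0.2),
}
p = mixture_probability(p_lm_segment=0.2, candidates=example_candidates, lam=0.5)
# dynamic term = 0.5*0.4 + 0.25*0.2 = 0.25, so p = 0.5*0.2 + 0.5*0.25 = 0.225
```

Setting `lam` to 1 recovers the plain LM probability, while smaller values shift weight toward the retrieved candidates.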
4.2 Batched Inference Optimization
To address inference latency, DVAGen implements batch processing for the dynamic vocabulary retrieval and scoring steps. By processing multiple input sequences at once, it amortizes the overhead of querying the external knowledge source and computing relevance scores, yielding a substantial improvement in throughput compared with sequential processing.
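The amortization argument can be sketched with a toy phrase store that counts round trips. This is an assumption-laden illustration: `CountingPhraseStore`, `retrieve_sequential`, and `retrieve_batched` are hypothetical names, and using the last word of each context as the lookup key is a stand-in for whatever retrieval DVAGen actually performs.

```python
from typing import Dict, List


class CountingPhraseStore:
    """Toy stand-in for an external phrase store that counts round trips."""

    def __init__(self, index: Dict[str, List[str]]) -> None:
        self.index = index
        self.calls = 0

    def query(self, keywords: List[str]) -> List[List[str]]:
        """Answer any number of keyword lookups in a single call."""
        self.calls += 1
        return [self.index.get(k, []) for k in keywords]


def retrieve_sequential(store: CountingPhraseStore, contexts: List[str]) -> List[List[str]]:
    # One store call per context.
    return [store.query([c.split()[-1]])[0] for c in contexts]


def retrieve_batched(store: CountingPhraseStore, contexts: List[str]) -> List[List[str]]:
    # One store call for the whole batch.
    return store.query([c.split()[-1] for c in contexts])


index = {
    "neuromorphic": ["neuromorphic computing"],
    "gradient": ["vanishing gradient"],
}
contexts = ["advances in neuromorphic", "the problem of gradient", "unrelated text"]

seq_store = CountingPhraseStore(index)
batch_store = CountingPhraseStore(index)
seq_result = retrieve_sequential(seq_store, contexts)
batch_result = retrieve_batched(batch_store, contexts)
```

Both paths return the same candidates, but the sequential path makes three store calls and the batched path makes one. With a real external store, each avoided call saves its fixed per-request overhead.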
5. Experimental Results & Evaluation
The paper validates DVAGen on modern LLMs (beyond GPT-2). Key results show:
- Improved Language Modeling: Reduced perplexity on test sets containing OOV terms and domain-specific jargon, confirming the framework's effectiveness in handling novel vocabulary.
- Higher Inference Throughput: Batch inference support produced a measurable increase in tokens generated per second, reducing overall latency for production scenarios.
- Qualitative Analysis: The WebUI visualization revealed that the model retrieves and incorporates relevant multi-word expressions (for example, technical compound terms such as "attention mechanism" or "vanishing gradient") that a static tokenizer would otherwise split apart.
Chart Description: A hypothetical chart would show "Tokens per Second" on the y-axis, comparing "Standard LM Inference," "DVAGen (Single Sequence)," and "DVAGen (Batch Size=8)" on the x-axis, with the batched version showing a clear throughput gain.
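For readers unfamiliar with the two metrics named above, the following sketch shows their standard definitions. The helper names and the sample inputs are assumptions chosen for easy arithmetic; they are not results from the paper.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput: generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)

# 512 tokens generated in 4 seconds is 128 tokens per second.
throughput = tokens_per_second(512, 4.0)
```

When phrases are emitted as single units, the number of units per text changes, so perplexity comparisons against a static tokenizer are only meaningful if both are normalized over the same unit (for example, per word or per character).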
6. Analysis Framework & Case Study
Case Study: Technical Documentation Generation
Consider a scenario in which an LLM must generate text about a new, fast-moving technology (for example, "Neuromorphic Computing"). A static-vocabulary model might tokenize this as ["Neuro", "morphic", "Comput", "ing"], losing semantic coherence. Using the DVAGen framework:
- Context: The model is prompted with "The benefits of..."
- Retrieval: The dynamic vocabulary module retrieves candidate phrases such as ["neuromorphic computing", "spiking neural networks", "energy-efficient hardware"] from a curated technical corpus.
- Scoring & Integration: The framework scores these candidates. "neuromorphic computing" receives a high relevance score.
- Generation: The model generates "...neuromorphic computing include low power consumption and real-time processing capabilities," using the retrieved phrase as a single coherent unit. The WebUI would highlight this phrase as originating from the dynamic vocabulary.
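The four steps of the case study can be sketched as a toy pipeline that tracks where each span of output came from. The scores, the threshold, and the function names `select_phrase` and `generate` are all hypothetical; the real system scores candidates with a model rather than a fixed table.

```python
from typing import Dict, List, Optional, Tuple


def select_phrase(scores: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring phrase, or None if nothing clears the threshold."""
    if not scores:
        return None
    best = max(scores, key=lambda phrase: scores[phrase])
    return best if scores[best] >= threshold else None


def generate(
    prompt: str, scores: Dict[str, float], continuation: str
) -> List[Tuple[str, str]]:
    """Assemble output as (text, source) spans so a UI could highlight provenance."""
    spans = [(prompt, "prompt")]
    phrase = select_phrase(scores)
    if phrase is not None:
        spans.append((phrase, "dynamic_vocab"))
    spans.append((continuation, "lm"))
    return spans


# Illustrative relevance scores (assumed, not measured).
scores = {
    "neuromorphic computing": 0.9,
    "spiking neural networks": 0.6,
    "energy-efficient hardware": 0.4,
}
spans = generate("The benefits of", scores, "include low power consumption")
text = " ".join(span_text for span_text, _ in spans)
```

Here `text` reads "The benefits of neuromorphic computing include low power consumption", and the `"dynamic_vocab"` tag on the middle span is the kind of provenance information a visualization layer could use for highlighting.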
7. Future Applications & Directions
The DVAGen framework opens up several promising directions:
- Domain-Specific Assistants: Rapidly adapting general-purpose LLMs to fields such as law, medicine, or finance by incorporating dynamic vocabularies of legal precedents, medical ontologies (for example, UMLS), or financial terminology.
- Multilingual & Low-Resource NLP: Incorporating phrases from multiple languages or dialectal variants on the fly to improve performance for underrepresented languages without fully retraining the model.
- Real-Time Knowledge Integration: Coupling the framework with continuously updated knowledge graphs or news feeds, allowing LMs to generate content that references recent events or publications, akin to a more efficient form of retrieval-augmented generation (RAG).
- Code Generation: Augmenting code LLMs by retrieving and using API signatures, library function names, or common code patterns from a codebase, improving accuracy and reducing hallucination of non-existent methods.
8. References
- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Blog.
- Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT.
- Borgeaud, S., et al. (2022). Improving Language Models by Retrieving from Trillions of Tokens. ICML.
- Lan, Y., et al. (2023). Copy-is-All-You-Need: A Two-Stage Framework for Dynamic Vocabulary Generation. arXiv preprint arXiv:2305.xxxxx.
- Gee, A., et al. (2023). Multi-Word Tokenization for Language Model Vocabulary Augmentation. ACL.
- Liu, N., et al. (2024). Dynamic Vocabulary Learning for Protein Language Models. NeurIPS.
- Grattafiori, A., et al. (2024). The Llama 3 Herd of Models. Meta AI.
- Yang, S., et al. (2025). Qwen2.5: The Next Generation of Open-Source Large Language Models. Alibaba Group.
9. Expert Analysis & Insights
Core Insight: DVAGen is not just another incremental tool; it is a strategic move to operationalize an important but underexplored research idea, dynamic vocabulary, for the modern LLM stack. The original CycleGAN paper (Zhu et al., 2017) introduced a new framework for unpaired image translation, but its value exploded through open-source implementations that standardized its use. DVAGen aims to do the same for dynamic vocabulary, turning it from an academic concept into a practitioner's tool. The real insight is recognizing that the bottleneck in LLM adaptability is not always model size but the rigidity of the tokenizer. By making this component dynamic, DVAGen attacks a fundamental limitation.
Logical Flow: The paper's logic is compelling: (1) Static vocabularies are a known Achilles heel. (2) Prior solutions exist but are messy and do not scale. (3) Therefore, we built a clean, modular, production-ready framework (DVAGen) that solves the integration and scalability problems. (4) We show that it works on modern LLMs and demonstrate practical benefits (batch inference, visualization). The flow from problem identification to a practical, validated solution is clear and investor-friendly.
Strengths & Flaws: The main strength is completeness. Offering a CLI, a WebUI, training, and evaluation in a single package significantly lowers the barrier to adoption, reminiscent of how platforms like Hugging Face's Transformers library democratized model access. The focus on batch inference is an engineering win. However, the flaw lies in evaluation depth. The PDF hints at validation but lacks hard numbers, comparisons against state-of-the-art RAG systems, and thorough ablation studies on the impact of retrieval quality. Does the dynamic vocabulary sometimes introduce "noisy" candidates that degrade performance? The framework's utility is demonstrated, but its full competitive advantage requires more rigorous benchmarking, as seen in comprehensive evaluations from institutions such as Stanford's CRFM.
Actionable Insights: For AI teams, the directive is clear: pilot DVAGen on your most vocabulary-sensitive use case. If you work in legal tech, biomedicine, or any field with an evolving lexicon, this framework could be a faster path to accuracy than fine-tuning a 70B-parameter model. Treat the dynamic vocabulary corpus as a first-class asset; its curation will be as important as prompt engineering. Furthermore, contribute to the ecosystem. The modular design invites extensions; building a specialized retriever for your domain could be a major differentiator. DVAGen represents a shift toward more modular, hybrid AI systems, and early integration offers a practical advantage.