1. Introduction & Overview
This analysis is based on the research paper "Indications that character language models learn English morpho-syntactic units and regularities" by Kementchedjhieva and Lopez (2018). The central question it addresses is whether character-level Recurrent Neural Networks (RNNs), specifically LSTMs, go beyond memorizing surface character patterns and learn abstract linguistic structure such as morphemes and syntactic categories.
While previous work (e.g., Chung et al., 2016; Kim et al., 2016) claimed that such models are morphologically aware, this paper provides direct evidence through systematic probing experiments. The authors use a character-level LSTM language model trained on English Wikipedia text and examine its internal representations and its ability to generalize.
Key Thesis:
The paper argues that a character-level language model can, under certain conditions (e.g., when morphemes overlap extensively with words), learn to identify higher-order linguistic units (morphemes, words) and capture some of their underlying properties and combinatorial regularities.
2. Language Model & Architecture
The model under study is a "wordless" character-level RNN with Long Short-Term Memory (LSTM) units, following the setup popularized by Karpathy (2015). The input is a continuous stream of characters, with spaces treated as ordinary symbols and no explicit word segmentation.
2.1 Model Formulation
At each time step $t$, the model operates as follows:
- Character Embedding: The input character $c_t$ is mapped to a dense vector: $\mathbf{x}_{c_t} = E^T \mathbf{v}_{c_t}$, where $E \in \mathbb{R}^{|V| \times d}$ is the embedding matrix, $|V|$ is the size of the character vocabulary, $d$ is the embedding dimension, and $\mathbf{v}_{c_t}$ is a one-hot vector.
- Hidden State Update: The LSTM updates its hidden state: $\mathbf{h}_t = \text{LSTM}(\mathbf{x}_{c_t}, \mathbf{h}_{t-1})$.
- Output Probability: A linear layer followed by a softmax predicts the next character: $p(c_{t+1} = c \mid \mathbf{h}_t) = \text{softmax}(\mathbf{W}_o \mathbf{h}_t + \mathbf{b}_o)_i$ for all $c \in V$, where $i$ is the index of $c$.
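The embedding and output steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the three-character vocabulary, the matrix values, and the stand-in hidden state are made up, and the LSTM update itself is omitted (the hidden state is simply passed in). It shows that multiplying $E^T$ by a one-hot vector amounts to selecting one row of $E$, and that the softmax yields one probability per vocabulary entry.

```python
import math

# Toy character vocabulary (hypothetical; the real vocabulary is much larger).
VOCAB = ["a", "b", " "]
D = 2  # embedding dimension d, chosen arbitrarily for illustration

# Embedding matrix E with shape |V| x d (made-up values).
E = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(char):
    """Compute x = E^T v, which reduces to picking the row of E for `char`."""
    v = one_hot(VOCAB.index(char), len(VOCAB))
    return [sum(E[r][k] * v[r] for r in range(len(VOCAB))) for k in range(D)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_char_distribution(h, W_o, b_o):
    """p(c_{t+1} | h_t) = softmax(W_o h_t + b_o), keyed by character."""
    scores = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W_o, b_o)]
    return dict(zip(VOCAB, softmax(scores)))

# Stand-in hidden state and output layer (assumed values, not from the paper).
h = [0.5, -0.5]
W_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_o = [0.0, 0.0, 0.0]
probs = next_char_distribution(h, W_o, b_o)
space_embedding = embed(" ")
```

In a real implementation the one-hot product is replaced by a direct row lookup, which is equivalent and cheaper.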
2.2 Training Details
The model was trained on the first 7 million character tokens of English Wikipedia, presented as a continuous stream. This setup forces the model to discover word and morpheme boundaries from distributional patterns alone.
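As a rough illustration of what "a continuous stream" means for training data, the sketch below turns raw text into next-character prediction pairs. The helper names `build_vocab` and `next_char_pairs` are hypothetical, and the sample sentence is a stand-in for the Wikipedia text; the point is only that the space is indexed like any other character and no word segmentation is applied.

```python
def build_vocab(text):
    """Map every distinct character, including the space, to an integer id."""
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}

def next_char_pairs(text, vocab):
    """Return (input_id, target_id) pairs: each character predicts the next one."""
    ids = [vocab[ch] for ch in text]
    return list(zip(ids[:-1], ids[1:]))

# Stand-in for the training text; the space is just another symbol.
text = "the cat sat"
vocab = build_vocab(text)
pairs = next_char_pairs(text, vocab)
```

Every space in the stream appears both as a prediction target and as an input, so predicting where words end is part of the ordinary training objective.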
3. Key Findings & Evidence
The authors use several probing techniques to uncover what the model has learned.
3.1 Productive Morphological Processes
The model shows an ability to apply productive English morphological rules. For example, when prompted with a novel stem, it can produce plausible inflected or derived forms, which suggests that it has abstracted morphological units (e.g., recognizing "-ed" as a past-tense suffix) rather than merely memorizing whole words.
3.2 Discovery of the "Boundary Unit"
A key finding is the identification of a specific hidden unit in the LSTM that consistently shows high activation at word boundaries (spaces). This unit effectively acts as a learned word segmenter. Importantly, its activation pattern extends to morpheme boundaries within words (e.g., at the junction of "un" and "happy"), offering a mechanistic explanation of how the model identifies sub-word units.
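One plausible way to search for such a unit is to correlate each hidden unit's activation with an indicator of whether the current character is a space. The sketch below does this with a hand-written Pearson correlation. It is an assumption about how such a search could be run, not a description of the authors' procedure, and the activations are synthetic, with unit 1 deliberately constructed to fire at spaces.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def find_boundary_unit(text, activations):
    """activations[t][j] is the activation of hidden unit j after reading text[t]."""
    is_space = [1.0 if ch == " " else 0.0 for ch in text]
    n_units = len(activations[0])
    scores = [pearson([row[j] for row in activations], is_space)
              for j in range(n_units)]
    best = max(range(n_units), key=lambda j: scores[j])
    return best, scores

text = "a cat sat"
# Synthetic activations for three hidden units (made up for illustration).
# Unit 1 is built to be high exactly at spaces; units 0 and 2 are distractors.
activations = [
    [0.1 * (t % 3), 0.9 if ch == " " else 0.1, -0.2 * (t % 2)]
    for t, ch in enumerate(text)
]
best, scores = find_boundary_unit(text, activations)
```

With activations recorded from a trained model, the same scores could then be inspected at word-internal positions to see whether the top unit also responds at morpheme junctions.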
3.3 Learning Morpheme Boundaries
Experiments indicate that the model learns morpheme boundaries by extrapolating from the more salient signal of word boundaries. The statistical regularity of spaces provides a scaffold for discovering word-internal morphological structure.
3.4 Encoding Syntactic (POS) Information
Diagnostic classifiers trained on the model's hidden states can accurately predict the part-of-speech (POS) tag of a word. This indicates that the character-level model encodes not only morphological information but also syntactic information about the words it processes, presumably inferred from sequential context.
4. Key Experiment: Selectional Restrictions
The strongest evidence comes from testing the model's knowledge of selectional restrictions in English morphology. This task sits at the interface between morphology and syntax. For example, the suffix "-ity" typically attaches to adjectives to form nouns ("active" → "activity"), not to verbs ("*runity").
The authors test the model by comparing the probability it assigns to a well-formed derivation (e.g., completing "active" with "-ity") with the probability it assigns to an ill-formed one (e.g., completing "run" with "-ity"). The model shows a strong preference for the linguistically licensed combinations, indicating that it has learned these abstract restrictions.
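The comparison described above can be sketched as a chain-rule sum of next-character log-probabilities over the suffix. In the sketch below, `suffix_logprob` accepts any callable that returns $p(\text{char} \mid \text{prefix})$. The function `toy_next_char_prob` and its probability table are invented purely for illustration and are not results from the paper. The written stem is taken to be "activ", since "activity" drops the final "e" of "active".

```python
import math

def suffix_logprob(next_char_prob, stem, suffix):
    """Sum log p(ch | prefix) over the suffix, extending the prefix as we go."""
    total = 0.0
    prefix = stem
    for ch in suffix:
        total += math.log(next_char_prob(prefix, ch))
        prefix += ch
    return total

def toy_next_char_prob(prefix, ch):
    """Hypothetical stand-in for a trained character LM (made-up numbers)."""
    table = {
        ("activ", "i"): 0.30, ("activi", "t"): 0.80, ("activit", "y"): 0.90,
        ("run", "i"): 0.02, ("runi", "t"): 0.10, ("runit", "y"): 0.20,
    }
    return table.get((prefix, ch), 0.01)

licit = suffix_logprob(toy_next_char_prob, "activ", "ity")
illicit = suffix_logprob(toy_next_char_prob, "run", "ity")
prefers_licit = licit > illicit
```

With a real model, `next_char_prob` would wrap the softmax output of the LSTM after it has read the prefix, and the comparison would be repeated over many adjective and verb stems.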
Key Experimental Result:
The character LM distinguished licit from illicit morphological combinations with high accuracy, confirming that it encodes morpho-syntactic regularities beyond surface form.
5. Technical Details & Mathematical Formulation
The core learning mechanism is the LSTM's ability to compress the sequence history into a state vector $\mathbf{h}_t$. The probability of the next character is given by: $$p(c_{t+1} | c_{1:t}) = \text{softmax}(\mathbf{W}_o \mathbf{h}_t + \mathbf{b}_o)$$ where $\mathbf{h}_t = f_{\text{LSTM}}(\mathbf{x}_{c_t}, \mathbf{h}_{t-1})$. The model's "understanding" of morphology and syntax is encoded in the LSTM parameters ($\mathbf{W}_f, \mathbf{W}_i, \mathbf{W}_o, \mathbf{W}_c$, etc.) and the projection matrices, which are optimized to minimize the cross-entropy loss on character prediction. Note that $\mathbf{W}_o$ in the parameter list denotes the LSTM output-gate weights, which are distinct from the output projection $\mathbf{W}_o$ inside the softmax.
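For concreteness, the cross-entropy objective mentioned above can be written out as follows. The symbol $\theta$ (all trainable parameters), the sequence length $T$, and the averaging over positions are notational assumptions added here, not taken from the paper: $$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p(c_{t+1} \mid c_{1:t}; \theta)$$ Minimizing $\mathcal{L}$ raises the probability that the model assigns to the character that actually follows each prefix in the training stream.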
The probing experiments involve training simple classifiers (e.g., logistic regression) on the frozen hidden-state representations $\mathbf{h}_t$ to predict external linguistic labels (e.g., "is this a word boundary?"), revealing what is linearly encoded in those states.
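A minimal sketch of such a probe is given below: a logistic regression fitted by batch gradient descent on fixed vectors. The hidden states and labels are synthetic stand-ins, and the learning rate and epoch count are arbitrary. In an actual experiment the vectors would be extracted from the trained language model, whose parameters stay frozen while only the probe's weights are updated.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(states, labels, lr=1.0, epochs=500):
    """Fit logistic regression weights on frozen hidden states."""
    dim = len(states[0])
    w, b = [0.0] * dim, 0.0
    n = len(states)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(states, labels):
            # Gradient of the cross-entropy loss w.r.t. the logit is (p - y).
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            for k in range(dim):
                grad_w[k] += err * h[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, h):
    """Return 1 (boundary) if the logit is positive, else 0."""
    return 1 if sum(wi * hi for wi, hi in zip(w, h)) + b > 0 else 0

# Synthetic 2-unit hidden states (made up); label 1 = word boundary, 0 = not.
states = [[0.9, 0.1], [0.8, -0.3], [0.1, 0.2],
          [0.2, 0.1], [0.85, 0.0], [0.15, 0.3]]
labels = [1, 1, 0, 0, 1, 0]
w, b = train_probe(states, labels)
accuracy = sum(predict(w, b, h) == y for h, y in zip(states, labels)) / len(labels)
```

Because the probe is linear, high accuracy on held-out states would suggest that the label is linearly decodable from $\mathbf{h}_t$; evaluating on the training points, as done here for brevity, would not support that conclusion on its own.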
6. Results & Interpretation
Taken together, the results paint a convincing picture:
- Boundary Detection: The existence of a dedicated "boundary unit" provides a clear, interpretable mechanism for unit discovery.
- Productive Generalization: The model applies rules to novel items, which rules out pure pattern memorization.
- Syntactic Awareness: POS information is encoded, enabling syntax-sensitive behavior.
- Morpho-Syntactic Integration: Success on the selectional restriction tasks shows that the model combines morphological and syntactic knowledge.
Noted Limitation: The authors acknowledge that the model sometimes makes incorrect generalizations, indicating that what it has learned is an imperfect approximation of human linguistic competence.
7. Analysis Framework & Case Example
Framework: The paper uses a multi-pronged analysis framework: 1. Generative Probing: test productive use (e.g., completing a novel word). 2. Diagnostic Classifier Analysis: train auxiliary models on hidden states to predict linguistic features. 3. Unit Analysis: manually inspect the activation patterns of individual neurons.
Case Example - Probing for "-ity": To test knowledge of the suffix "-ity", the framework would: 1. Extract the hidden state $\mathbf{h}$ after processing a stem (e.g., "active"). 2. Apply a diagnostic classifier to $\mathbf{h}$ to predict whether the next morpheme is a noun-forming suffix. 3. Compare the model probabilities $p(\text{'ity'} | \text{'active'})$ and $p(\text{'ity'} | \text{'run'})$. 4. Inspect the activation of the "boundary unit" at the end of the stem to see whether it signals a morpheme boundary suitable for derivation.
8. Analyst's Perspective: Core Insight & Critique
Core Insight: This paper is a masterclass in model interrogation. It goes beyond performance metrics to ask *what* was learned and *how*. The discovery of the "boundary neuron" is especially elegant: a rare moment of clear mechanistic interpretability in a deep network. The work argues convincingly that character LSTMs are not mere pattern matchers but can induce abstract linguistic categories from distributional signals, supporting claims made in earlier applied work such as the character-level neural machine translation system of Lee et al. (2016).
Logical Flow: The argument is well constructed. It moves from observing productive generalization (the what), to discovering the boundary unit (a candidate "how"), to verifying that this unit explains morpheme learning, and finally to testing a complex, integrated ability (selectional restrictions). This step-by-step validation is compelling.
Strengths & Flaws: Strengths: Methodological rigor in probing; compelling, interpretable evidence (the boundary unit); a fundamental question in NLP interpretability addressed head-on. Flaws: The scope is limited to English, a morphologically simple language with a near-perfect alignment between spaces and word boundaries. The paper's own caveat, "when morphemes overlap extensively with the words of a language", is critical. The findings may well break down for agglutinative languages (e.g., Turkish, Finnish) or for languages written in continuous script without spaces. The model's "abstraction" may be heavily scaffolded by orthographic conventions, a point that is underemphasized. As resources such as the ACL Anthology on morphological modeling make clear, the challenges vary widely across languages.
Actionable Insights: For practitioners: 1) Character-level models *can* capture linguistic structure, which supports their use in low-resource or morphologically rich settings, but validate this on your own language. 2) The probing framework is a blueprint for auditing model capabilities. For researchers: The paper sets a benchmark for interpretability work. Future work should test these findings across typologically diverse languages and in modern Transformer-based character models (e.g., ByT5). The field must ask whether the striking result here is an artifact specific to English or a general capability of sequence models.
In sum, Kementchedjhieva and Lopez provide strong evidence for linguistic abstraction in character LSTMs, but they also implicitly chart the limits of that abstraction. It is a foundational study that pushes the community from intuition to evidence.
9. Future Applications & Research Directions
- Low-Resource & Morphologically Rich Languages: Character/sub-word models that learn morphology implicitly could reduce reliance on costly morphological analyzers for languages such as Arabic or Turkish.
- Improved Model Interpretability: Techniques for locating "functional neurons" such as the boundary unit could be generalized to understand how models represent other linguistic features (tense, negation, semantic roles).
- Bridging Symbolic and Sub-Symbolic AI: Understanding how neural models learn discrete, rule-like patterns (e.g., selectional restrictions) could inform hybrid AI architectures.
- Robustness Testing: Applying this probing methodology to modern large language models (LLMs) to see whether they develop similar or better linguistic representations.
- Cross-Lingual Generalization: A major open direction is to test whether these findings hold in languages with different morphological systems and scripts, moving beyond an Indo-European bias.
10. References
- Kementchedjhieva, Y., & Lopez, A. (2018). Indications that character language models learn English morpho-syntactic units and regularities. arXiv preprint arXiv:1809.00066.
- Chung, J., Cho, K., & Bengio, Y. (2016). A character-level decoder without explicit segmentation for neural machine translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
- Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-aware neural language models. Proceedings of the AAAI Conference on Artificial Intelligence.
- Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy blog.
- Lee, J., Cho, K., & Hofmann, T. (2016). Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017.
- Sutskever, I., Martens, J., & Hinton, G. E. (2011). Generating text with recurrent neural networks. Proceedings of the 28th International Conference on Machine Learning.
- Association for Computational Linguistics (ACL) Anthology. A digital archive of research papers in computational linguistics and NLP. Retrieved from https://aclanthology.org/