
MENmBERT: Transfer Learning for Malaysian English NLP

A study of transfer learning from English PLMs to Malaysian English for improved Named Entity Recognition and Relation Extraction in a low-resource setting.


26.27%: Improvement in Relation Extraction (RE) performance

14,320: News articles in the MEN corpus

6,061: Annotated entities

1. Introduction

Malaysian English poses a distinctive linguistic challenge for NLP: it is a low-resource, creole-like language that blends elements of Malay, Chinese, and Tamil with Standard English. This study addresses a significant performance gap in Named Entity Recognition (NER) and Relation Extraction (RE) that appears when standard pre-trained language models are applied to Malaysian English text.

The morphosyntactic adaptations, semantic features, and code-switching patterns characteristic of Malaysian English cause substantial performance degradation in state-of-the-art models. Our work introduces MENmBERT and MENBERT, specialized language models that address this gap through transfer learning.

2. Background and Related Work

Adapting pre-trained language models to domain-specific or language-specific corpora has yielded substantial gains across a range of NLP tasks. Martin et al. (2020) and Antoun et al. (2021) show that further pre-training on specialized corpora improves model performance in the target language setting.

Malaysian English presents particular challenges because of its creole-like nature: it features loanwords, compound words, and derivations drawn from multiple languages. Code-switching, in which speakers mix English and Malay within a single utterance, adds further complexity for standard NLP models.

3. Methodology

3.1 Pre-training Approach

MENmBERT applies transfer learning from English PLMs through continued pre-training on the Malaysian English News (MEN) Corpus. The pre-training objective follows the masked language modeling approach:

$$L_{MLM} = -\mathbb{E}_{x \sim D} \sum_{i=1}^{n} \log P(x_i | x_{\backslash i})$$

where $x$ is an input sequence of $n$ tokens, $D$ is the data distribution of the MEN corpus, and $x_{\backslash i}$ denotes the sequence with the $i$-th token masked.
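As a rough illustration of this objective, the following Python sketch evaluates the loss on toy data. It assumes the model's probability for the correct token at each masked position is already available; the function names and the probability values are hypothetical and are not taken from the paper.

```python
import math

def mlm_loss(true_token_probs):
    """Negative log-likelihood summed over the positions of one sequence.

    true_token_probs[i] is the probability the model assigns to the correct
    token at position i when that position is masked.
    """
    return -sum(math.log(p) for p in true_token_probs)

def corpus_mlm_loss(sequences):
    """Estimate the expectation over the corpus as the mean per-sequence loss."""
    return sum(mlm_loss(seq) for seq in sequences) / len(sequences)

# Hypothetical probabilities for two short sequences (illustrative values only).
toy_corpus = [[0.9, 0.5, 0.8], [0.7, 0.6]]
print(round(corpus_mlm_loss(toy_corpus), 4))
```

A confident, correct prediction (probability near 1) contributes almost nothing to the loss, while a low probability for the true token is penalized heavily.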

3.2 Fine-tuning Strategy

The models were fine-tuned on the MEN-Dataset, which contains 200 news articles with 6,061 annotated entities and 4,095 relation instances. Fine-tuning used task-specific layers for NER and RE, optimized with a cross-entropy loss:

$$L_{NER} = -\sum_{i=1}^{N} \sum_{j=1}^{T} y_{ij} \log(\hat{y}_{ij})$$

where $N$ is the number of sequences, $T$ is the sequence length, $y_{ij}$ is the true label, and $\hat{y}_{ij}$ is the predicted probability.
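A minimal Python sketch of this loss follows. It assumes that the class index is implicit in the formula, so that each token has a one-hot gold label and a predicted probability distribution over labels. The label set and the probabilities are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical label set; the dataset's actual tag inventory is not given here.
LABELS = ["O", "PERSON", "LOCATION"]

def token_cross_entropy(gold_label, probs):
    """With a one-hot gold vector, only the gold label's log-probability survives."""
    return -math.log(probs[LABELS.index(gold_label)])

def ner_loss(gold_sequences, predicted_sequences):
    """Sum the token-level cross-entropy over all sequences and all positions."""
    total = 0.0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        for label, probs in zip(gold, pred):
            total += token_cross_entropy(label, probs)
    return total

# One sequence of three tokens with illustrative predicted distributions.
gold = [["PERSON", "O", "LOCATION"]]
pred = [[[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.2, 0.6]]]
print(round(ner_loss(gold, pred), 4))
```

In practice a deep learning framework would compute this from logits in a numerically stable way; the explicit loops here only mirror the double sum in the formula.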

4. Experimental Results

4.1 Named Entity Recognition (NER) Performance

MENmBERT achieved an overall improvement of 1.52% on NER compared with bert-base-multilingual-cased. Although the overall gain is modest, a detailed analysis shows substantial improvements for specific entity labels, particularly Malaysia-related entities and code-switched expressions.

Figure 1: NER performance comparison showing MENmBERT outperforming the baseline models on Malaysia-related entity types, with especially strong performance on locations and organizations specific to the Malaysian context.

4.2 Relation Extraction (RE) Performance

The largest improvement was observed in Relation Extraction, where MENmBERT achieved a 26.27% performance gain. This substantial improvement indicates the model's stronger ability to capture semantic relationships in Malaysian English text.

Key Insights

  • Language-specific pre-training improves performance on low-resource language varieties
  • Code-switching patterns call for specialized model architectures
  • Transfer learning from high-resource to low-resource languages shows promising results
  • Region-focused corpora improve model performance on regional language variants

5. Analysis Framework

Industry Analyst Perspective

Core Insight

This study challenges the one-size-fits-all approach to multilingual NLP. The 26.27% RE improvement is not merely an incremental gain: it is an indictment of how mainstream models fail on marginalized language variants. Malaysian English is not an edge case; it is the canary in the coal mine for hundreds of underserved language communities.

Logical Flow

The methodology follows three well-defined steps that dismantle conventional wisdom: identifying the performance gap (standard models fail badly), deploying targeted transfer learning (building MENmBERT), and validating the result through rigorous benchmarking. The approach resembles domain-adaptation strategies that succeeded in biomedical NLP (Lee et al., 2019), but applies them to preserving linguistic diversity.

Strengths & Flaws

Strengths: The 14,320-article corpus represents a significant data curation effort. The two-model approach (MENmBERT and MENBERT) reflects methodological sophistication. The RE performance gain is undeniable.

Flaws: The modest 1.52% NER improvement raises eyebrows: either the evaluation metrics are flawed or the approach has inherent limitations. The paper dances around this discrepancy without explaining it fully. The model's reliance on news-domain data also limits generalization.

Actionable Insights

For companies operating in Southeast Asia: consider immediate adoption. For researchers: replicate this approach for Singapore English and Indian English variants. For model developers: this confirms that "multilingual" in practice means "major languages only", and that it is time for a change.

Analysis Framework Example

Case Study: Entity Recognition in Code-Switched Text

Input: "I'm going to the pasar malam in Kuala Lumpur, then meeting Encik Ahmad at KLCC"

Standard BERT Output: [ORG] pasar malam, [LOC] Kuala Lumpur, [MISC] Encik Ahmad, [MISC] KLCC

MENmBERT Output: [EVENT] pasar malam, [CITY] Kuala Lumpur, [PERSON] Encik Ahmad, [LANDMARK] KLCC

This illustrates MENmBERT's stronger grasp of Malaysian cultural context and entity types.
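The difference between the two outputs can be tabulated with a short Python sketch. The labels below are copied from the illustrative case study rather than produced by running either model, and the helper names `fallback_rate` and `label_changes` are hypothetical.

```python
# Labels transcribed from the case study above (not actual model output).
baseline = {
    "pasar malam": "ORG",
    "Kuala Lumpur": "LOC",
    "Encik Ahmad": "MISC",
    "KLCC": "MISC",
}
menmbert = {
    "pasar malam": "EVENT",
    "Kuala Lumpur": "CITY",
    "Encik Ahmad": "PERSON",
    "KLCC": "LANDMARK",
}

def fallback_rate(predictions, fallback="MISC"):
    """Fraction of entity spans that received the catch-all label."""
    hits = sum(1 for label in predictions.values() if label == fallback)
    return hits / len(predictions)

def label_changes(before, after):
    """Map each span whose label differs to its (before, after) label pair."""
    return {span: (before[span], after[span])
            for span in before if before[span] != after[span]}

print(fallback_rate(baseline), fallback_rate(menmbert))
for span, (old, new) in label_changes(baseline, menmbert).items():
    print(f"{span}: {old} -> {new}")
```

In this example, half of the baseline's spans fall into the catch-all MISC label, while the MENmBERT output assigns a specific type to every span.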

6. Future Applications

The success of MENmBERT opens several promising directions for future research and applications:

  • Cross-lingual Transfer: Applying similar approaches to other English variants (Singapore English, Indian English)
  • Multimodal Integration: Combining text with audio data for improved code-switching detection
  • Real-World Applications: Deployment in customer-service chatbots for the Malaysian market
  • Educational Technology: Language-learning tools tailored to Malaysian English speakers
  • Legal and Government Applications: Document processing for Malaysian legal and administrative texts

The approach shows potential to extend to other low-resource language variants and creole languages worldwide.

7. References

  1. Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  2. Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
  3. Conneau, A., et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale.
  4. Lan, Z., et al. (2020). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.
  5. Martin, L., et al. (2020). CamemBERT: a Tasty French Language Model.
  6. Antoun, W., et al. (2021). AraBERT: Transformer-based Model for Arabic Language Understanding.
  7. Chanthran, M., et al. (2024). Malaysian English News Corpus for NLP Tasks.
  8. Lee, J., et al. (2019). BioBERT: a pre-trained biomedical language representation model for biomedical text mining.