Zaɓi Harshe

DVAGen: Tsarin Haɗin Kai don Ƙirƙirar Harshe Mai Ƙarfafawa da Ƙamus Mai Sauyi

DVAGen tsarin buɗaɗɗen tushe ne don horarwa, tantancewa, da kuma ganin ƙirƙirar harshe mai ƙarfafawa da ƙamus mai sauyi, yana magance matsalolin kalmomin da ba a sani ba da inganta ƙarfin fassara.
learn-en.org | PDF Size: 0.8 MB
Kima: 4.5/5
Kimarku
Kun riga kun ƙididdige wannan takarda
Murfin Takardar PDF - DVAGen: Tsarin Haɗin Kai don Ƙirƙirar Harshe Mai Ƙarfafawa da Ƙamus Mai Sauyi

1. Gabatarwa

Manyan Tsarin Harshe (LLMs) galibi ana horar da su da ƙamus mai tsayayye, wanda ke iyakance ikonsu na yin amfani da sabbin kalmomi ko kalmomin da ba a sani ba (OOV) da kuma sarrafa haɗe-haɗen alamomin harshe cikin inganci. Wannan ƙuntatawa tana da matsala musamman ga aikace-aikacen da suka shafi fannoni na musamman, yanayin harsuna daban-daban, da harsuna masu ci gaba. Duk da cewa an gabatar da hanyoyin ƙamus mai sauyi don rage wannan matsala, hanyoyin da ake da su sau da yawa ba su da haɗin kai, ba su da goyon baya ga LLMs na zamani, kuma suna fama da ƙarancin ƙarfin fassara.

Don rufe wannan gibi, mun gabatar da DVAGen (Ƙirƙirar Harshe Mai Ƙarfafawa da ƙamus Mai Sauyi), cikakken tsarin buɗaɗɗen tushe wanda aka tsara don ci gaban ƙarshen-zuwa-ƙarshe na manyan tsarin harshe masu ƙarfafawa da ƙamus mai sauyi. DVAGen yana ba da haɗaɗɗun kayan aiki don horarwa, tantancewa, da kuma ganin sakamakon ƙirƙira a lokacin gaskiya, yana goyan bayan haɗin kai tare da LLMs na zamani na buɗaɗɗen tushe kuma yana da ingantattun ƙarfin fassara na batch.

2. Bayanan Baya & Ayyukan Da Suka Danganta

Hanyoyin rarrabe kalmomi na gargajiya kamar Byte-Pair Encoding (BPE) da WordPiece sun dogara da ƙamus mai tsayayye, wanda ke sa su zama marasa sassauƙa bayan horo. Haɓakawa kamar Rarrabe Kalmomi Da Yawa (MWT) suna faɗaɗa ƙamus tare da n-grams masu yawa amma har yanzu suna tsayayye. Hanyoyin da aka ƙarfafa da dawo da bayanai, kamar RETRO da tsarin Copy-is-All-You-Need (CoG), suna gabatar da abubuwa masu sauyi ta hanyar dawo da sassan da suka dace ko jimloli yayin ƙirƙira. Duk da haka, waɗannan hanyoyin sau da yawa suna haɗa da hanyoyi masu rikitarwa, matakai da yawa, suna haifar da jinkiri mai yawa, kuma galibi an tabbatar da su akan gine-ginen tsoho kamar GPT-2, ba su da tabbaci akan haɗin kai tare da LLMs na zamani.

3. Tsarin DVAGen

An gina DVAGen a matsayin tsarin module mai faɗaɗawa don magance iyakokin ayyukan da suka gabata.

3.1. Tsarin Gini na Asali & Ƙirar Module

Tsarin ya raba mahimman sassa—mai rarrabe kalmomi, mai dawo da bayanai, mai ƙididdige maki, da mai ƙirƙira—zuwa modules masu zaman kansu. Wannan tsarin module yana ba masu bincike da masu haɓakawa damar keɓancewa ko musanya sassa (misali, gwada hanyoyin dawo da bayanai daban-daban ko ayyukan ƙididdige maki) ba tare da sake fasalin dukan tsarin ba. Yana ɗaukar falsafar shiga-kuma-kunna don haɗa LLMs na buɗaɗɗen tushe da ake da su.

3.2. Hanyar Horarwa & Fassara

DVAGen yana goyan bayan cikakkiyar hanya: train don daidaita samfuran da ke da ƙarfin ƙamus mai sauyi, chat don ƙirƙirar hulɗa, da eval don cikakken tantance aiki akan ma'auni na yau da kullun.

3.3. Kayan Aikin CLI & WebUI

Bambance-bambancen mahimmanci shine samar da duka Kayan Aikin Tsarin Umarni (CLI) don rubutun aiki da sarrafa kai da kuma Yanar Gizo Mai Amfani (WebUI) don duba a lokacin gaskiya da kuma ganin sakamakon ƙirƙira, gami da yanke shawara a matakin alama da amfani da ƙamus mai sauyi.

4. Aiwatar da Fasaha

4.1. Tsarin Ƙamus Mai Sauyi

A cikin ainihinsa, DVAGen yana ƙarfafa daidaitaccen tsinkayen alama mai zuwa na LLM. Yayin ƙirƙira, don wani mahallin $C_t$, tsarin yana dawo da jerin jimlolin da za a iya zaɓa $P = \{p_1, p_2, ..., p_k\}$ daga tushen ilimi. Kowane ɗan takara $p_i$ ana ƙididdige shi ta hanyar aiki $S(p_i | C_t)$, wanda zai iya dogara ne akan yuwuwar LLM, ma'aunin da aka koya, ko makin kamanceceniya na dawo da bayanai. Yuwuwar ƙirƙira ta ƙarshe haɗuwa ce ta rarraba ƙamus na yau da kullun da rarraba ɗan takara mai sauyi:

$P(w | C_t) = \lambda \cdot P_{LM}(w | C_t) + (1 - \lambda) \cdot \sum_{p_i \in P} S(p_i | C_t) \cdot \mathbb{1}(w \in p_i)$

inda $\lambda$ shine ma'aunin daidaitawa kuma $\mathbb{1}$ aikin nuni ne.

4.2. Ingantaccen Fassara na Batch

Yin amfani da ƙarfin matsawa jeri na jimloli masu sauyi (ƙirƙira jimla cikin mataki ɗaya idan aka kwatanta da alamomi da yawa), DVAGen yana aiwatar da ingantaccen fassara na batch. Ta hanyar sarrafa jerin shigarwa da yawa a lokaci guda da ingantaccen tara ayyukan dawo da bayanai da ƙididdige maki don ɗan takara mai sauyi, yana inganta ƙarfin aiki sosai idan aka kwatanta da sarrafa shigarwa guda ɗaya a jere, yana magance babban aibi na ƙarfi a cikin hanyoyin ƙamus mai sauyi da suka gabata.

5. Sakamakon Gwaji & Tantancewa

Takin ya tabbatar da DVAGen akan LLMs na zamani (misali, jerin LLaMA). Manyan binciken sun haɗa da:

  • Rage Rudani: Samfuran da aka ƙarfafa da DVAGen suna nuna raguwar rudani akan saitin gwaji mai ɗauke da sharuɗɗan OOV da kuma ƙamus na musamman na fanni, suna nuna ingantaccen ƙarfin ƙirar harshe.
  • Gudun Fassara: Goyon bayan fassara na batch yana haifar da inganta ƙarfin aiki sau 3-5 idan aka kwatanta da fassarar ƙamus mai sauyi mara batch, tare da ƙaramin tasiri akan ingancin ƙirƙira.
  • Amfanin Ganin Bayanai: WebUI yana haskaka yadda ya kamata lokacin da kuma waɗanne abubuwan ƙamus mai sauyi ake amfani da su, yana ba da bayyananniyar tsarin yanke shawara na samfurin. Hoto na 1 a cikin takardar yana kwatanta kwatancen gefe-da-gefe na ƙirƙira na yau da kullun da na DVAGen-ƙarfafa, yana nuna maye gurbin alamomin ƙananan kalmomi da yawa tare da jimla ɗaya, da aka dawo da ita mai musamman ga fanni.

6. Tsarin Bincike & Nazarin Lamari

Fahimtar Asali: DVAGen ba wani kayan aiki kawai ba ne; wasa ne na tsarin tsarin dabaru. Babban cikas a cikin AI ba girman samfuri kawai ba ne, amma tsaurin ƙamus. Ta hanyar ɗaukar ƙamus a matsayin albarkatu mai sauyi, mai dawo da bayanai maimakon kayan aiki mai tsayayye, DVAGen yana kai hari ga aibi na asali a cikin ƙirar LLM na yanzu—rashin iyawarsu na koyon sabbin kalmomi bayan horo. Wannan yana kwatanta juyin halitta a cikin hangen nesa na kwamfuta daga tacewa mai tsayayye zuwa hanyoyin kulawa masu sauyi, kamar yadda aka gani a tasirin tsarin Transformer idan aka kwatanta da hanyoyin haɗaɗɗun hotuna na farko.

Kwararar Ma'ana: Ma'anar tsarin tana da kyau sosai: 1) Amince da matsalar ƙamus mai tsayayye, 2) Raba mafita zuwa ilimin da za a iya dawo da shi (jimloli) da tsarin ƙididdige maki/zaɓi, 3) Rarraba komai don sassauƙa, da 4) Ƙirƙira don ma'auni (fassara na batch). Yana bin nasarar littafin wasa na buɗaɗɗen tushe na ayyuka kamar Hugging Face's Transformers—ba da tsarin famfo, bari al'umma su gina gidaje.

Ƙarfi & Aibobi: Babban ƙarfinsa shine haɗin kai da aiki. Samar da duka CLI da WebUI babban nasara ne don karɓuwa, yana biyan buƙatun masu bincike da injiniyoyi. Mayar da hankali kan fassara na batch amsa kai tsaye ce ga ciwon kai na farko na samfuran ilimi na baya. Duk da haka, aibin yana cikin dogaro na asali akan inganci da jinkiri na tushen dawo da bayanai. Kamar yadda binciken ƙirƙira da aka ƙarfafa da dawo da bayanai (RAG) ya nuna, kamar na Facebook AI Research (FAIR) akan samfurin Atlas, mummunan dawo da bayanai na iya lalata aiki fiye da taimako. DVAGen a halin yanzu yana kauce wa matsalar "cikakkiyar dawo da bayanai," yana tura shi ga mai amfani.

Fahimta Mai Aiki: Ga kamfanoni, aikace-aikacen nan take yana cikin fannoni masu canzawan sharuɗɗa—biotech (sabbin sunayen magunguna), kuɗi (ƙirƙirar gajarta), shari'a (sharuɗɗan da suka shafi shari'a). Ai wadatar da Layer DVAGen a saman hanyar LLM da kuke da ita don nasara cikin sauri a cikin daidaita fanni. Ga masu bincike, tsarin wurin gwaji ne: gwada ayyukan ƙididdige maki daban-daban $S(p_i | C_t)$. Ƙididdigar maki na yanzu dangane da yuwuwar ba ta da hankali; haɗa masu ƙididdige maki masu koyi, masu fahimtar mahallin zai iya zama ci gaba na gaba.

Nazarin Lamari - Ƙirƙirar Taƙaitaccen Bayanin Kimiyyar Rayuwa: Yi la'akari da ƙirƙirar taƙaitaccen bayani don sabon kwayoyin halitta, "CRISPRaX," wanda ba a san shi ba ga tushen LLM. Daidaitaccen samfuri zai iya fitar da gutsuttsuran alamomi: "CRI", "SP", "Ra", "X". Mai dawo da bayanai na DVAGen, wanda aka haɗa da ɗakin karatu na kimiyyar rayuwa, yana ɗauko jimlolin da za a iya zaɓa kamar "CRISPR activation variant," "gene editing complex." Mai ƙididdige maki ya gano "CRISPR activation variant" a matsayin mai dacewa sosai idan aka yi la'akari da mahallin. Sa'an nan mai ƙirƙira ya fitar da jimla mai ma'ana "CRISPR activation variant (CRISPRaX)" kai tsaye, yana inganta ƙarfin magana da daidaito sosai ba tare da sake horar da samfurin ba.

7. Aikace-aikace na Gaba & Hanyoyi

  • Mataimakan AI Na Musamman: Haɗa ƙamus na musamman na mai amfani (sunayen ayyuka, abokan hulɗa na sirri, sha'awar musamman) cikin tattaunawa cikin sauri.
  • Juyin Halittar Harshe a Lokacin Gaskiya: Haɗawa da rafukan bayanai na kai tsaye (labarai, kafofin watsa labarun) don koyon sabbin kalmomi na yare, sharuɗɗan da suka shahara, ko ƙungiyoyin labarai masu fashewa nan take.
  • Faɗaɗa Ƙamus Tsakanin Hanyoyi: Faɗaɗa tsarin fiye da rubutu don dawo da haɗa alamomi ko ra'ayoyi daga hotuna, sauti, ko bayanai masu tsari, yana matsawa zuwa ƙamus mai sauyi na gaske tsakanin hanyoyi.
  • Koyo na Tarayya & Akan Na'ura: Ba da damar sabuntawa masu sauƙi, na gida na ƙamus mai sauyi akan na'urori na gefe don aikace-aikacen masu kula da sirri, inda ainihin samfurin ya kasance mai tsayayye amma ɗakin bayanan jimlolin da za a iya dawo da su ya keɓance akan lokaci.
  • Haɗin Kai tare da Tsare-tsaren Wakili: Haɓaka wakilan AI (misali, waɗanda aka gina akan tsare-tsare kamar LangChain ko AutoGPT) tare da ikon koyon sabbin sunayen kayan aiki, sigogin API, ko abubuwan da suka shafi yanayi cikin sauri yayin aiwatar da aiki.

8. Nassoshi

  1. Radford, A., et al. (2019). Samfuran Harshe Masu Koyon Ayyuka Da Yawa Ba tare da Kulawa ba. OpenAI Blog.
  2. Devlin, J., et al. (2019). BERT: Horon Farko na Transformers Masu Hanyoyi Biyu Masu Zurfi don Fahimtar Harshe. NAACL-HLT.
  3. Borgeaud, S., et al. (2022). Inganta Samfuran Harshe ta hanyar Dawo da daga Tiriliyoyin Alamomi. ICML.
  4. Lan, Y., et al. (2023). Copy-is-All-You-Need: Samfurin Harshe Mai Ƙarfafawa da Dawo da Bayanai don Ƙirƙirar Rubutu Mai Tsayi. arXiv preprint arXiv:2305.11346.
  5. Liu, N., et al. (2024). Ƙirƙirar Harshe Mai Ƙarfafawa da Ƙamus Mai Sauyi don Samfuran Harshe na Furotin. NeurIPS Workshop.
  6. Vaswani, A., et al. (2017). Kulawa Duk Abin da Kuke Bukata. NeurIPS.
  7. Facebook AI Research (FAIR). (2023). Atlas: Koyon Ƙananan Harbi tare da Samfuran Harshe Mai Ƙarfafawa da Dawo da Bayanai. FAIR Publications.
  8. Grattafiori, A., et al. (2024). Iyakokin Rarrabe Kalmomi Mai Tsayayye a cikin NLP na Zamani. Journal of Artificial Intelligence Research.