I.J.InformationTechnologyandComputerScience,2018,8,11-17PublishedOnlineAugust2018inMECS()DOI:10.5815/ijitcs.2018.08.02Copyright©2018MECSI.J.InformationTechnologyandComputerScience,2018,8,11-17AutomaticSpokenLanguageRecognitionwithNeuralNetworksValentinGazeauDepartmentofComputerScienceatSamHoustonStateUniversity,Huntsville,TX,USAE-mail:vcg006@shsu.eduCihanVarolDepartmentofComputerScienceatSamHoustonStateUniversity,Huntsville,TX,USAE-mail:cxv007@shsu.eduReceived:22June2018;Accepted:15July2018;Published:08August2018Abstract—TranslationhasbecomeveryimportantinourgenerationaspeoplewithcompletelydifferentculturesandlanguagesarenetworkedtogetherthroughtheInternet.NowadaysonecaneasilycommunicatewithanyoneintheworldwiththeservicesofGoogleTranslateand/orothertranslationapplications.Humanscanalreadyrecognizelanguagesthattheyhavepriorybeenexposedto.Eventhoughtheymightnotbeabletotranslate,theycanhaveagoodideaofwhatthespokenlanguageis.ThispaperdemonstrateshowdifferentNeuralNetworkmodelscanbetrainedtorecognizedifferentlanguagessuchasFrench,English,Spanish,andGerman.ForthetrainingdatasetvoicesampleswerechoosedfromShtooka,VoxForge,andYoutube.Fortestingpurposes,notonlydatafromthesewebsites,butalsopersonallyrecordedvoiceswereused.Attheend,thisresearchprovidestheaccuracyandconfidencelevelofmultipleNeuralNetworkarchitectures,SupportVectorMachineandHiddenMarkovModel,withtheHiddenMarkovModelyieldingthebestresultsreachingalmost70percentaccuracyforalllanguages.IndexTerms—HiddenMarkovModel,LanguageIdentification,LanguageTranslation,NeuralNetworks,SupportVectorMachine.I.INTRODUCTIONThereareroughly6,500spokenlanguagesavailableintheworldtoday.Whilesomeofthemarepopular,i.e.Chinesespokenbyover1Billionpeople,thereareotherssuchasLiki,Njerepwhichareonlyusedbylessthan1,000people.Definitely,understandingthespokenlanguageandcorrespondingaccordinglywhenneededarevitaltocreatecommunicationbetweendifferentbackgroundandlanguagespeakingpeople.Spokenlanguagerecognitionreferstotheautomaticprocessthatdeterminestheidentityofthelanguagespokeninaspeechsample.Thistechnologycanbeusedforawiderangeofmultilingualspeechprocessingapplications,suchasspokenlanguagetranslationand/ormultilingualspeechrecognition[1].Inpractice,spokenlanguagerecognitionisfarmorechallengingthantext-basedlanguagerecognitionbecausethereisnoguaranteethatamachineisabletotranscribespeechtotextwithouterrors.Weknowthathumansrecognizelanguagesthroughaperceptualorpsychoacousticprocessthatisinherenttotheauditorysystem.Therefore,thetypeofspeechperceptionthathumanlistenersuseisalwaysthesourceofinspirationforautomaticspokenlanguagerecognition[2].ThispaperoutlineshowNeuralNetworkscanbetrainedviaSupportVectorMachineandHiddenMarkovModeltoautomaticallyidentifythelanguagedirectlyfromspeechusinglibrariessuchasTensorFlow[3].Thiscanbedoneinmanydifferentwaysastheresultsdependonanumberoffactors:forexamplethediversityandsizeofthetrainingdata,and/ortheNeuralNetworkmodel.Italsodemonstrateshowthetrainingdatawasgathered,theproblemsfacedandhowtestingwasconducted.Thepaperisorganizedasfollows:SectionIIintroducespreviousattemptsatautomaticspokenlanguageidentificationusingphonemeclassifiers,statisticalrecurrentNeuralNetworks,anddeeplearningapproaches.SectionIIIcoverswhatmodelswerechosentoimplementtheNeuralNetworkaswellashowthetrainingdatawasgatheredandhowthetrainingwasconducted.SectionIVdetailshowthetestingwasperformedandhowtheperformanceofthemultiplemodelswereevaluatedaswellastheresults.TheconclusionisdrawnoutinSectionValongwiththefutureimprovementsthatcanbemade.YourgoalistosimulatetheusualappearanceofpapersinaJournaloftheAcademyPublisher.Wearerequestingthatyoufollowtheseguidelinesascloselyaspossible.II.RELATEDWORKSSeveralattemptshavealreadybeenmadetorecognizespokenlanguageswithdifferentdatasets.WhileNeuralNetworkswasusedforvarietyofidentification/classificationproblems[4,5],thissectioncoverssomeofthemostpromisingrecentworksinautomaticspokenlanguageidentificationusingNeuralNetworks:Srivastavaetal.[6]usealanguage12AutomaticSpokenLanguageRecognitionwithNeuralNetworksCopyright©2018MECSI.J.InformationTechnologyandComputerScience,2018,8,11-17independentphonemeclassifierthatextractsthesequenceofphonemesfromanaudiofile.ThephonemesequenceisthenclassifiedusingstatisticalandrecurrentNeuralNetworkmodels(RNNs).Withphonemeclassificationofthreelanguages(Turkish,UzbekandMandarin)authorsachievedanaverageaccuracyof58%.Montavon’s[7]attemptusesadeepneuralnetworkfeaturingthreeconvolutionallayersaswellasasmallerattemptwithasingleconvolutionallayertime-delaymodel.TryingtoclassifyEnglish,German,andFrench,authorachieves91.3%accuracyforknownspeakers(usedintraining)and80%forunknownspeakers.Leiet.alproposedtwonovelfrontendsforrobustlanguageidentification(LID)usingaconvolutionalneuralnetwork(CNN)trainedforautomaticspeechrecognition(ASR)[8].IntheCNN/i-vectorfrontend,theCNNisusedforgettingtheposteriorprobabilitiesfori-vectortrainingandextraction.Theauthorsevaluatedtheirapproachonheavilylowqualityspeechdata,butstilltheywe