Innovation Challenge for Development of Machine Aided Translation System

Important Notice

The language data (parallel text in multiple languages) are available with the user agencies as given below:

S.No	User Agency	Language-Pairs	Contact Person	Email
1	NCERT	English-Hindi, English-Urdu	Primary: Dr Amarendra Behera, Joint Director, CIET; Alternate: Dr Rejaul Karim	jointdirector@ciet.nic.in; rejaul.karim@ciet.nic.in
2	Vigyan Prasar	English-Hindi, English-Bengali	Dr. Rintu Nath, Scientist ‘F’, Vigyan Prasar	rnath@vigyanprasar.gov.in

The participating teams may approach the contact person given above for details of the language data and how to obtain it. The teams should assess the feasibility of creating a text corpus from the available data and select the language pair accordingly.
The Election Commission has informed that the material available in the public domain on its website may be used to create a parallel corpus for this challenge. The English-Hindi language pair may be taken up for this user agency.
In addition to the above user-specific content, the participating teams are advised to use the parallel corpus available in the Samanantar dataset, which is in the public domain.

The deadline for Stage 1 (Ideation Stage) submissions to the Innovation Challenge for development of Machine Aided Translation System (ICMATS) by registered participants was 15th November 2022.

Guidelines for Submission at Stage-I

At the end of Stage-I, each participating team must submit a document containing responses to each of the tasks listed for the Ideation Stage (Annexure-I). The team must provide a cohesive and comprehensive description of the strategy/approach/plan for every task, along with algorithms and diagrams wherever needed. The information given must be self-explanatory and self-contained. References must be provided wherever necessary. The responses must be prepared by the team after in-depth study of the recent advances in the area. Copy-pasting of content is highly discouraged.

In addition to the responses to the tasks, the team must provide information on the company/start-up in Annexure-II. Supporting documents (in PDF format) for the information provided must be submitted. The curriculum vitae of each team member who is likely to work on the project if the team is selected for Stage-II must be attached.

From the queries received from some of the participating teams, it appears that some of the terms used in the list of tasks are not clear to them. A document is attached at Annexure-III to clarify the terms used. If any team is still not clear on any issue, it may contact the Organizing Team immediately at icmats@investindia.gov.in.

The team must go through the terms & conditions given in the information brochure (also on the innovation challenge webpage) and submit its acceptance to qualify for consideration at the first stage. Attention is drawn to Para 14 under Terms & Conditions, regarding making the source code and corpus available in the public domain so that these can be utilized by any Indian entity. The teams should not use any proprietary tool that would require a license for using the solution.

The evaluation will be done by a committee with members taken from premier academic institutions and industry. The screening will be based on the submitted documents. Only the selected participating teams will be invited for presentation before the Committee. The list of invited teams will be displayed on the Innovation Challenge webpage shortly after the end of the first stage.

Annexure I: Tasks for Ideation Stage

Annexure II: Format for Submission of Information at Ideation Stage

Annexure III: Tasks Explanation

Background

The Prime Minister’s Science, Technology, and Innovation Advisory Council (PM-STIAC) is an overarching Council that helps the Office of the Principal Scientific Adviser to the Government of India assess the status in specific science and technology domains, comprehend challenges at hand, formulate specific interventions, develop a futuristic roadmap and advise the Prime Minister accordingly. In March 2019, the Office of the Principal Scientific Adviser (PSA) announced nine National S&T Missions on the recommendations of the PM-STIAC. One of these nine Missions, the ‘Natural Language Translation’ Mission, aims to make opportunities and progress in science and technology accessible to all citizens in their mother tongue. Using a combination of machine and human translation, the Mission aims to enable bilingual access to teaching and research material – in English and one’s native Indian language.

Under this Mission, several initiatives have already been taken. Pilot projects have been sponsored with a consortium of academic institutions for further research and development in this field. The present Innovation Challenge is targeted at involving industry / start-ups to get machine-aided translation systems developed using open-source translation tools and text corpora available in the public domain, including those developed under the TDIL program of MeitY.

Objective

The Innovation Challenge aims at the development of usable and scalable text-to-text machine-aided translation systems for English to any Indian language (for which an adequate parallel text corpus is available) and vice versa, making use of open-source machine translation platforms and text corpora available in the public domain. The participating teams may use the public-domain Indian language corpus called “Samanantar” and the language resources/models/tools available in the public domain. Samanantar has an adequate parallel text corpus (more than 50 lakhs) for English and 8 Indian languages, viz. Bengali, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil and Telugu.

In order to customize the system for the domain of the user agency, the participating team will create additional parallel text corpus using the content made available by the user agencies viz. Election Commission, NCERT, Vigyan Prasar. The language-pairs for which parallel content is available from the user agencies in their respective domains will be displayed on the Innovation Challenge webpage before the ideation phase starts. The teams may tweak the existing translation models to improve the performance using innovative ideas of their own and also current best practices.

Further, the participating teams will also develop/customize necessary tools in the form of a translation workbench to help users to carry out the translation tasks such as making corrections in the machine translated sentences, etc. in a convenient and efficient way. In order to accomplish the task in a limited time frame, the participants are expected to customize/enhance the open-source tools / workbench already available in public domain rather than building such tools from scratch.

Eligibility Criteria (Who can Apply)

An Indian company registered under the Companies Act 2013. The term “Indian company” means a company in which 51% or more of the shareholding is held by Indian citizens or persons of Indian origin.
Startups complying with the definition as per the latest notification of DIPP.
Entities that are in the process of registration, with an undertaking that they will complete the registration by the time of final submission.

Stages in the Innovation Challenge

STAGE 01

Ideation Stage
The participating teams will present their innovative, cutting-edge ideas and approaches for the development of the Machine Aided Translation System (MATS). Up to 10 top teams will be selected at this stage on the basis of the merits and feasibility of the proposed solution as well as the capacity of the participating team. Each of the shortlisted teams will be given Rs. 2 lakhs at this stage. If the Steering Committee finds that fewer than ten teams are in a position to develop the MATS, it may recommend fewer than ten teams.

While submitting the proposal, the team has to select one language pair consisting of English and any Indian language for which adequate parallel text corpus is available in public domain and also parallel text corpus from the user agency for customisation. If the team has capacity to develop the system for another language pair, it can indicate so in the proposal. The Steering Committee will decide the language pair to be taken in consultation with the team.

KEY TASKS
1. Given parallel documents from the specified domains, a description of the team’s strategies to extract parallel sentences and strategies for bi-text mining.
2. Team’s approach for creation of dataset for training presuming NCERT, Election Commission and Vigyan Prasar as the user agencies. The features that will be added to enrich the model output.
3. Strategies adopted to support sentence tokenization and other necessary pre-processing tasks for selected Indian language and English.
4. Team’s plan to handle variation in the document format like multi column document, rich text document like presentations or textual information present in tables, headers, footers in the document.
5. Team’s plan to handle the above when font family is legacy, non-Unicode.
6. Proposal for system architecture and the methods to scale up. Description of the team’s API approach to integrate the system with other systems.
7. Description of the translation evaluation strategy and regression technique so that the incremental translation models can be deployed as more and more data gets collected.
8. Team’s strategy to correct the translation and capture the correction made so that data augmentation and/or translation quality enhancement is done.
9. The tools proposed by the team for translators so that translators can use the team’s interface to correct the sentence. Explanation of the same when translators are handling textual data present as subscript, superscript, numbers, table data.
10. Team’s strategies to deal with special translation preferences of users for a word or group of words.
11. Once a document is translated, the team’s strategy to provide the translated document to translator.
12. Team’s plans on using existing open-source tools in this challenge.
13. Capacity of the Participating Team to deploy and scale the Solution further as per the requirement of the User Agencies.
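
For illustration only, key task 3 (sentence tokenization for the selected Indian language and English) could be approached with a simple rule-based splitter. The sketch below is one possible starting point and not a method prescribed by the challenge. The function name `split_sentences` and the abbreviation list are hypothetical; the sketch assumes Hindi sentences end with the danda (।) and that a sentence mark counts as a boundary only when followed by whitespace or the end of the text.

```python
import re

# Hypothetical abbreviation list; a real system would load one per domain.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "No.", "viz."}

# Sentence-ending marks: . ? ! for English and the danda (U+0964) for Hindi.
_BOUNDARY = re.compile(r"[.?!\u0964]+")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at sentence marks, skipping marks that end an abbreviation."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        end = match.end()
        # A mark inside a token (e.g. the dot in "3.5") is not a boundary.
        if end < len(text) and not text[end].isspace():
            continue
        # Skip the mark if the token it closes is a known abbreviation.
        words = text[start:end].split()
        if words and words[-1] in abbreviations:
            continue
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    # Keep any trailing text that has no closing mark.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Passing a domain-specific abbreviation set as the second argument is one way to keep such a tokenizer extendable; a statistical or learned tokenizer may be preferable where rules prove too brittle.
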
STAGE 02

Prototype Stage
The shortlisted teams from Stage-I will work towards the development of the prototype and make a presentation to the Steering Committee. At most 3 teams will be shortlisted at this stage. Each of the shortlisted teams will be provided Rs. 12 lakhs. At this stage, the broad deliverables are the following:
- Enriching and augmenting the language data available in the public domain.
- System design and architecture, API design, deployment pipeline, telemetry data, model training, general engineering, glossary/TMX, translation memory, end-to-end document translation, translator User Interface (UI) workbench for proofreading and sentence correction within the UI, and export of the translated document.
KEY TASKS
1. Translators’ workbench
  1. Specification on the file format to start the translation process. Mention specifics of the format like DOCX, PPTX, PDF, TXT, etc.
  2. Process by which translators override the machine generated translation.
  3. Methodology adopted by translators to download/export the translated file and corrections, if needed.
  4. Process of post-editing: whether it is done within the tool or externally. Explanation of the post-editing of the exported documents.
2. Sentence tokenization
  1. Process of implementation of sentence tokenization for the selected Indian language and English.
3. Error reporting and feedback on translation quality
  1. Process by which translators provide feedback on the machine translated sentences.
  2. Not all intended document translations may succeed; the team’s approach to handling such cases in the workbench.
4. Translation engine
  1. The type of translation engine used and the kind of API integration it contains.
  2. The type of pre- and post-editing of sentences supported from a domain perspective. The process of handling “DO NOT TRANSLATE” phrases such as dates, times, names, salutations (Mr., Dr., etc.) and numbers.
  3. The type of models evaluated by the team and justification of arriving at the used model.
  4. Type of benchmark the team has executed against the available commercial tools for the sentence translation.
  5. Process of regression testing for model improvements.
5. Capture edited sentences or telemetry
  1. Process of capturing various instrumentation data like post edited sentence by translators and the way of leveraging it to improve the team’s model.
  2. Description of instrumentation or telemetry capturing approach.
6. Domain glossary and translation memory
  1. Process and type of domain glossary/TMX created.
  2. Process of enriching the translation memory.
  3. Clarification on scaling these features as the size of each increases over time.
7. Translate the file
  1. Demonstration of end-to-end simple document translation of the selected domain.
  2. Plan for handling complex documents such as those with background images, tables, bullets, headers and footers.
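
As an illustration of the “DO NOT TRANSLATE” handling mentioned under the translation engine tasks, one common approach is to replace protected spans with placeholders before translation and restore them afterwards. The sketch below is a minimal example under stated assumptions: the patterns, the `__DNTn__` placeholder format and the function names `mask` and `unmask` are all hypothetical, and it assumes the translation engine passes the placeholders through unchanged, which would need to be verified for the engine actually used.

```python
import re

# Hypothetical patterns for spans that should pass through untranslated.
DNT_PATTERNS = [
    r"\b\d{1,2}/\d{1,2}/\d{4}\b",   # dates such as 15/11/2022
    r"\b\d{1,2}:\d{2}\b",           # times such as 10:30
    r"\b(?:Mr|Mrs|Ms|Dr|Prof)\.",   # salutations
    r"\b\d+(?:\.\d+)?\b",           # any remaining numbers
]
_DNT = re.compile("|".join(DNT_PATTERNS))


def mask(sentence):
    """Replace each protected span with a numbered placeholder."""
    protected = []

    def _swap(match):
        protected.append(match.group(0))
        return f"__DNT{len(protected) - 1}__"

    return _DNT.sub(_swap, sentence), protected


def unmask(translated, protected):
    """Restore the original spans in the translated sentence."""
    return re.sub(r"__DNT(\d+)__",
                  lambda m: protected[int(m.group(1))],
                  translated)
```

In use, the masked sentence would be sent to the translation engine and `unmask` applied to its output; because placeholders are numbered, they can be restored correctly even if the target language reorders them.
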
STAGE 03

Solution Building Stage
The shortlisted teams from Stage II will work towards building the solution. A development cost of Rs. 20 lakhs shall be provided after satisfactory completion of the work of this stage. At this stage, the broad deliverables are the following:
- Assuming sentence or paragraph translation is delivered in working condition, the system should have a sentence tokenizer for Hindi and English, and a UI tool where translators can correct the translated sentence.
- An improved translation engine and various issues captured and rectified
- Enhancement made to analyse post-edited sentences, visualization of metrics
- Enhance sentence memory and glossary to control the translation output. Enhancements in TMX and translation memory, exporting translated document.
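
To illustrate how a translation memory can control the translation output, the sketch below stores previously approved sentence pairs and returns the closest stored translation when a new source sentence is similar enough. It is only a minimal in-memory example: the class name `TranslationMemory`, the 0.8 threshold and the character-level similarity from Python's standard `difflib` module are assumptions for illustration, and a production system would need an indexed store to scale.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """In-memory store of (source, target) pairs with fuzzy lookup."""

    def __init__(self):
        self._entries = {}

    def add(self, source, target):
        """Store an approved translation for a source sentence."""
        self._entries[source] = target

    def lookup(self, source, threshold=0.8):
        """Return (target, score) for the closest stored source, or None."""
        best_score, best_target = 0.0, None
        for stored, target in self._entries.items():
            # Character-level similarity in [0, 1], case-insensitive.
            score = SequenceMatcher(None, source.lower(), stored.lower()).ratio()
            if score > best_score:
                best_score, best_target = score, target
        if best_target is not None and best_score >= threshold:
            return best_target, best_score
        return None
```

A fuzzy match below an exact score would normally be shown to the translator as a suggestion rather than applied automatically.
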
KEY TASKS
1. Translators’ workbench
  1. Enhancements made to ease the translators’ work in saving and proofreading translated sentences.
  2. Interactive translation, where complex sentences can be corrected on the fly with the translators’ help.
2. Sentence tokenization
  1. Process of extending sentence tokenizer for other Indic languages.
  2. Each domain will have various acronyms and specific phrases that might interfere with tokenization; the process of making the tokenizer extensible to handle them.
3. Error reporting and feedback on translation quality
  1. Process of capturing and comparing each machine-translated sentence against its edited version. Explanation of strategies developed for the same.
4. Telemetry
  1. Enhancement and metrics generated from edited sentences.
5. Domain glossary and translation memory
  1. Plan to scale these features over time as size increases.
6. Performance and throughput
  1. Presentation of system performance metrics, such as the load the system can handle, and demonstration of the same using the regression test cases.
7. Document translation
  1. Improvement and enhancements made by the team during this stage.
  2. Demonstration of a simple document translation in end-to-end fashion that illustrates all the functionalities provided to the translator.
  3. Demonstration of a complex document translation in the chosen domain.
  4. Exporting of the final translated document from the team’s solution. Demonstration of this functionality with one simple and one complex document.
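
As an illustration of comparing machine-translated sentences against their edited versions, one simple telemetry metric is the number of word-level edits the translator made, divided by the length of the edited sentence. The sketch below is loosely modelled on edit-rate metrics used in translation evaluation but is simplified (plain word-level Levenshtein distance on whitespace tokens, with no handling of word shifts); the function names are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def edit_rate(machine_output, post_edited):
    """Word edits from MT output to post-edited text, per post-edited word."""
    hyp, ref = machine_output.split(), post_edited.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(hyp, ref) / len(ref)
```

A rate of 0.0 means the translator accepted the machine output unchanged; tracking the average rate across model versions is one way to check that a newly deployed model has not regressed.
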

Terms & Conditions

All participants / teams need to fulfil the eligibility criteria to participate in the Challenge.
Each Team shall be a group of at least four members and will represent the company with which they are associated. The prize money will be assigned to the Company.
During the entire cycle of the Innovation Challenge, the Team Leader shall be considered the Single Point of Contact for all engagements & communications. Furthermore, the Team Leader cannot be changed during the course of the Innovation Challenge.
The Team Leader and Participants will be required to use their e-mail ID and Mobile number for the purpose of Team Registration and Account Creation on https://www.investindia.gov.in/innovation-challenge-for-development-of-machine-aided-translation-system for participation in the Innovation Challenge.
For any update regarding the Innovation Challenge, Participants will have to refer to https://www.investindia.gov.in/innovation-challenge-for-development-of-machine-aided-translation-system
All communication between the Innovation Challenge Steering Committee and Team Leader shall happen via the registered e-Mail ID only. This will be the only form of communication and any other forms of communication will not be entertained.
The Steering Committee may suggest modifications to the solutions proposed by the participating teams to meet the requirements of the user agencies. From time to time, the Committee may mentor the teams to ensure delivery of effective solutions.
The teams shall not use any proprietary solution or collaborate with companies that have existing proprietary solutions. Such entries, if identified, shall be liable for disqualification. An undertaking to this effect will have to be provided by the participating team.
Any outcome of this initiative shall be used by the participating team only for the purpose of the Innovation Challenge.
Teams shall maintain detailed documentation of their solution and different stages of the Innovation challenge, which will be uploaded by the respective teams on https://www.investindia.gov.in/innovation-challenge-for-development-of-machine-aided-translation-system for reference and record purpose. The Innovation Challenge Steering Committee reserves the right to review these documents any time during the program.
Any changes in Approach to the shortlisted solutions during Prototype or Solution Building stages of the Innovation Challenge will undergo deliberations and will require concurrence of the Steering Committee.
Teams are allowed removal/voluntary withdrawal of team members, only once, during the program before prototype stage. Any such step will have to be disclosed to the Innovation Challenge Steering Committee for approval. No other form of team modification will be entertained.
The funding under Innovation Challenge shall be consumed for development of the solution only. The Teams will be required to provide Fund Utilization Certificate before the Next Stage on the date decided & communicated by the Organizing Team.
The participating teams have to put the developed corpus and source code in open domain so that these could be utilized by any Indian entity. The participating teams too can utilize the corpus and source code for their own purposes once the contest is over. The corpus and code will also be made available on the language technology platform being established under Natural Language Translation Mission (NLTM) of Government of India.
The solution should not violate/breach/copy any idea/concept/product already copyrighted, patented or existing in this segment of the market. Anyone found to be non-compliant may have their participation cancelled.
The solution must have relevant privacy and security features. The solution must comply with the privacy and security laws and other related laws of the country.
Innovation Challenge Steering Committee will take the final call for any unforeseen situation. This includes the matters related to the Samanantar dataset and associated models. If the teams are unable to use the dataset due to any reason or if they find any quality issue, the Organizing Team will examine the matters and issue appropriate directions.
For any dispute arising out of this contest, the decision of the Scientific Secretary, O/o PSA will be final.
The solution/product so developed would be deployed in the chosen cloud infrastructure / environment as stipulated by the user agency for whom the solution is being developed.
Submissions will be considered void if they are, in whole or in part, illegible, incomplete, damaged, altered, counterfeit, obtained through fraud, or submitted late.

Timeline

S.No.	Timeline	Activity/Stage	Duration
1	Week 1 - Week 4	Registration	1 Month
2	Week 5 - Week 8	Ideation	1 Month
3	Week 9 - Week 16	Prototype	2 Months
4	Week 17 - Week 24	Solution Building	2 Months

Intellectual Property Rights

The Intellectual Property Rights (IPR) arising out of this challenge will belong to an agency designated by the PSA Office. The system developed or corpus created as part of the challenge must be kept in the public domain for use by startups, researchers, etc. The designated agency will take necessary steps to protect the intellectual property rights.

Further Development and Deployment

If the performance of the solution meets the expectations of the user agency, the team may get an opportunity to develop an enhanced system (which may involve other Indic languages) and deploy it for the participating user agencies on terms and conditions mutually agreed between the user agency and the team. The decision of the user agency will be final in this matter.