AI Projects #5
We concluded our 3-month AI Projects #5 journey with an online showcase attended by 72 guests in total. Six teams and 25 participants presented the results of their projects, which spanned a wide variety of domains including computer vision, natural language processing, audio processing, and speaker identification.
This was the first AI Projects we held fully online due to the precautions taken because of the COVID-19 outbreak, and we were able to come together with AI enthusiasts not only from other cities but also from abroad.
Three teams decided to write workshop papers in this batch, and our social good project “Traffic Anomaly Detection”, carried out within the framework of CVPR’s NVIDIA AI City Challenge, also achieved good results, which is an exciting accomplishment!
You can watch the showcase here and check out the projects' GitHub repository here.
PROJECTS
SYNTHETIC MUSIC GENERATION
M. Şafak Bilici, Onur Boyar, Yüşa Ömer Altıntop
This team worked on MuseGAN [1] and JukeBox [2], which are both deep learning models that generate synthetic music. MuseGAN is built from several components: the composer model, the jamming model, and the hybrid model that combines the two. All of these models are different GAN architectures. JukeBox, on the other hand, makes use of another generative model, namely the Vector Quantized Variational Autoencoder (VQ-VAE). The team used the upsampler module to generate music with different parameters, and everyone watching the showcase had the chance to listen to some music generated by AI (a normal day at inzva). As a highlight from the demo, the team generated a mix featuring Alice in Chains over fun lyrics that speak dearly to our hearts.
You can hear them here. Our favorite is the mix with Nirvana and Alice in Chains!
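To give a feel for the "vector quantized" part of VQ-VAE, here is a minimal sketch of the codebook lookup: each encoder output is replaced by its nearest codebook entry. This is not Jukebox's code. The codebook, the vectors, and the function names are made up for illustration, and training is left out entirely.

```python
# Toy illustration of vector quantization, the core idea behind VQ-VAE.
# The codebook and encoder outputs are hypothetical values, not Jukebox's.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_outputs = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]

# Each continuous vector becomes a discrete code index.
codes = [quantize(v, codebook)[0] for v in encoder_outputs]
print(codes)  # [1, 2, 0]
```

The sequence of discrete indices is what a model of this kind works with in place of the raw continuous signal.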
BUILDING DOMAIN-SPECIFIC LANGUAGE MODEL FOR NLP DOWNSTREAM TASKS
Okan Çiftçi, Sinan Çalışır, Saliha Yeşilyurt, Alaeddin Selçuk Gürel, Ateş Bilgin
The members of this team were all natural language processing enthusiasts who wanted to train a language model: not just any general-purpose model, but one specific to the medical domain. To achieve their goal, the team chose academic papers published on PubMed and PubMed Central as their data and put together more than 110 million sentences. They used the ELECTRA-small [3] model for its efficiency and achieved very good results on the named entity recognition task. The team also applied their domain-specific language model to the question answering task. Thanks to their hard work, they received good feedback from everyone in our AI community.
We are very pleased to know that this project will be turned into an academic paper to be submitted to the Clinical Natural Language Processing Workshop held as a part of the EMNLP conference.
ANCHOR-FREE APPROACHES IN 2D OBJECT DETECTION (CORNERNET, CENTERNET)
Burak Teke, Can Bulguoğlu, Berk Kapucu, Berkay Alan, Ayşe Mutlu
In this project, the team examined a novel problem: fast and accurate object detection without using anchors or region proposals. Even though object detectors that utilize anchor boxes usually achieve good results, they have many hyper-parameters to tune, and they are slow in both training and inference. To overcome these drawbacks, keypoint-based detectors such as CornerNet [4] and CenterNet [5] were introduced.
These detectors perform very well by replacing the anchor boxes with a pair of corner keypoints (CornerNet) or a triplet of keypoints (CenterNet). The team familiarized themselves with the subject and gained a great deal of knowledge while implementing the keypoint-based detectors. They presented the subject and their achievements at the showcase and received great feedback and comments on their informative presentation.
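The difference between a corner pair and a keypoint triplet can be sketched in a few lines. The code below is a simplified illustration, not the team's implementation or the papers' exact procedure: a box is formed from a top-left and a bottom-right corner, and it is kept only if some predicted center keypoint falls inside the box's central region. The fixed middle-third region (`n=3`) and all function names are assumptions made for this example.

```python
# Simplified sketch of keypoint-based box decoding.
# Function names and the fixed central-region scale are illustrative assumptions.

def box_from_corners(top_left, bottom_right):
    """Form a box (x1, y1, x2, y2) from a corner pair, or None if invalid."""
    (x1, y1), (x2, y2) = top_left, bottom_right
    if x2 <= x1 or y2 <= y1:
        return None  # the bottom-right corner must lie below and to the right
    return (x1, y1, x2, y2)


def central_region(box, n=3):
    """Return the middle 1/n portion of the box along each axis."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    lo, hi = (n - 1) / (2 * n), (n + 1) / (2 * n)
    return (x1 + w * lo, y1 + h * lo, x1 + w * hi, y1 + h * hi)


def has_center_keypoint(box, centers, n=3):
    """Triplet check: is any center keypoint inside the central region?"""
    cx1, cy1, cx2, cy2 = central_region(box, n)
    return any(cx1 <= x <= cx2 and cy1 <= y <= cy2 for x, y in centers)


box = box_from_corners((0, 0), (90, 60))
print(central_region(box))                  # (30.0, 20.0, 60.0, 40.0)
print(has_center_keypoint(box, [(45, 30)]))  # True
print(has_center_keypoint(box, [(5, 5)]))    # False
```

A corner-only detector would accept the box in both cases; the center check is what lets a triplet-based detector discard boxes whose corners were paired incorrectly.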
LOGOGRAM LANGUAGE GENERATOR
Selim Şeker, Güldeste Selen Dal, Yavuz Durmazkeser, Nurullah Cebeci
AI Projects #5 hosted a team of undergraduate students with outstanding skills, who tackled what might be one of the most interesting projects we have seen to date.
Inspired by the movie “Arrival”, this team wanted to build a deep learning model that generates a completely original logogram language, like the alien language in the movie. Their approach relied on multilingual word embeddings, in which the same word in different languages maps to similar vectors. They fed these embeddings to a generative model (a variational autoencoder) that they trained from scratch on the Omniglot dataset, which contains 1623 handwritten characters from 50 different alphabets. In the end, the team demonstrated the generated characters by entering words or sentences into a web-based interface.
TRAFFIC ANOMALY DETECTION
Abdullah Baş, Buse Kabakoğlu, Halil Eralp Koçaş, Uras Mutlu
The journey of the traffic anomaly detection project continues in this AI Projects program as well. For the first time ever in the history of the inzva AI Projects, a team continued their project from a previous batch.
After gaining hands-on experience in the domain of traffic anomaly detection during the previous AI Projects program, the team started working on an original solution this time, instead of reproducing the results of a previous study and trying to improve them. Team members observed that the state-of-the-art solutions to anomaly detection in traffic did not work in real time. They were all offline solutions tailored to the dataset provided in the NVIDIA AI City Challenge. The team wanted to approach the problem from a different perspective and decided to go for a solution that works in real time. By combining background modeling, YOLOv5 [6] as a fast object detector, and DeepSORT [7] as a tracker, the team developed their own algorithm with a variance-threshold-based approach. The team submitted their results and achieved a 0.61 F1 score, which is very close to third place in the 2020 NVIDIA AI City Challenge. They also gave a demonstration at the showcase, presenting the real-time dynamics of their model.
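The team's algorithm is not described in detail here, so the following is only a guess at how a variance threshold might be applied to tracker output in an online setting. It assumes that each tracked vehicle reports a box center per frame, and it flags a track as a possible stalled vehicle when the variance of its recent positions drops below a threshold. The class name, window size, and threshold are all hypothetical.

```python
# Hypothetical sketch: flag a tracked vehicle as possibly stalled when the
# variance of its recent positions falls below a threshold. This is NOT the
# team's algorithm; the window size and threshold are made-up values.
from collections import deque
from statistics import pvariance


class StallDetector:
    def __init__(self, window=5, threshold=1.0):
        self.window = window
        self.threshold = threshold
        self.history = {}  # track_id -> recent (x, y) box centers

    def update(self, track_id, center):
        """Record one frame's position; return True if the track looks stalled."""
        points = self.history.setdefault(track_id, deque(maxlen=self.window))
        points.append(center)
        if len(points) < self.window:
            return False  # not enough frames observed yet
        spread = pvariance([p[0] for p in points]) + pvariance([p[1] for p in points])
        return spread < self.threshold


detector = StallDetector(window=5, threshold=1.0)

# Track 1 moves steadily, track 2 barely moves.
moving = [(10 * i, 0) for i in range(5)]
parked = [(50, 50), (50.2, 50), (49.9, 50.1), (50, 50), (50.1, 49.9)]

moving_flags = [detector.update(1, p) for p in moving]
parked_flags = [detector.update(2, p) for p in parked]
print(moving_flags[-1], parked_flags[-1])  # False True
```

Because each update only touches a short fixed-length window per track, a check of this kind can run frame by frame instead of needing the whole video in advance.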
SPEAKER IDENTIFICATION
Burak Bekci, Sadık Ekin Özbay, Emre Yüksel, Berkay Özerbay
In this project, the team aimed to identify a speaker by their voice. The team wanted to challenge themselves by implementing the baseline model from scratch. They used the VCTK and VoxCeleb datasets, and Deep CNNs with Self-Attention [8] as the baseline model. Their model consists of feature extraction, a CNN architecture, temporal pooling, and self-attention for modeling the sequence data. They achieved 96.4% accuracy on VCTK. On VoxCeleb they ran into an overfitting problem but still reached 73% accuracy.
The team also prepared a fun demonstration of their model: they recorded their own voices and found the celebrity voice closest to each. The demonstration works by extracting the features of the recorded voice, feeding those features to the CNN model to get an embedded representation, and then finding the nearest neighbor of that embedding in a celebrity voice database of precomputed embeddings.
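The last step of the demo, the nearest-neighbor lookup, can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are tiny made-up vectors, the speaker names are placeholders, and cosine similarity is an assumed choice of metric, since the metric the team used is not specified.

```python
# Minimal sketch of nearest-neighbor lookup over precomputed voice embeddings.
# Embedding values, names, and the cosine metric are illustrative assumptions.
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_speaker(query, database):
    """Return the name whose stored embedding is most similar to `query`."""
    return max(database, key=lambda name: cosine_similarity(query, database[name]))


# In the real demo these would come from the CNN; here they are placeholders.
celebrity_embeddings = {
    "speaker_a": [0.9, 0.1, 0.0],
    "speaker_b": [0.0, 1.0, 0.2],
    "speaker_c": [0.1, 0.2, 0.9],
}
recorded_embedding = [0.2, 0.9, 0.1]

print(closest_speaker(recorded_embedding, celebrity_embeddings))  # speaker_b
```

Since the celebrity embeddings are computed once and stored, only the recorded voice has to pass through the model at demo time.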
REFERENCES
[1] Dong, H. W., Hsiao, W. Y., Yang, L. C., & Yang, Y. H. (2017). MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. arXiv preprint arXiv:1709.06298.
[2] Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., & Sutskever, I. (2020). Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341.
[3] Clark, K., Luong, M. T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
[4] Law, H., & Deng, J. (2018). CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 734-750).
[5] Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 6569-6578).
[7] Wojke, N., Bewley, A., & Paulus, D. (2017, September). Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP) (pp. 3645-3649). IEEE.
[8] An, N. N., Thanh, N. Q., & Liu, Y. (2019). Deep CNNs with self-attention for speaker identification. IEEE Access, 7, 85327-85337.
Our AI Projects will be taking new applications for the spring term starting from January 7, and the new batch will take off on February 6!
inzva is supported by BEV Foundation, an education foundation for the digital native generation which aims to build communities that foster peer-learning and encourage mastery through one-to-one mentorship.
Subscribe to our newsletter here and stay tuned for more hacker-driven activities.