Through my courses, I have had the opportunity to work with some fantastic professors. My coursework has enabled me to engage deeply with concepts in statistics, machine learning, and data science techniques. I am proficient in R and Python, with skills in data management, data wrangling, and statistical modelling. I interned at M. N. Dastur & Co. Pvt. Ltd. with the Data Engineering and Electronics Department. Before that, I worked as a Data Science Intern at MinD Webs Ventures.
I have taken courses on Probability Distributions & Statistical Inference, Data Wrangling, Linear Regression, and Time Series Analysis. I am currently taking Data Management, Machine Learning, and Statistical Modelling. Later, I intend to take Deep Learning and NLP courses.
Beyond academics, I am a seasoned quizzer, a professional EMCEE, and an avid sports fan, and I play video games in my leisure time.
My research interests have traditionally spanned various sub-domains of Data Science and Machine Learning, such as Natural Language Processing, Neural Networks, and Statistical Modelling, with a focus on designing efficient models for real-world applications. I have also had the opportunity to work specifically with Long Short-Term Memory networks (LSTMs) and Temporal Convolutional Networks for time series forecasting, and with CNNs for medical image processing. Recently, I have taken a liking to semi-supervised and unsupervised Large Language Models (LLMs) and Responsible AI as well. I am currently conducting an extensive literature study on offensive speech detection and hallucination in generative AI.
I am always open to new collaborations and research ideas, and I am looking to collaborate on research projects.
Projects
2024
Automatically Auditing LLMs by Discrete Optimization
Mayukh Sen
Report /
slides /
Code
We automatically audit LLMs by specifying and solving a discrete optimization problem that searches for input-output pairs. We look for non-toxic prompts paired with toxic outputs, and for exact matches on politically nuanced prompts, by reversing the LLM.
Used coordinate ascent (ARCA) to reverse the LLM and a log-probability loss to assess LLM performance.
Used GPT-2, BERT, and GPT-J with the Civil Comments dataset to run experiments and detect toxic outputs.
Observed similar success rates across prompt lengths for all LLMs, demonstrating the effectiveness and robustness of ARCA for auditing LLMs.
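To illustrate the core idea, here is a minimal, hypothetical sketch of greedy coordinate ascent over prompt tokens. This is not the paper's optimized ARCA implementation; the target string, prompt seed, and candidate pool are stand-ins.

```python
# Toy coordinate ascent: perturb one prompt token at a time, keeping changes
# that raise the log-probability of a fixed target output under GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

target_ids = tok.encode(" some fixed target text", return_tensors="pt")[0]
prompt_ids = tok.encode("hello world example", return_tensors="pt")[0]

def target_logprob(prompt_ids, target_ids):
    """Log-probability of the target continuation given the prompt."""
    ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0], dim=-1)
    start = len(prompt_ids)
    # the token at position i is predicted by the logits at position i - 1
    return sum(logprobs[start + j - 1, target_ids[j]] for j in range(len(target_ids)))

candidates = torch.randint(0, tok.vocab_size, (32,))  # tiny random candidate pool
for _ in range(2):                                    # a couple of sweeps (toy scale)
    for pos in range(len(prompt_ids)):                # one coordinate at a time
        best, best_score = prompt_ids[pos].item(), target_logprob(prompt_ids, target_ids)
        for c in candidates:
            prompt_ids[pos] = c
            score = target_logprob(prompt_ids, target_ids)
            if score > best_score:
                best, best_score = c.item(), score
        prompt_ids[pos] = best                        # keep the best token found

print(tok.decode(prompt_ids), "->", float(best_score))
```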
Twitter Search App
Mayukh Sen, Brandon Salter, Divya Shah, Max Jacobs
Report /
slides /
Code
Built a search app that retrieves tweets, retweets, and the user info of the users tweeting. Data was streamed from Twitter and loaded into MongoDB and Google BigQuery.
Performed Extract, Transform and Load (ETL) operations to clean and preprocess the data. User data is stored in MongoDB, whereas tweets and retweets are stored in BigQuery to exploit the relational structure of tweets and use a foreign key to relate them.
Duplicates are avoided to ensure data consistency in the database. Search functions are built from user input using f-strings to construct SQL queries; analogous functions query JSON objects and arrays in MongoDB.
A cache is implemented to speed up searches, using a heap to store searches as key-value pairs; it yielded significant latency and throughput improvements (a minimal sketch follows).
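The cache design can be sketched roughly as follows. This is an assumed structure rather than the project's exact code, and run_search is a hypothetical stand-in for the SQL/MongoDB search layer.

```python
# Search cache storing query results as key-value pairs; a heap keyed on
# insertion time evicts the oldest entry when the cache fills up.
import heapq
import time

class SearchCache:
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.store = {}   # query string -> cached result
        self.heap = []    # (insertion time, query string), oldest on top

    def get(self, query):
        return self.store.get(query)  # None signals a cache miss

    def put(self, query, result):
        if query not in self.store and len(self.store) >= self.capacity:
            while self.heap:          # pop until we evict a live entry
                _, oldest = heapq.heappop(self.heap)
                if oldest in self.store:
                    del self.store[oldest]
                    break
        self.store[query] = result
        heapq.heappush(self.heap, (time.time(), query))

def run_search(query):
    """Hypothetical stand-in for the real SQL/MongoDB search layer."""
    return []

cache = SearchCache()
query = "tweets from:nasa"
if (result := cache.get(query)) is None:  # only hit the databases on a miss
    result = run_search(query)
    cache.put(query, result)
```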
Neural Networks Interpretability - Merck Challenge
Mayukh Sen, Brandon Salter, Rohit Vernekar
Merck Challenge in association with the Rutgers Statistics Department
Report /
slides /
Code
Neural networks are widely considered to be "black box" models, and the interpretability of deep learning models is still an area of active research. To better understand interpretability, my peers and I performed experiments leveraging existing statistical methods.
We conducted experiments on the MNIST dataset, working with a CNN model for the purpose of interpretability. We also reviewed pretrained models on MNIST, performed a thorough analysis and evaluation of their performance on the data, and fine-tuned hyperparameters to improve accuracy and performance.
Some techniques explored:
1. Conformal Predictions
2. Integrated Gradients
3. SHAP
We will perform further experiments to deepen the understanding we have achieved so far; a minimal sketch of Integrated Gradients follows.
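As one example of the techniques listed above, here is a minimal sketch of Integrated Gradients; the CNN below is an untrained stand-in rather than our project model.

```python
# A minimal sketch of Integrated Gradients (Sundararajan et al., 2017) for a
# small CNN on an MNIST-shaped input; the model here is an untrained stand-in.
import torch
import torch.nn as nn

model = nn.Sequential(  # illustrative CNN, not the project's trained model
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 28 * 28, 10),
)

def integrated_gradients(model, x, target, steps=50):
    """Approximate IG: average gradients along the straight path from a
    zero baseline to x, then scale by (x - baseline)."""
    baseline = torch.zeros_like(x)
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0, 1, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        score = model(point)[0, target]
        score.backward()
        total += point.grad
    return (x - baseline) * total / steps

x = torch.rand(1, 1, 28, 28)          # stand-in for an MNIST digit
attribution = integrated_gradients(model, x, target=3)
print(attribution.shape)              # per-pixel attribution map
```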
Leveraging Mistral's AI for peaceful dialogue, this project classifies text to promote understanding. It fine-tunes Mistral-7B on datasets of unity-focused content, offering tools for positive communication across digital spaces.
The methodology involves fine-tuning Mistral-7B on a dataset of positive, peaceful dialogue. This training improves the model's ability to recognize and classify text that promotes understanding and unity, enabling the identification of constructive communication and fostering a more harmonious online discourse.
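A rough sketch of the fine-tuning setup, assuming a standard Hugging Face sequence-classification pipeline; the checkpoint name and the two-example dataset are illustrative stand-ins, not the project's actual data.

```python
# Fine-tuning Mistral-7B as a binary classifier of peaceful vs. non-peaceful
# text with Hugging Face Transformers (assumed setup, not the exact pipeline).
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import Dataset

model_name = "mistralai/Mistral-7B-v0.1"   # assumed base checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token               # Mistral has no pad token by default
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tok.pad_token_id

# Illustrative examples; the real project uses a curated unity-focused dataset.
data = Dataset.from_dict({
    "text": ["Let's find common ground.", "You people are the worst."],
    "label": [1, 0],
})
data = data.map(lambda b: tok(b["text"], truncation=True, padding="max_length",
                              max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="peaceful-clf",
                           per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=data,
)
trainer.train()   # in practice, LoRA/QLoRA would be needed to fit 7B in memory
```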
2023
Chicago Crime Dataset Time-Series Analysis and Forecasting using an LSTM (RNN)-based approach
Mayukh Sen, Max Jacobs, Krish Shah
Report /
slides /
code /
dataset
In this comprehensive time series forecasting project, we analyzed the Chicago Crime Dataset spanning from 2008 to 2023, which comprises 22 columns and over 4 million rows. The initial phase of the project involved meticulous preprocessing, including the removal of NA values and irrelevant columns, along with the conversion of data into appropriate datatypes. We also engaged in feature engineering to enhance the dataset's utility for analysis.
Our exploratory data analysis yielded valuable insights through informative plots, heatmaps, and heat matrices, offering a detailed understanding of crime statistics in Chicago. This included identifying trends and seasonal components in the data, as well as conducting spatial analysis to pinpoint crime hotspots in the city.
For forecasting crimes per day in these hotspots, we utilized a range of time series analysis and forecasting models, including ARIMA, SARIMA, TBATS, Holt-Winters, and LSTM. Among these, Holt-Winters, TBATS, and LSTM proved particularly effective in capturing seasonality. Notably, the LSTM model, which featured two layers, delivered superior forecasts on test data, achieving the lowest Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE).
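A minimal sketch of the two-layer LSTM forecaster, with assumed window size and hyperparameters and a random stand-in for the daily crime counts.

```python
# Two-layer LSTM forecasting the next day's count from a 30-day window.
import numpy as np
import tensorflow as tf

WINDOW = 30
series = np.random.rand(1000).astype("float32")  # stand-in for daily counts

# Build (samples, WINDOW, 1) windows and next-day targets.
X = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])[..., None]
y = series[WINDOW:]

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(WINDOW, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)

# Forecast the day after the last observed window.
forecast = model.predict(series[-WINDOW:].reshape(1, WINDOW, 1))
```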
In this survival analysis project, I utilized the METABRIC dataset comprising clinical profiles of 2,509 breast cancer patients to model time-to-event data, focusing on survival and relapse rates. Employing advanced statistical techniques, I conducted exploratory data analysis, preprocessing, and data cleaning, including handling missing values and label encoding. This project featured the implementation of Kaplan-Meier estimators for non-parametric survival function modeling and Cox Proportional Hazards models for semi-parametric analysis, assessing the impact of various covariates on patient survival times. This comprehensive analysis was aimed at deriving insights into factors influencing breast cancer prognosis, showcasing proficiency in survival analysis and data science methodologies.
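The two estimators can be sketched with the lifelines library as follows; the column names and the tiny table below are assumed stand-ins for the METABRIC clinical data.

```python
# Kaplan-Meier for the survival curve, Cox proportional hazards for
# covariate effects (illustrative data, not METABRIC).
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

df = pd.DataFrame({
    "months": [10, 34, 58, 22, 80, 15, 47, 62],  # time to event or censoring
    "event":  [1, 0, 1, 1, 0, 1, 0, 1],          # 1 = death observed, 0 = censored
    "age":    [55, 61, 48, 70, 52, 66, 59, 44],
    "tumor_size": [22, 15, 30, 41, 18, 27, 35, 12],
})

km = KaplanMeierFitter().fit(df["months"], event_observed=df["event"])
km.plot_survival_function()                 # non-parametric survival curve

cph = CoxPHFitter().fit(df, duration_col="months", event_col="event")
cph.print_summary()                         # hazard ratios for age, tumor_size
```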
The analysis focused on the FAIR plan, a default policy offered by the City of Chicago for homeowners, using data compiled by Andrews and Herzberg (1985) and available in the R 'faraway' package. The objective was to investigate how various factors, such as race, frequency of fires, instances of theft, age of housing, income levels, and geographical location (North or South Chicago), impact the rate of policy renewals per 100 housing units.
A comprehensive sensitivity analysis was conducted to ascertain the best-fitting model. This included an examination of outliers using Cook's Distance to ensure data integrity and model reliability. Additionally, a bootstrapping approach was employed to validate the robustness of the analysis.
One of the key findings was the differential impact of race as a factor in policy renewal rates between North and South Chicago. While racial demographics were not a significant predictor of renewal rates in South Chicago, they emerged as a significant factor in North Chicago. This insight highlights the importance of considering geographical nuances in policy analysis and underscores the complexity of factors influencing homeowners' decisions to renew their FAIR plan policies. The study's methodology and findings contribute valuable perspectives to urban policy planning, particularly in the context of insurance renewals and their socioeconomic determinants.
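A rough sketch of the core model fit and outlier check, assuming the data matches the faraway package's chredlin table (columns race, fire, theft, age, involact, income, side) as fetched through statsmodels' Rdatasets helper.

```python
# Linear model for the Chicago FAIR-plan data with a race:side interaction
# to probe the differential effect of race by location, plus Cook's distance.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

chicago = sm.datasets.get_rdataset("chredlin", "faraway").data

fit = smf.ols("involact ~ race * side + fire + theft + age + np.log(income)",
              data=chicago).fit()
print(fit.summary())

# Cook's distance flags influential observations (rule of thumb: > 4/n).
cooks = fit.get_influence().cooks_distance[0]
print(chicago.index[cooks > 4 / len(chicago)])
```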
In this project, we developed a sophisticated model to predict auto insurance claim costs based on a historical dataset encompassing 22,000 entries and 22 distinct features. The data was meticulously analyzed using R, which facilitated a thorough exploratory data analysis (EDA). This process entailed comprehensive data diagnostics, insightful visualizations, and strategic feature engineering to optimize the dataset for model training.
We employed the Light Gradient Boosting Machine (Light GBM) model for its efficiency and effectiveness in handling large datasets with high-dimensional features. This choice proved successful as our model demonstrated superior predictive capabilities, outperforming the established benchmark model. Specifically, our model achieved a notable Gini score of 0.19742, surpassing the benchmark's score of 0.1835. This significant improvement in prediction accuracy highlights the model's potential in accurately forecasting claim costs, thereby enabling the creation of a more precise and equitable rating plan for auto insurance claims. This advancement is crucial for insurance companies seeking to enhance their risk assessment strategies and pricing models, ultimately leading to more tailored and fair insurance offerings for their customers.
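A minimal sketch of the LightGBM regressor together with the normalized Gini metric used to score claim-cost models; the features and targets below are random stand-ins for the 22,000-row dataset.

```python
# LightGBM claim-cost model scored with the normalized Gini coefficient.
import numpy as np
import lightgbm as lgb

def gini(actual, pred):
    """Normalized Gini: rank actual losses by the model's predictions."""
    n = len(actual)
    order = np.argsort(pred)[::-1]                 # highest predictions first
    cum = np.cumsum(actual[order]) / actual.sum()  # cumulative share of losses
    g = cum.sum() / n - (n + 1) / (2 * n)
    # divide by the Gini of a perfect ordering to normalize
    perfect = np.cumsum(np.sort(actual)[::-1]) / actual.sum()
    return g / (perfect.sum() / n - (n + 1) / (2 * n))

rng = np.random.default_rng(0)
X = rng.normal(size=(22000, 22))                   # stand-in for the 22 features
y = np.maximum(rng.normal(size=22000), 0)          # stand-in for claim costs

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X[:20000], y[:20000])
print("Gini:", gini(y[20000:], model.predict(X[20000:])))
```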
2022
Identification and Classification of Diabetic Retinopathy
Mayukh Sen, Agni Sain, Ayush Singh, Himadri Sekhar Dutta
Final Year Project, B.Tech
Report /
slides /
code
In this project we built an application for the identification and classification of diabetic retinopathy from fundus images of human eyes, achieving an impressive accuracy of 96.1% in classifying diabetic retinopathy severity levels.
Trained the model using Convolutional Neural Networks (CNNs) such as MobileNetV2 and EfficientNet on the training dataset to enhance robustness and predictive power. Employed TensorFlow and Keras for modelling.
Demonstrated proficiency in data preprocessing, augmentation, and transfer learning techniques.
Results were deployed on a website with a minimalistic UI, linked using a Flask API and MongoDB.
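A minimal sketch of the transfer-learning setup: a frozen pretrained MobileNetV2 backbone with a new classification head. The image size, class count, and directory path are assumptions, not the project's exact configuration.

```python
# Transfer learning: pretrained MobileNetV2 features + new softmax head
# for the five diabetic-retinopathy severity levels (assumed count).
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False                      # freeze pretrained features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation="softmax"),  # severity levels 0-4
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Fundus images organized in per-class subfolders (hypothetical path).
train = tf.keras.utils.image_dataset_from_directory(
    "fundus/train", image_size=(224, 224), batch_size=32)
model.fit(train, epochs=10)
```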
Teaching
I have been working as an Academic Success Mentor specialising in Statistics for Thrive Student Support Services at Rutgers University, New Brunswick. My responsibilities include teaching statistics to undergraduate students at Rutgers.
I have led sessions on Probability Distributions, Statistical Inference, Hypothesis Testing, Confidence Intervals, Likelihood Estimators, and Bayesian Analysis.
I have designed workshops on R and its statistical applications that enabled students to approach statistical problems analytically and ultimately earn stellar grades in related coursework. I have also assisted students in developing better study habits and a deeper understanding of the subject matter through one-on-one sessions.
Won 1st position in AI Entrepre-Neural 2021, organized by GES, IIT Kharagpur and Intel
This work focuses on two lightweight traffic sign classification implementations that can predict traffic signs from any real-time video feed.
A model based on a slightly enhanced LeNet architecture was trained on the German Traffic Sign Dataset (GTSD), which has over 70,000 images of traffic signs across more than 40 classes. Our model achieves a validation accuracy of over 98% and a training accuracy of over 97%. The saved model is then optimized with the Intel OpenVINO Model Optimizer and Inference Engine and run directly to predict traffic signs live from any video source (we used a webcam for our run). We have also provided a non-optimized solution for comparison purposes; a sketch of the LeNet-style model appears below.
*Equal Contribution
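A minimal sketch of a LeNet-style classifier in Keras; the 32x32 input size, 43-class output, and the dropout "enhancement" are assumptions rather than the exact architecture used.

```python
# LeNet-style CNN for small RGB traffic-sign images; the dropout layer is
# one plausible "slight enhancement" over the vanilla architecture.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, 5, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(16, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(84, activation="relu"),
    tf.keras.layers.Dense(43, activation="softmax"),  # assumed class count
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```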
Extra-Curriculars
Finished in the top 5% of the Bloomberg Global Trading Challenge as the highest-ranked team from Rutgers University: top 50 in a pool of 2,000+ teams, with a profit of $98,112.
I am an articulate quizzer and I make quizzes (trivia contest question sets) for several inter-college competitions. I was the captain of my college team and led it to multiple regional competition victories.
Acted as the lead EMCEE (host) and lead event manager for my college cultural fest. Hosted concerts in front of crowds of 15,000+.
Our team, EcoGeeks, qualified for the Hult 2021 Mumbai regionals. For context, only around 40 teams across all of India do so.