2019 Data Science Bowl

Overview

2019 Data Science Bowl

Uncover the factors to help measure how young children learn

Screenshot

Ignite Possibilities.

Uncover new insights in early childhood education and how media can support learning outcomes. Participate in our fifth annual Data Science Bowl, presented by Booz Allen Hamilton and Kaggle.

PBS KIDS, a trusted name in early childhood education for decades, aims to gain insights into how media can help children learn important skills for success in school and life. In this challenge, you’ll use anonymous gameplay data, including knowledge of videos watched and games played, from the PBS KIDS Measure Up! app, a game-based learning tool developed as a part of the CPB-PBS Ready To Learn Initiative with funding from the U.S. Department of Education. Competitors will be challenged to predict scores on in-game assessments and create an algorithm that will lead to better-designed games and improved learning outcomes. Your solutions will aid in discovering important relationships between engagement with high-quality educational media and learning processes.

Data Science Bowl is the world’s largest data science competition focused on social good. Each year, this competition gives Kagglers a chance to use their passion to change the world. Over the last four years, more than 50,000+ Kagglers have submitted over 114,000+ submissions, to improve everything from lung cancer and heart disease detection to ocean health.

For more information on the Data Science Bowl, please visit www.DataScienceBowl.com

Where does the data for the competition come from?

The data used in this competition is anonymous, tabular data of interactions with the PBS KIDS Measure Up! app. Select data, such as a user’s in-app assessment score or their path through the game, is collected by the PBS KIDS Measure Up! app, a game-based learning tool.

PBS KIDS is committed to creating a safe and secure environment that family members of all ages can enjoy. The PBS KIDS Measure Up! app does not collect any personally identifying information, such as name or location. All of the data used in the competition is anonymous. To view the full PBS KIDS privacy policy, please visit: pbskids.org/privacy.

No one will be able to download the entire data set and the participants do not have access to any personally identifiable information about individual users. The Data Science Bowl and the use of data for this year’s competition has been reviewed to ensure that it meets requirements of applicable child privacy regulations by PRIVO, a leading global industry expert in children’s online privacy.

What is the PBS KIDS Measure Up! app?

Screenshot

In the PBS KIDS Measure Up! app, children ages 3 to 5 learn early STEM concepts focused on length, width, capacity, and weight while going on an adventure through Treetop City, Magma Peak, and Crystal Caves. Joined by their favorite PBS KIDS characters, children can also collect rewards and unlock digital toys as they play. To learn more about PBS KIDS Measure Up!, please click here.

PBS KIDS and the PBS KIDS Logo are registered trademarks of PBS. Used with permission. The contents of PBS KIDS Measure Up! were developed under a grant from the Department of Education. However, those contents do not necessarily represent the policy of the Department of Education, and you should not assume endorsement by the Federal Government. The app is funded by a Ready To Learn grant (PR/AWARD No. U295A150003, CFDA No. 84.295A) provided by the Department of Education to the Corporation for Public Broadcasting.

My Solution 460 Features | Simple | Easy | Less_overfit | Fast

Screenshot

Simple, easy and fast and less overfitting solution with 460 features

This notebook shows problem solving approach using LightGBM Regression and 890 features computed by bruno aquino in the following notebook which are later reduced to 460 features in my approach.

https://www.kaggle.com/braquino/890-features

It also uses the regression coefficients from following notebook by artgor.

https://www.kaggle.com/artgor/quick-and-dirty-regression

Apart from these i also have included resultant LightGBM parameters from exhaustive parameter tuning.

If you find this notebook helpful please press that thumbs up button and thank you :)

PLEASE NOTE THIS IMPORTANT POINT "DON'T BELIEVE IN PUBLIC LB" IT'S ONLY 14% of real data that's private!! We should build a model that's less overfittig and still finding the good results."

Your score will be different for different submissions that's because of randomness in gradient boosting! and that's completely normal you must focus on reducing overfitting, gather as much data as possible and ofcourse reduce the number of features as much as possible without sacrificing model validation score and that's exactly what i've done below :)

Thank you!

Owner
Deepak Nandwani
A Machine Learning and Data Science Engineer, my goal is to make a +ve impact on millions of people's daily lives & to be hyper-optimistic about the future.
Deepak Nandwani
Meltano: ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

Meltano is open source, self-hosted, CLI-first, debuggable, and extensible. Pipelines are code, ready to be version c

Meltano 625 Jan 02, 2023
WithPipe is a simple utility for functional piping in Python.

A utility for functional piping in Python that allows you to access any function in any scope as a partial.

Michael Milton 1 Oct 26, 2021
Used for data processing in machine learning, and help us to construct ML model more easily from scratch

Used for data processing in machine learning, and help us to construct ML model more easily from scratch. Can be used in linear model, logistic regression model, and decision tree.

ShawnWang 0 Jul 05, 2022
Data Analytics on Genomes and Genetics

Data Analytics performed on On genomes and Genetics dataset to predict genetic disorder and disorder subclass. DONE by TEAM SIGMA!

1 Jan 12, 2022
Generates a simple report about the current Covid-19 cases and deaths in Malaysia

Generates a simple report about the current Covid-19 cases and deaths in Malaysia. Results are delay one day, data provided by the Ministry of Health Malaysia Covid-19 public data.

Yap Khai Chuen 7 Dec 15, 2022
Extract Thailand COVID-19 Cluster data from daily briefing pdf.

Thailand COVID-19 Cluster Data Extraction About Extract Clusters from Thailand Daily COVID-19 briefing PDF Download latest data Here. Data will be upd

Noppakorn Jiravaranun 5 Sep 27, 2021
Template for a Dataflow Flex Template in Python

Dataflow Flex Template in Python This repository contains a template for a Dataflow Flex Template written in Python that can easily be used to build D

STOIX 5 Apr 28, 2022
Statistical package in Python based on Pandas

Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. F

Raphael Vallat 1.2k Dec 31, 2022
Stochastic Gradient Trees implementation in Python

Stochastic Gradient Trees - Python Stochastic Gradient Trees1 by Henry Gouk, Bernhard Pfahringer, and Eibe Frank implementation in Python. Based on th

John Koumentis 2 Nov 18, 2022
statDistros is a Python library for dealing with various statistical distributions

StatisticalDistributions statDistros statDistros is a Python library for dealing with various statistical distributions. Now it provides various stati

1 Oct 03, 2021
PyStan, a Python interface to Stan, a platform for statistical modeling. Documentation: https://pystan.readthedocs.io

PyStan PyStan is a Python interface to Stan, a package for Bayesian inference. Stan® is a state-of-the-art platform for statistical modeling and high-

Stan 229 Dec 29, 2022
.npy, .npz, .mtx converter.

npy-converter Matrix Data Converter. Expand matrix for multi-thread, multi-process Divid matrix for multi-thread, multi-process Support: .mtx, .npy, .

taka 1 Feb 07, 2022
Calculate multilateral price indices in Python (with Pandas and PySpark).

IndexNumCalc Calculate multilateral price indices using the GEKS-T (CCDI), Time Product Dummy (TPD), Time Dummy Hedonic (TDH), Geary-Khamis (GK) metho

Dr. Usman Kayani 3 Apr 27, 2022
A crude Hy handle on Pandas library

Quickstart Hyenas is a curde Hy handle written on top of Pandas API to allow for more elegant access to data-scientist's powerhouse that is Pandas. In

Peter Výboch 4 Sep 05, 2022
A python package which can be pip installed to perform statistics and visualize binomial and gaussian distributions of the dataset

GBiStat package A python package to assist programmers with data analysis. This package could be used to plot : Binomial Distribution of the dataset p

Rishikesh S 4 Oct 17, 2022
Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

898 Jan 09, 2023
Business Intelligence (BI) in Python, OLAP

Open Mining Business Intelligence (BI) Application Server written in Python Requirements Python 2.7 (Backend) Lua 5.2 or LuaJIT 5.1 (OML backend) Mong

Open Mining 1.2k Dec 27, 2022
Data Competition: automated systems that can detect whether people are not wearing masks or are wearing masks incorrectly

Table of contents Introduction Dataset Model & Metrics How to Run Quickstart Install Training Evaluation Detection DATA COMPETITION The COVID-19 pande

Thanh Dat Vu 1 Feb 27, 2022
Python Practicum - prepare for your Data Science interview or get a refresher.

Python-Practicum Python Practicum - prepare for your Data Science interview or get a refresher. Data Data visualization using data on births from the

Jovan Trajceski 1 Jul 27, 2021
Tools for analyzing data collected with a custom unity-based VR for insects.

unityvr Tools for analyzing data collected with a custom unity-based VR for insects. Organization: The unityvr package contains the following submodul

Hannah Haberkern 1 Dec 14, 2022