This project aims to predict whether a real estate property in France was sold for more than €350,000 using property characteristics, geographic information, and socio-economic municipal data sourced from Open Data.
The objective is to build a binary classification model that determines whether a property's sale price exceeds the €350,000 threshold. The project emphasizes feature enrichment from public datasets and preprocessing steps such as spatial joins, KNN imputation, and engineered ratios to improve model performance.
- DVF (Demandes de Valeurs Foncières): real estate transaction data (2021–2022)
- INSEE: income and demographic data per municipality
- IGN GeoJSON: municipal and departmental geometries
- Densité INSEE: rural/urban classification
- Custom dataset: geolocated real estate properties and the target variable (`sup350k`)
All data is merged using geographic coordinates and INSEE municipal codes.
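As a rough illustration of how a geolocated property can be attached to a municipality, the sketch below runs a point-in-polygon test and falls back to the nearest municipality when no polygon contains the point. It is a dependency-free stand-in, not the notebooks' code: the project lists geopandas, where `sjoin` and `sjoin_nearest` would normally do this work. The function names, the toy square geometries, and the codes `"00001"` and `"00002"` are all hypothetical, and distances are computed on raw planar coordinates, which is only reasonable in a projected coordinate system.

```python
from math import hypot


def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def vertex_mean(ring):
    """Average of the vertices; a crude approximation of the polygon centroid."""
    xs, ys = zip(*ring)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def assign_municipality(x, y, municipalities):
    """Return (code, method) where method is 'within' or 'nearest'.

    municipalities maps a (hypothetical) municipal code to a ring of vertices.
    """
    for code, ring in municipalities.items():
        if point_in_polygon(x, y, ring):
            return code, "within"

    # Fallback for points outside every polygon (e.g. just off the coast).
    def distance_to(code):
        cx, cy = vertex_mean(municipalities[code])
        return hypot(x - cx, y - cy)

    return min(municipalities, key=distance_to), "nearest"


# Toy geometries: two adjacent unit squares with made-up codes.
municipalities = {
    "00001": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "00002": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

print(assign_municipality(0.5, 0.5, municipalities))  # ('00001', 'within')
print(assign_municipality(2.3, 0.5, municipalities))  # ('00002', 'nearest')
```

Returning the method alongside the code makes it easy to count how many properties needed the fallback and to inspect them separately.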
- Geospatial data integration: Assigns each property to a municipality using spatial joins, with a nearest-neighbor correction for coastal properties and properties the join leaves unmatched.
- Municipality-level feature engineering: Median income, population density, change in price per square meter over time, etc.
- Missing data imputation: KNN-based imputation for numerical features.
- Custom ratios: Surface per room, surface per bathroom, household ratios, etc.
- Modeling: Random Forest, CatBoost, XGBoost, and Stacking Classifiers.
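To make the KNN-based imputation step more concrete, here is a minimal pure-Python sketch. It fills each missing value with the mean of that column over the `k` closest rows, measuring distance only on columns both rows have observed. This loosely follows the idea behind scikit-learn's `KNNImputer`, which is presumably what the notebooks rely on, but it is an illustration under that assumption rather than the project's implementation. The helper names and the toy `(surface, rooms)` rows are hypothetical.

```python
from math import sqrt


def nan_euclidean(a, b):
    """Euclidean distance over columns observed in both rows.

    The sum is rescaled by (total columns / usable columns) so that rows with
    few shared columns are not artificially close.
    """
    pairs = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    if not pairs:
        return float("inf")
    scale = len(a) / len(pairs)
    return sqrt(scale * sum((u - v) ** 2 for u, v in pairs))


def knn_impute(rows, k=2):
    """Return a copy of rows where each None is replaced by a neighbor mean."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that actually have a value in column j.
            donors = sorted(
                (nan_euclidean(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d != float("inf")]
            if nearest:  # leave the gap if no usable neighbor exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy rows: (surface in m², number of rooms); the last row is missing rooms.
rows = [
    [50.0, 2.0],
    [52.0, 2.0],
    [120.0, 5.0],
    [51.0, None],
]
print(knn_impute(rows, k=2)[3])  # [51.0, 2.0]
```

Because the distance mixes columns with different units, features are usually standardized before KNN imputation so that a large-valued column such as surface does not dominate.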
| Model | ROC-AUC Score |
|---|---|
| CatBoost | ~0.96 |
| Random Forest | ~0.975 |
| Stacking Ensemble | 0.976 |
| XGBoost | 0.98 |
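For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen property above €350,000 receives a higher predicted score than a randomly chosen property below it. The sketch below computes it directly from that definition on made-up labels and scores. It is an illustrative O(n²) version; in practice `sklearn.metrics.roc_auc_score` would be the usual choice, and nothing here reproduces the scores in the table.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 means the property sold above the threshold.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric is insensitive to the classification threshold chosen afterwards.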
- Property size compared to the average size in the area (added)
- Average price per square meter in the municipality (added)
- Average real estate transaction price in the municipality (added)
- Total surface area of the property
- Energy performance score of the property
- Average surface per bathroom (computed)
- Number of bedrooms
- Median income of households in the municipality (added)
- Total number of rooms in the property
- Proportion of apartments in the municipality (added)
These top features highlight the importance of local real estate trends, socio-economic context, and intrinsic property characteristics when predicting whether a property exceeds €350,000 in value.
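Several of the features above are simple ratios, so a short sketch may help show how they could be derived. The example computes a property's surface relative to the mean surface in its municipality, plus surface per room and per bathroom, guarding against division by zero. The field names (`code_insee`, `surface`, `rooms`, `bathrooms`) and the choice of the municipality as "the area" are assumptions made for illustration, not names taken from the notebooks.

```python
from collections import defaultdict


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is 0 or missing."""
    if not denominator:
        return None
    return numerator / denominator


def add_ratio_features(properties):
    """Return new dicts enriched with relative-size and per-room ratios."""
    # First pass: accumulate total surface and count per municipality.
    totals = defaultdict(lambda: [0.0, 0])
    for p in properties:
        totals[p["code_insee"]][0] += p["surface"]
        totals[p["code_insee"]][1] += 1

    # Second pass: derive the ratios for each property.
    enriched = []
    for p in properties:
        total, count = totals[p["code_insee"]]
        row = dict(p)
        row["surface_vs_area_mean"] = safe_ratio(p["surface"], total / count)
        row["surface_per_room"] = safe_ratio(p["surface"], p["rooms"])
        row["surface_per_bathroom"] = safe_ratio(p["surface"], p["bathrooms"])
        enriched.append(row)
    return enriched


# Toy records with a made-up municipal code.
properties = [
    {"code_insee": "00001", "surface": 50.0, "rooms": 2, "bathrooms": 1},
    {"code_insee": "00001", "surface": 150.0, "rooms": 5, "bathrooms": 0},
]
for row in add_ratio_features(properties):
    print(row["surface_vs_area_mean"], row["surface_per_room"], row["surface_per_bathroom"])
```

Returning `None` for undefined ratios keeps them visible as missing values, so they can be handled by the same imputation step as other gaps instead of silently becoming zero or infinity.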
- Choropleth map of property prices per department
- Heatmap of high-value properties
- Filterable map (Folium) of listings by property type
- Correlation matrix and feature importance bar plots
- Python (pandas, geopandas, scikit-learn, xgboost, catboost, seaborn, matplotlib, folium)
- Jupyter Notebooks for exploration and modeling
- Open Data sources from INSEE, Etalab, IGN, etc.
- `communes.ipynb`: Data collection and feature engineering using municipal-level data
- `main.ipynb`: Model training, evaluation, and visualization