Nathan Boudol

My Projects

In this section, you'll find a selection of my projects. The code for the Market Project is available on GitHub. I am also happy to share the code for any of my other projects.
I started coding in Python 6 years ago, and I have completed a few projects in C (in engineering school and at university). This year, we will be doing a 5-week C++ project. I also took SQL classes, so I know the basics.
As for IDEs, I mostly use PyCharm, Spyder and VS Code.

The Market Project

I developed a market analysis project that uses machine learning to predict stock price movements. I implemented several models, including Random Forest, Logistic Regression, SVM, KNN and Gradient Boosting, and I incorporated backtesting and feature engineering techniques. This project showcases skills in data processing, API integration, web scraping and model evaluation, while highlighting the complexities and limitations of financial forecasting.
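
To illustrate what backtesting means here, the following is a minimal sketch, not the code from the repository. It assumes a list of daily returns and a list of 0/1 signals (1 meaning the model predicted an upward move before that day's return was known). The function name `backtest` and the rule "hold the asset only on predicted-up days" are assumptions made for the example.

```python
def backtest(returns, signals):
    """Equity curve of a long-only strategy driven by 0/1 signals.

    returns[t] is the asset's return on day t; signals[t] is the
    prediction made *before* day t (1 = hold the asset, 0 = stay out).
    Transaction costs are ignored in this sketch.
    """
    equity = 1.0
    curve = []
    for daily_return, signal in zip(returns, signals):
        if signal == 1:
            equity *= 1.0 + daily_return
        curve.append(equity)
    return curve


# Toy data, invented for the example
daily_returns = [0.10, -0.05, 0.02]
model_signals = [1, 0, 1]
strategy = backtest(daily_returns, model_signals)
buy_and_hold = backtest(daily_returns, [1, 1, 1])
print(strategy[-1], buy_and_hold[-1])
```

Comparing the strategy curve with a buy-and-hold curve is a simple way to see whether the predictions add anything beyond holding the asset.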

Here is a link to the GitHub repository: Market Project

Market Project metrics: ROC curve. Explanations of these metrics are available
in the GitHub repository.

The Bicycle Project : Nantes

This project focuses on analyzing data from a .json file, available here. (I'm also preparing a GitHub repository for this one; it will come later. In the meantime, I will be as precise as possible here.)

To summarize, I split this project into 3 steps:

Collecting and pre-processing the data: I loaded it into dataframes and used techniques such as Hermite interpolation to fill in missing values. There is a figure showing the nature of the data.
Analyzing the data: with powerful tools such as TSFresh, we can extract periodic or pseudo-periodic behaviours, along with many other features. The main goal is to cluster bike stations.
Conclusions.
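
As an illustration of the gap-filling step, here is a minimal sketch of cubic Hermite interpolation on an evenly sampled series. It is not the project's code: the names `hermite` and `fill_missing` are hypothetical, and estimating the slopes by finite differences between neighbouring known samples is an assumption of this sketch.

```python
def hermite(x0, y0, m0, x1, y1, m1, x):
    """Evaluate the cubic Hermite interpolant on [x0, x1] at x.

    (x0, y0) and (x1, y1) are the end points, m0 and m1 the slopes there.
    """
    h = x1 - x0
    t = (x - x0) / h
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * y0 + h10 * h * m0 + h01 * y1 + h11 * h * m1


def fill_missing(values):
    """Fill None entries that lie between two known samples.

    Leading and trailing gaps are left as None.
    """
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    # Slope at each known sample: finite difference between its known neighbours
    slopes = []
    for k in range(len(known)):
        lo = known[max(k - 1, 0)]
        hi = known[min(k + 1, len(known) - 1)]
        slopes.append((hi[1] - lo[1]) / (hi[0] - lo[0]) if hi[0] != lo[0] else 0.0)
    filled = list(values)
    for k in range(len(known) - 1):
        (i0, y0), (i1, y1) = known[k], known[k + 1]
        for i in range(i0 + 1, i1):
            filled[i] = hermite(i0, y0, slopes[k], i1, y1, slopes[k + 1], i)
    return filled


# Toy series of available bikes with missing readings (invented values)
print(fill_missing([4, None, 6, 9, None, None, 3]))
```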

I will be brief here, because many conclusions can be drawn from this dataset. There are many ways to quantify traffic at a station.
As a first approach, I considered the variance of the number of available bikes divided by the station's total capacity. I added some criteria that I judged "natural". In addition, I used some TSFresh features in order to obtain an (almost) complete characterisation of the stations.
The main difficulty was normalising the values (i.e. making them unitless) and scaling them so that they can be compared.
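
The first criterion and the scaling step can be sketched as follows. This is only an illustration: the function names are hypothetical, the station data is invented, and min-max scaling is one possible choice, not necessarily the one used in the project.

```python
from statistics import pvariance


def traffic_score(available_bikes, capacity):
    """Variance of the number of available bikes, divided by the capacity."""
    return pvariance(available_bikes) / capacity


def min_max_scale(values):
    """Rescale values to [0, 1] so that different criteria become comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Invented readings: (available bikes over time, station capacity)
stations = {
    "A": ([0, 10, 0, 10], 20),
    "B": ([5, 5, 5, 5], 10),
    "C": ([0, 10, 0, 10], 10),
}
scores = [traffic_score(bikes, cap) for bikes, cap in stations.values()]
print(dict(zip(stations, min_max_scale(scores))))
```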

Then, with methods such as the elbow method, we determine how many clusters to use. Afterwards, I applied the K-means clustering algorithm.
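
The following minimal sketch shows the idea behind the elbow method with a small pure-Python K-means. It is an illustration under assumptions, not the project's code: the points are invented, and the deterministic farthest-point initialisation is chosen here only to make the example reproducible. In practice a library implementation would be the usual choice.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm; returns (labels, centers, inertia)."""
    # Farthest-point initialisation (deterministic, assumption of this sketch)
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))
    return labels, centers, inertia


# Invented 2D features for nine stations forming three groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
for k in range(1, 5):
    print(k, round(kmeans(points, k)[2], 3))
```

The inertia (sum of squared distances to the nearest center) drops sharply up to the right number of clusters and only slowly afterwards; that bend is the "elbow".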

Results of the clustering: visualisation on an interactive map. With the given criteria, here are the numbers of the most frequented stations (i.e. the 3rd cluster).
Information about those stations is available online. As expected, there is a correlation between frequently visited stations and points of interest.

BicycleProject: I computed the barycentre of the 3 clusters to see how the stations are distributed around it.
It is also possible to display every cluster centre, so we can see the distribution around each cluster.

Quick conclusion: More details on GitHub.
The obvious conclusion is to increase the maximum number of bikes at those busy stations.
A finer analysis would consist of optimising station capacities in order to minimise the time a station is empty or full, while keeping the investment in new bikes to a minimum. We could also consider a better distribution of bikes across the city to optimise traffic.

Bachelor's Thesis (Mémoire)

For my Bachelor's degree, I had to write a thesis (mémoire) on a subject of my choice, as long as it involved math or computer science. I chose to work on the Continuum Hypothesis and Dana Scott's proof of its independence.
First, he reformulated the hypothesis using logical symbols within a specific language.
Then, he constructed a mathematical model in which the definition of truth is revised and the hypothesis does not hold, demonstrating its independence.
The proof is long and technical. The fields involved are Logic and Set Theory.

Here is my paper; it is in French: Mémoire.

Project: Summarizing a Text; Graphs, PageRank, TF-IDF... (a GitHub repository is coming for this project as well)

This project processes a document and extracts key sentences using graph-based algorithms. These techniques utilize graph structures to analyze relationships between elements. The document is split into sentences, which are vectorized using TF-IDF to calculate their importance. A cosine similarity matrix is built, creating a graph where each sentence is a node, and edges are weighted by sentence similarity.
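
The vectorization and similarity steps described above can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, tokenization is a plain whitespace split, and the smoothed IDF formula is one common variant chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return (vocabulary, one TF-IDF vector per sentence)."""
    docs = [s.lower().split() for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    # Smoothed IDF (assumption: the project may use a different variant)
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        length = len(d) or 1  # avoid dividing by zero on an empty sentence
        vectors.append([tf[w] / length * idf[w] for w in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if one of them is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors):
    """Weighted adjacency matrix of the sentence graph."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


sentences = ["the cat sat", "the cat ran", "dogs bark loudly"]
_, vectors = tfidf_vectors(sentences)
for row in similarity_matrix(vectors):
    print([round(x, 2) for x in row])
```

Each entry of the matrix is the weight of the edge between two sentences; sentences that share no word get weight 0.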

Two algorithms are used for summarization:

PageRank ranks sentences based on their relevance in the graph.
Heaviest Path finds the most coherent sequence of sentences.
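
For the PageRank step, a minimal power-iteration sketch on a weighted similarity matrix could look like this. The function name, the damping factor of 0.85 and the toy matrix are assumptions made for the illustration; the heaviest-path method is not covered here.

```python
def pagerank(weights, damping=0.85, iterations=100, tol=1e-10):
    """PageRank scores for a graph given as a weighted adjacency matrix."""
    n = len(weights)
    # Turn each row into transition probabilities, ignoring self-loops
    transition = []
    for i in range(n):
        row = [w if j != i else 0.0 for j, w in enumerate(weights[i])]
        total = sum(row)
        transition.append([w / total for w in row] if total else [1.0 / n] * n)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [
            (1 - damping) / n
            + damping * sum(rank[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy similarity matrix: sentence 0 is similar to both others
similarities = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
scores = pagerank(similarities)
best_first = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(best_first)
```

The highest-ranked sentences, put back in document order, would then form the summary.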

The summaries are evaluated using ROUGE scores to compare their quality. Additionally, visualizations such as TF-IDF plots, similarity graphs, and PCA help illustrate sentence importance and structure.
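
As an illustration of the evaluation, here is a minimal sketch of ROUGE-1 (unigram overlap) between a candidate summary and a reference. The function name and the whitespace tokenization are assumptions of this sketch; dedicated packages handle stemming and the other ROUGE variants.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, F1) based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each word counted at most min(cand, ref) times
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


print(rouge_1("the cat sat", "the cat sat on the mat"))
```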

Graph Visualization — Nodes are sentences, edge weights are similarities between them.
Red lines correspond to the heaviest path, starting and ending on high-degree nodes.
I display only edges with weight > 0.1; otherwise the figure is not legible, because the graph is almost complete.

PCA Visualization — PCA: Principal Component Analysis. Here, each unique word is a dimension.
We project onto the two axes with the most significant variance.

TF-IDF Scores — Different TF-IDF scores of the words.
Apart from being useful for the methods, it allows us to instantly identify the main topic(s).

This project can be separated into 3 steps:

Pre-processing the text: We extract sentences, words, and remove common connectors or stop words. The text I used for the example is in the code.
Analyzing the data: We compute scores of words, construct the similarity matrix, the graph, and the PCA.
Extracting sentences for the summary, using both methods.
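
The pre-processing step can be sketched as follows. This is a simplified illustration: the regular expressions, the tiny stop-word list and the function names are assumptions, not the ones used in the project.

```python
import re

# Tiny illustrative stop-word list (a real list would be much longer)
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "is", "it", "on"}


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase words of a sentence, without stop words or punctuation."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


text = "The cat sat on the mat. It is warm! Why?"
for sentence in split_sentences(text):
    print(sentence, tokenize(sentence))
```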

Quick conclusion: More details on GitHub. Text used here.
There are advantages to using these methods. For example, the graph summary will give us - by construction - only relevant sentences. However, it may miss some important information that is isolated in the document.
Moreover, the PageRank-based summarization works really well on texts with clearly stated themes. Nevertheless, it may not capture the narrative flow and can sometimes be redundant.
I can also add that this kind of summary is extractive, in the sense that we do not write new text; we simply pick sentences from the document. A more advanced solution would be to generate a summary from both high-scored sentences and words.

Diffusion, Gradient, Laplacian, PDEs and Other Calculations

Throughout my education and various school or personal projects, I've implemented many mathematical operations and computations to solve problems. Whether it's computing Laplacians, solving differential equations in vector spaces, finding the minimum of multivariable functions, or interpolating functions with polynomials, I have gained basic but essential experience.

Description of GIF — Heat source is located in the top left corner – solving the heat equation.
The base image was generated through graph analysis of another image.
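
A minimal sketch of such a simulation, using an explicit finite-difference scheme with a 5-point Laplacian, is given below. It is not the code behind the GIF: the grid size, the insulated boundaries, the fixed-temperature source and the parameter `alpha` (standing for dt/dx², which should stay at or below 0.25 for stability) are assumptions of the example.

```python
def heat_step(u, alpha=0.2):
    """One explicit time step of the 2D heat equation on a grid.

    Indices are clamped at the edges, which models insulated boundaries.
    """
    n_rows, n_cols = len(u), len(u[0])

    def at(i, j):
        return u[min(max(i, 0), n_rows - 1)][min(max(j, 0), n_cols - 1)]

    return [
        [
            u[i][j]
            + alpha * (at(i - 1, j) + at(i + 1, j) + at(i, j - 1) + at(i, j + 1)
                       - 4 * u[i][j])
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


def simulate(size=6, steps=100, source=100.0, alpha=0.2):
    """Diffuse heat from a source held at a fixed temperature in the top-left corner."""
    u = [[0.0] * size for _ in range(size)]
    for _ in range(steps):
        u[0][0] = source  # re-impose the source before each step
        u = heat_step(u, alpha)
    u[0][0] = source
    return u


for row in simulate():
    print([round(value, 1) for value in row])
```

Saving the grid at regular intervals and rendering each one as an image would give the frames of an animation.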

This Website

I learned the basics of HTML, CSS, and JavaScript to create this personal portfolio website, designed to show my projects and skills. This project demonstrates my ability to learn new technologies, solve problems, and present my work effectively. The site serves as a dynamic resume, highlighting my different projects in data science, mathematics, and computer science while reflecting my commitment to continuous learning and self-improvement.

Below are two interactive graphs that represent classic methods in data science. I'm using a web app hosted on PythonAnywhere to display a Python script on this web page.

Home

Welcome to my website!

My Projects

Book a Lesson

About me:

Education

Scientific Baccalaureate (Engineering Science option, with high honors)

Classes préparatoires

Engineering School

Bachelor's Degree in Mathematics - Computer Science, with honors.

Master of Applied Mathematics

Contact

See you soon