DOR – Data Science Portfolio

April 8, 2025 Retrieval augmented generation

Full project: RAG (Retrieval-Augmented Generation) I

Retrieval-Augmented Generation, known as RAG, harnesses the capabilities of LLMs (Large Language Models) to offer an effective method for accessing external information. LLMs comprehend the inquiry and leverage the contextual details in the given external materials to formulate a response on the subject. This process essentially bridges the gap between the vast general knowledge of an LLM and the specific, often niche, information contained within a user’s own documents. The aim is to create a

Continue reading
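
The retrieve-then-answer idea described in the teaser above can be illustrated with a toy sketch. This is not the project's implementation: it assumes a simple word-count cosine similarity in place of real embeddings, it stops at building the prompt instead of calling an LLM, and the documents and the function names (`retrieve`, `build_prompt`) are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_docs):
    """Put the retrieved passages in front of the question (hypothetical format)."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up "external documents" standing in for a user's own files.
documents = [
    "The warranty for the X200 blender lasts two years.",
    "Office hours are 9:00 to 17:00 on weekdays.",
    "Returns are accepted within 30 days of purchase.",
]
question = "How long is the blender warranty?"

top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)  # this string is what would be sent to the LLM
```

In a real pipeline the word-count vectors would typically be replaced by dense embeddings and a vector store, and the prompt would be passed to the model of your choice.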

February 25, 2022 NLP / Recommender Systems

Using NLP to Create a Recommender System

In the article Using Scikit-Surprise to Create a Simple Recipe Collaborative Filtering Recommender System we developed the simplest recommender system using the scikit-surprise package and saw how to use the built-in algorithms it contains, such as KNN or SVD. I’d like to take my recommender systems practice a step further and attempt to create my own prediction algorithm. Surprise allows you to override its core classes and methods in order to tailor your own algorithm and try to improve

Continue reading

February 8, 2022 Recommender Systems

Using Scikit-Surprise to Create a Simple Recipe Collaborative Filtering Recommender System.

Companies all over the world are increasingly utilizing recommender systems. These algorithms can be used by online stores, streaming services, or social networks to recommend items to users based on their previous behavior (either consumed items or searched items). There are several approaches to developing recommendation systems. We can build a recommender system based on the content of the item so that the system recommends similar items to the ones the user usually likes (Content-Based recommender

Continue reading

October 7, 2021 Time Series

Forecasting Time Series with Auto-Arima

In this article, I compare the results of the auto-arima function with the ARIMA model we developed in the article Forecasting Time Series with ARIMA (https://www.alldatascience.com/time-series/forecasting-time-series-with-arima/). I made this attempt to see how it works and what the differences are. The parameters selected by auto-arima are slightly different from the ones I selected in the other article. Auto-arima has the advantage of attempting to find the best ARIMA parameters by comparing the

Continue reading

September 14, 2021 Time Series

Forecasting time series with ARIMA

In this post, I’ll attempt to show how to forecast time series data using ARIMA (autoregressive integrated moving average). As usual, I try to practice with “real-world” data, which can be obtained easily by downloading open data from government websites. I chose the unemployment rate in the European Union’s 27 member countries. The data were obtained from the OECD data portal (https://dataportal.oecd.org/). First of all, I’m going to try to clean up the data, in this

Continue reading

August 9, 2021 Classification

Comparing Data Augmentation Techniques to Deal with an Unbalanced Dataset (Pollution Levels Predictions)

Predicting NO2 levels in Madrid. While looking for data to develop my data science skills, I came up with the idea of searching open data portals. I wanted to look at actual datasets and find out what they were like. For this purpose, I chose open data from the Madrid Open Data Portal (https://datos.madrid.es/portal/site/egob). I will try to predict NO2 concentration using weather and traffic data. This is not meant to be a definitive prediction

Continue reading

February 18, 2021 Deep Learning

Deep Learning: COVID-19 detection in X-Ray with CNN

In this project we develop a deep learning detector of COVID-19 in radiographs. For this purpose, we use images from the “Covid-chestxray-dataset” [3], generated by researchers from the Mila research group and the University of Montreal [4]. We also use radiographs of healthy patients and of patients with bacterial pneumonia, extracted from Kaggle’s “Chest X-Ray Images (Pneumonia)” competition [5]. In total, we have 426 images, divided into training (339 images), validation (42 images)

Continue reading

February 15, 2021 Visualization

Peace agreements data visualization with Tableau

In this post we show a data visualization made with Tableau using data from the Peace Agreements Database and the World Bank DataBank. The visualization tries to explain how different collectives are affected when they are included in peace agreements worldwide, and shows some examples of how these measures can improve people’s lives in some way.

Continue reading

February 14, 2021 NLP

NLP: Opinion classification

Let’s apply some classification methods to the same Tripadvisor data as in the post https://www.alldatascience.com/nlp/nlp-target-and-aspect-detection-with-python. In this case we are going to read and preprocess the data again, and then vectorize it in different ways: 1. With a TF-IDF vectorizer, which creates vectors that take into account the frequency of words in a document and the frequency of words across all documents, decreasing the weight of words that appear too often (they can be

Continue reading
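
The TF-IDF weighting mentioned in the teaser above can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the post's code: it uses the textbook formula tf × log(N / df), whereas library vectorizers such as scikit-learn's `TfidfVectorizer` apply a smoothed variant plus normalization by default, and the three reviews are invented for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency within the document, scaled down by how
            # common the term is across the whole collection.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up reviews, standing in for the real data.
reviews = [
    "the hotel was clean",
    "the hotel was noisy",
    "the breakfast was great",
]
weights = tfidf(reviews)

for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

Words that occur in every review ("the", "was") receive a weight of zero, while a word unique to one review ("clean") receives the highest weight in that review, which is the down-weighting of overly frequent words that the teaser describes.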

February 12, 2021 Clustering

Aggregation methods in R.

In this post we use three clustering methods (k-means, hierarchical clustering, and model-based clustering) and evaluate their accuracy. We see how to select the optimal number of clusters for each method and obtain metrics to select the best one.

Continue reading