<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ML | Nalin Gadihoke</title><link>https://www.nalingadihoke.com/category/ml/</link><atom:link href="https://www.nalingadihoke.com/category/ml/index.xml" rel="self" type="application/rss+xml"/><description>ML</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>© Nalin Gadihoke, 2020</copyright><lastBuildDate>Tue, 23 Mar 2021 00:00:00 +0000</lastBuildDate><image><url>https://www.nalingadihoke.com/images/icon_huf9971291de093faa6aa59cd65f433195_5940_512x512_fill_lanczos_center_3.png</url><title>ML</title><link>https://www.nalingadihoke.com/category/ml/</link></image><item><title>Case Studies - Stout</title><link>https://www.nalingadihoke.com/post/case-studies-stout/</link><pubDate>Tue, 23 Mar 2021 00:00:00 +0000</pubDate><guid>https://www.nalingadihoke.com/post/case-studies-stout/</guid><description>&lt;p>Welcome. This is the webpage containing my answers to the case studies.&lt;/p>
&lt;p>Feel free to check out my website for a more detailed profile!&lt;/p>
&lt;h2 id="customer-orders">Customer Orders&lt;/h2>
&lt;p>Below is a table containing the desired metrics (scrollable).&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:right">Current Year&lt;/th>
&lt;th style="text-align:right">Previous Year&lt;/th>
&lt;th style="text-align:right">Total Customers Current Year&lt;/th>
&lt;th style="text-align:right">Total Customers Previous Year&lt;/th>
&lt;th style="text-align:right">New Customers&lt;/th>
&lt;th style="text-align:right">Lost Customers&lt;/th>
&lt;th style="text-align:right">Existing Customers&lt;/th>
&lt;th style="text-align:right">Existing Customer Revenue Current Year&lt;/th>
&lt;th style="text-align:right">Existing Customer Revenue Previous Year&lt;/th>
&lt;th style="text-align:right">Existing Customer Revenue Growth&lt;/th>
&lt;th style="text-align:right">Revenue Lost From Attrition&lt;/th>
&lt;th style="text-align:right">Total Revenue Current Year&lt;/th>
&lt;th style="text-align:right">New Customer Revenue&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:right">2015&lt;/td>
&lt;td style="text-align:right">2014&lt;/td>
&lt;td style="text-align:right">231,294&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">231,294&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">0&lt;/td>
&lt;td style="text-align:right">29,036,749.19&lt;/td>
&lt;td style="text-align:right">29,036,749.19&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right">2016&lt;/td>
&lt;td style="text-align:right">2015&lt;/td>
&lt;td style="text-align:right">204,646&lt;/td>
&lt;td style="text-align:right">231,294&lt;/td>
&lt;td style="text-align:right">136,891&lt;/td>
&lt;td style="text-align:right">163,539&lt;/td>
&lt;td style="text-align:right">67,755&lt;/td>
&lt;td style="text-align:right">8,524,576.69&lt;/td>
&lt;td style="text-align:right">8,485,533.04&lt;/td>
&lt;td style="text-align:right">39,043.7&lt;/td>
&lt;td style="text-align:right">20,551,216.15&lt;/td>
&lt;td style="text-align:right">25,730,943.59&lt;/td>
&lt;td style="text-align:right">17,206,366.90&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:right">2017&lt;/td>
&lt;td style="text-align:right">2016&lt;/td>
&lt;td style="text-align:right">249,987&lt;/td>
&lt;td style="text-align:right">204,646&lt;/td>
&lt;td style="text-align:right">173,449&lt;/td>
&lt;td style="text-align:right">128,108&lt;/td>
&lt;td style="text-align:right">76,538&lt;/td>
&lt;td style="text-align:right">9,648,282.02&lt;/td>
&lt;td style="text-align:right">9,584,424.96&lt;/td>
&lt;td style="text-align:right">63,857.1&lt;/td>
&lt;td style="text-align:right">16,146,518.63&lt;/td>
&lt;td style="text-align:right">31,417,495.03&lt;/td>
&lt;td style="text-align:right">21,769,213.01&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;code>Note&lt;/code>:&lt;/p>
&lt;p>For the current year of 2015, existing customers and their revenue from the previous year were assumed to be zero. Additionally, revenue lost from attrition was interpreted as the total previous-year revenue from lost customers (those no longer present in the current year).&lt;/p>
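&lt;p>A sketch of how these metrics could be computed with pandas is below. The column names (&lt;code>customer_email&lt;/code>, &lt;code>year&lt;/code>, &lt;code>net_revenue&lt;/code>) are assumptions about the order-level data, not necessarily the exact schema used.&lt;/p>
&lt;pre>&lt;code class="language-python">import pandas as pd

# Hypothetical schema: one row per order, with columns
# 'customer_email', 'year', and 'net_revenue'.
def yearly_metrics(df, current, previous):
    cur = df[df['year'] == current]
    prev = df[df['year'] == previous]
    cur_customers = set(cur['customer_email'])
    prev_customers = set(prev['customer_email'])
    new = cur_customers - prev_customers
    lost = prev_customers - cur_customers
    existing = cur_customers.intersection(prev_customers)
    return {
        'new_customers': len(new),
        'lost_customers': len(lost),
        'existing_customers': len(existing),
        'existing_revenue_current': cur[cur['customer_email'].isin(existing)]['net_revenue'].sum(),
        'existing_revenue_previous': prev[prev['customer_email'].isin(existing)]['net_revenue'].sum(),
        'revenue_lost_from_attrition': prev[prev['customer_email'].isin(lost)]['net_revenue'].sum(),
        'new_customer_revenue': cur[cur['customer_email'].isin(new)]['net_revenue'].sum(),
        'total_revenue_current': cur['net_revenue'].sum(),
    }
&lt;/code>&lt;/pre>
&lt;p>Existing-customer revenue growth then follows as the difference of the two existing-revenue figures.&lt;/p>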
&lt;h2 id="fraud-detection">Fraud Detection&lt;/h2>
&lt;p>The dataset has about &lt;code>6.3 million rows&lt;/code> and &lt;code>11 columns&lt;/code>, where each row represents a transaction. Column descriptions are available on &lt;a href="https://www.kaggle.com/ntnu-testimon/paysim1" target="_blank" rel="noopener">Kaggle&lt;/a>. There are no nulls in the table. Among transaction types, cash withdrawals and payments are the most common, while transfers and debits make up a smaller fraction of the data.&lt;/p>
&lt;figure >
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/case-studies-stout/plot1_huf1603f3d953d742a42b75553ae96b7e0_52234_2000x2000_fit_lanczos_3.png" >
&lt;img data-src="https://www.nalingadihoke.com/post/case-studies-stout/plot1_huf1603f3d953d742a42b75553ae96b7e0_52234_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="2313" height="1385">
&lt;/a>
&lt;/figure>
&lt;p>Here it can be seen that the oldbalance* and newbalance* columns are highly correlated with each other: transaction amounts are typically a small proportion of account balances, so the balance before and after a transaction remain close to each other.&lt;/p>
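&lt;p>This relationship can be reproduced with &lt;code>pandas.DataFrame.corr&lt;/code>; a minimal sketch on toy data, with column names borrowed from the Kaggle schema:&lt;/p>
&lt;pre>&lt;code class="language-python">import pandas as pd

# Toy rows mimicking the pattern: transaction amounts are small
# relative to the originating account balances.
df = pd.DataFrame({
    'amount':         [100.0, 50.0, 200.0, 10.0],
    'oldbalanceOrg':  [1000.0, 5000.0, 2000.0, 300.0],
    'newbalanceOrig': [900.0, 4950.0, 1800.0, 290.0],
})
corr = df.corr()
# The before/after balance columns come out almost perfectly correlated.
print(corr.loc['oldbalanceOrg', 'newbalanceOrig'])
&lt;/code>&lt;/pre>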
&lt;figure >
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/case-studies-stout/plot2_hu8f029084c97eec4af5e9bf2e13b2bb5c_22391_2000x2000_fit_lanczos_3.png" >
&lt;img data-src="https://www.nalingadihoke.com/post/case-studies-stout/plot2_hu8f029084c97eec4af5e9bf2e13b2bb5c_22391_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="425" height="337">
&lt;/a>
&lt;/figure>
&lt;p>The dataset is &lt;code>highly imbalanced&lt;/code>: over 99.8% of the transaction records are non-fraudulent. Because only a tiny fraction of the dataset represents fraud, fraudulent transactions are likely to be under-represented in the models. To counter this, the samples will be appropriately reweighted during training.&lt;/p>
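&lt;p>One concrete way to handle this in scikit-learn is class weighting (resampling is another option); a small sketch of computing balanced weights:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels mimicking the imbalance: 2 fraudulent out of 1000.
y = np.array([0] * 998 + [1] * 2)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
# The rare fraud class receives a proportionally larger weight, which can
# be passed to a classifier via class_weight or sample_weight.
print(dict(zip([0, 1], weights)))
&lt;/code>&lt;/pre>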
&lt;figure >
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/case-studies-stout/plot3_hu2d46963ffc23a9d8666f87b2c3e55344_15825_2000x2000_fit_lanczos_3.png" >
&lt;img data-src="https://www.nalingadihoke.com/post/case-studies-stout/plot3_hu2d46963ffc23a9d8666f87b2c3e55344_15825_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="340" height="354">
&lt;/a>
&lt;/figure>
&lt;p>Below, the plot of amount against oldbalanceOrig can be seen. The amounts are skewed right: the vast majority of transactions involve low amounts.&lt;/p>
&lt;figure >
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/case-studies-stout/plot4_hua1a543fb7a563281e595c981a587f7fc_14018_2000x2000_fit_lanczos_3.png" >
&lt;img data-src="https://www.nalingadihoke.com/post/case-studies-stout/plot4_hua1a543fb7a563281e595c981a587f7fc_14018_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="378" height="279">
&lt;/a>
&lt;/figure>
&lt;p>Lastly, there appears to be a weak positive relationship between amounts and destination account balances, which was observed earlier in the correlation matrix.&lt;/p>
&lt;figure >
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/case-studies-stout/plot5_hufe73711c3db4b038c1fb815b342e0fff_33984_2000x2000_fit_lanczos_3.png" >
&lt;img data-src="https://www.nalingadihoke.com/post/case-studies-stout/plot5_hufe73711c3db4b038c1fb815b342e0fff_33984_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="378" height="279">
&lt;/a>
&lt;/figure>
&lt;h3 id="logistic-regression-approach">Logistic Regression approach&lt;/h3>
&lt;p>For this analysis, only rows of the TRANSFER and CASH_OUT transaction types were retained. There was a 30/70 split between the test and train sets. The &lt;em>y_score&lt;/em> was found to be &lt;code>0.9197251695947122&lt;/code> and the &lt;em>precision score&lt;/em> was &lt;code>0.9740788254011161&lt;/code>.&lt;/p>
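&lt;p>A minimal sketch of this setup on synthetic data (the feature construction here is illustrative, not the exact preprocessing used in the analysis):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the filtered TRANSFER/CASH_OUT rows; in the real
# data the features would include amount and the balance columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.5).astype(int)  # rare positive class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 30/70 test/train split, as above
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)

auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
prec = precision_score(y_test, clf.predict(X_test))
&lt;/code>&lt;/pre>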
&lt;p>The closer the ROC curve is to the upper-left corner, the better the classifier: more positive samples are correctly predicted as positive, and fewer negative samples are incorrectly predicted as positive.&lt;/p>
&lt;figure >
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/case-studies-stout/plot6_hu3e388cde549fd6407928464100c7ef01_25366_2000x2000_fit_lanczos_3.png" >
&lt;img data-src="https://www.nalingadihoke.com/post/case-studies-stout/plot6_hu3e388cde549fd6407928464100c7ef01_25366_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="578" height="455">
&lt;/a>
&lt;/figure>
&lt;h3 id="xgboost-gradient-boosting-approach">XGBoost (Gradient Boosting) approach&lt;/h3>
&lt;p>The XGBoost classifier was trained on a 20/80 train/test split of the data. Further, error columns for the origin and destination balances were added to normalize values. The &lt;em>accuracy&lt;/em> was found to be &lt;code>0.9999530726256983&lt;/code> and the &lt;em>F1 macro score&lt;/em> was &lt;code>0.9937862376528303&lt;/code>. Shown below is a decision tree from the resulting ensemble.&lt;/p>
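&lt;p>The error columns referred to above are likely balance-consistency features; a hypothesized sketch of their construction (the exact formulas used in the analysis may differ):&lt;/p>
&lt;pre>&lt;code class="language-python">import pandas as pd

# For a legitimate transfer, newbalanceOrig + amount should equal
# oldbalanceOrg, and oldbalanceDest + amount should equal newbalanceDest;
# a nonzero 'error' flags rows where the recorded balances do not add up.
df = pd.DataFrame({
    'amount':         [100.0, 50.0],
    'oldbalanceOrg':  [500.0, 0.0],
    'newbalanceOrig': [400.0, 0.0],
    'oldbalanceDest': [0.0, 0.0],
    'newbalanceDest': [100.0, 0.0],
})
df['errorOrig'] = df['newbalanceOrig'] + df['amount'] - df['oldbalanceOrg']
df['errorDest'] = df['oldbalanceDest'] + df['amount'] - df['newbalanceDest']
&lt;/code>&lt;/pre>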
&lt;figure >
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/case-studies-stout/plot7_hu823d10ffaa2dcf0b4173d0ee6cd08dc4_744078_2000x2000_fit_lanczos_3.png" >
&lt;img data-src="https://www.nalingadihoke.com/post/case-studies-stout/plot7_hu823d10ffaa2dcf0b4173d0ee6cd08dc4_744078_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="15000" height="10000">
&lt;/a>
&lt;/figure></description></item><item><title>Pythonic applications of Linear Algebra</title><link>https://www.nalingadihoke.com/post/linalg/</link><pubDate>Mon, 30 Nov 2020 00:00:00 +0000</pubDate><guid>https://www.nalingadihoke.com/post/linalg/</guid><description>&lt;p>As the title suggests, this project saw me extend some of my linear algebra knowledge with inspiration from a &lt;a href="https://faculty.math.illinois.edu/~phierony/math415-2020.html" target="_blank" rel="noopener">course&lt;/a> I took in 2020. Here, simple face recognition is demonstrated, a timeseries is analyzed and other interesting applications are discussed.&lt;/p>
&lt;p>Principal Component Analysis (&lt;a href="https://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank" rel="noopener">PCA&lt;/a>) is a way of capturing most of the variance in the data (in an orthogonal basis). It reduces dimensions and transforms the data linearly into new properties that are not correlated. Singular Value Decomposition (&lt;a href="https://jonathan-hui.medium.com/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491" target="_blank" rel="noopener">SVD&lt;/a>) will be utilized to diagonalize the matrices from which the basis vectors will be truncated to give us our principal components.&lt;/p>
&lt;h1 id="equations">Equations&lt;/h1>
&lt;p>For an $m \times n$ matrix $X$, the principal components are defined as the eigenvectors of the dataset’s covariance matrix. Assuming $\hat{X}$ is the dataset centered at the origin, $\hat{X}^{T}\hat{X}$ is proportional to the covariance matrix, which means finding the eigenvectors of $\hat{X}^{T}\hat{X}$ is enough. These are the columns of $V$ in the reduced SVD of $\hat{X}$,&lt;/p>
&lt;p>$$\hat{X} = U\Sigma V^{T}$$&lt;/p>
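&lt;p>This relationship is easy to check numerically: the right singular vectors of the centered data coincide (up to sign) with the eigenvectors of $\hat{X}^{T}\hat{X}$. A small NumPy sketch:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
# 100 samples in 3 dimensions with distinct variances along each axis
X = rng.normal(size=(100, 3)) * np.array([2.0, 1.0, 0.1])
X_hat = X - X.mean(axis=0)  # zero-center the data

U, S, Vt = np.linalg.svd(X_hat, full_matrices=False)
V = Vt.T  # columns are the principal component directions

# Eigendecomposition of X_hat^T X_hat gives the same directions;
# eigh returns eigenvalues in ascending order, so reverse to compare.
evals, evecs = np.linalg.eigh(X_hat.T @ X_hat)
&lt;/code>&lt;/pre>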
&lt;h1 id="a-namefaceaface-recognition">&lt;a name="face">&lt;/a>Face Recognition&lt;/h1>
&lt;p>For this analysis I used AT&amp;amp;T Laboratories Cambridge&amp;rsquo;s “&lt;a href="https://web.archive.org/web/20180802044943/http:/www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html" target="_blank" rel="noopener">Database of Faces&lt;/a>”, a set of grayscale face images (like in the image above) normalized to the same resolution. Each image is a flattened row of a larger ‘faces’ matrix. First, to zero-center the faces matrix, an average face is computed and subtracted from each row of the matrix. The ‘average’ face is shown below.&lt;/p>
&lt;figure >
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/linalg/combined_face_hu857b26d4a62b18915171cd4a12d2a09d_34142_2000x2000_fit_lanczos_3.png" >
&lt;img data-src="https://www.nalingadihoke.com/post/linalg/combined_face_hu857b26d4a62b18915171cd4a12d2a09d_34142_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="353" height="413">
&lt;/a>
&lt;/figure>
&lt;p>Using &lt;code>SVD&lt;/code>, we can calculate the eigenbasis of the desired covariance matrix.&lt;/p>
&lt;pre>&lt;code class="language-sh">U, S, Vt = la.svd(faces_zero_centered, full_matrices=False)
V = Vt.T
&lt;/code>&lt;/pre>
&lt;p>As a side note, the principal components of any one of the images in the training set can be captured by converting it into the eigenface basis using the matrix $V$ found above.&lt;/p>
&lt;figure >
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/linalg/not_trained_hu56e2c3c43ec906b7851cb15fdbcaeef8_124232_2000x2000_fit_lanczos_3.png" >
&lt;img data-src="https://www.nalingadihoke.com/post/linalg/not_trained_hu56e2c3c43ec906b7851cb15fdbcaeef8_124232_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="1412" height="826">
&lt;/a>
&lt;/figure>
&lt;p>Now, with an unknown face (left) that the model is not trained on, first we will subtract the average face from it. The resulting image (right) will then be converted to the eigenface basis by a simple &lt;a href="https://textbooks.math.gatech.edu/ila/linear-transformations.html#:~:text=A%20linear%20transformation%20is%20a,n%20and%20all%20scalars%20c%20." target="_blank" rel="noopener">linear transformation&lt;/a>.&lt;/p>
&lt;pre>&lt;code class="language-sh">unknown_zero_centered = face_unknown - face_avg
unknown_basis = unknown_zero_centered @ V
faces_basis = faces_zero_centered @ V
&lt;/code>&lt;/pre>
&lt;p>To match an existing image, the &lt;code>“closest”&lt;/code> face in the eigenface basis is selected as the one at the smallest Euclidean distance from the unknown face.&lt;/p>
&lt;pre>&lt;code class="language-sh">n = 0
differences= la.norm(faces_basis - unknown_basis, axis=1)
n = np.argmin(differences)
plt.imshow(faces[n].reshape(face_shape), cmap=&amp;quot;gray&amp;quot;)
&lt;/code>&lt;/pre>
&lt;figure id="figure-prediction">
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/linalg/prediction_hu445b35a498ac64e373d08042fbf7af03_53133_2000x2000_fit_lanczos_3.png" data-caption="prediction">
&lt;img data-src="https://www.nalingadihoke.com/post/linalg/prediction_hu445b35a498ac64e373d08042fbf7af03_53133_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="353" height="413">
&lt;/a>
&lt;figcaption>
prediction
&lt;/figcaption>
&lt;/figure>
&lt;p>The above demonstration was a simple implementation of PCA on a data set of images where it is assumed each face fills a similar area in the image. Check out this &lt;a href="https://pythonmachinelearning.pro/face-recognition-with-eigenfaces/" target="_blank" rel="noopener">link&lt;/a> for a more complex implementation involving a trained neural net and scikit-learn.&lt;/p>
&lt;h1 id="short-example-of-timeseries">Short Example of Timeseries&lt;/h1>
&lt;figure >
&lt;a data-fancybox="" href="https://www.nalingadihoke.com/post/linalg/time_combined_huabc0c6f2de722beb42ed8042a784a60a_5753847_2000x2000_fit_lanczos_3.png" >
&lt;img data-src="https://www.nalingadihoke.com/post/linalg/time_combined_huabc0c6f2de722beb42ed8042a784a60a_5753847_2000x2000_fit_lanczos_3.png" class="lazyload" alt="" width="4688" height="4024">
&lt;/a>
&lt;/figure>
&lt;p>&lt;code>PCA&lt;/code> can be used to decompose timeseries too. In the image above, the temperature data for six US cities is plotted (top left). Next, the average is subtracted to zero-center the data, as in the steps above (top right). Finally, the data is broken into its top two principal components (bottom).&lt;/p>
&lt;pre>&lt;code class="language-sh"># zero center the data
temp_avg = np.mean(temperature,axis=0)
temp_zero_center = temperature - temp_avg
# SVD breakdown
U,S, Vt = la.svd(temp_noavg)
V = Vt.T
# plotting the first two eigenvectors
plt.figure(figsize=(20,10))
lines = plt.plot((V[:,:2] ), '-', )
plt.legend(iter(lines), map(lambda x: f&amp;quot;PC {x}&amp;quot;, range(1,6)))
&lt;/code>&lt;/pre>
&lt;p>Since average temperature dips in the winter and peaks in the summer, the first component represents climates that remain relatively static year-round.&lt;/p>
&lt;p>Further Reading:&lt;/p>
&lt;ol>
&lt;li>Markov matrices were one of the coolest takeaways from linear algebra. &lt;a href="https://towardsdatascience.com/brief-introduction-to-markov-chains-2c8cab9c98ab" target="_blank" rel="noopener">This&lt;/a> article breaks down the concept and its applications in data science.&lt;/li>
&lt;li>Briefly mentioned in the above article, Google &lt;a href="https://en.wikipedia.org/wiki/PageRank" target="_blank" rel="noopener">PageRank&lt;/a> utilizes a special kind of square matrix called the &lt;a href="https://en.wikipedia.org/wiki/Google_matrix" target="_blank" rel="noopener">Google Matrix&lt;/a>.&lt;/li>
&lt;li>Sandeep Khurana explains Linear Regression quite eloquently &lt;a href="https://towardsdatascience.com/linear-regression-with-example-8daf6205bd49" target="_blank" rel="noopener">here&lt;/a>.&lt;/li>
&lt;/ol></description></item></channel></rss>