pszemraj committed
Commit fb298b1
1 Parent(s): ee1d00a

Update README.md

Files changed (1)
  1. README.md +2 -43
README.md CHANGED
@@ -58,49 +58,8 @@ widget:
  \ and parameters 0, and generalization is influenced by the inductive bias of\
  \ this function space (Section 5)."
  example_title: scientific paper
- - text: ' the big variety of data coming from diverse sources is one of the key properties
- of the big data phenomenon. It is, therefore, beneficial to understand how data
- is generated in various environments and scenarios, before looking at what should
- be done with this data and how to design the best possible architecture to accomplish
- this The evolution of IT architectures, described in Chapter 2, means that the
- data is no longer processed by a few big monolith systems, but rather by a group
- of services In parallel to the processing layer, the underlying data storage has
- also changed and became more distributed This, in turn, required a significant
- paradigm shift as the traditional approach to transactions (ACID) could no longer
- be supported. On top of this, cloud computing is becoming a major approach with
- the benefits of reducing costs and providing on-demand scalability but at the
- same time introducing concerns about privacy, data ownership, etc In the meantime
- the Internet continues its exponential growth: Every day both structured and unstructured
- data is published and available for processing: To achieve competitive advantage
- companies have to relate their corporate resources to external services, e.g.
- financial markets, weather forecasts, social media, etc While several of the sites
- provide some sort of API to access the data in a more orderly fashion; countless
- sources require advanced web mining and Natural Language Processing (NLP) processing
- techniques: Advances in science push researchers to construct new instruments
- for observing the universe O conducting experiments to understand even better
- the laws of physics and other domains. Every year humans have at their disposal
- new telescopes, space probes, particle accelerators, etc These instruments generate
- huge streams of data, which need to be stored and analyzed. The constant drive
- for efficiency in the industry motivates the introduction of new automation techniques
- and process optimization: This could not be done without analyzing the precise
- data that describe these processes. As more and more human tasks are automated,
- machines provide rich data sets, which can be analyzed in real-time to drive efficiency
- to new levels. Finally, it is now evident that the growth of the Internet of Things
- is becoming a major source of data. More and more of the devices are equipped
- with significant computational power and can generate a continuous data stream
- from their sensors. In the subsequent sections of this chapter, we will look at
- the domains described above to see what they generate in terms of data sets. We
- will compare the volumes but will also look at what is characteristic and important
- from their respective points of view. 3.1 The Internet is undoubtedly the largest
- database ever created by humans. While several well described; cleaned, and structured
- data sets have been made available through this medium, most of the resources
- are of an ambiguous, unstructured, incomplete or even erroneous nature. Still,
- several examples in the areas such as opinion mining, social media analysis, e-governance,
- etc, clearly show the potential lying in these resources. Those who can successfully
- mine and interpret the Internet data can gain unique insight and competitive advantage
- in their business An important area of data analytics on the edge of corporate
- IT and the Internet is Web Analytics.'
- example_title: data science textbook
+ - text: "Is a else or outside the cob and tree written being of early client rope and you have is for good reasons. On to the ocean in Orange for time. By's the aggregate we can bed it yet. Why this please pick up on a sort is do and also M Getoi's nerocos and do rain become you to let so is his brother is made in use and Mjulia's's the lay major is aging Masastup coin present sea only of Oosii rooms set to you We do er do we easy this private oliiishs lonthen might be okay. Good afternoon everybody. Welcome to this lecture of Computational Statistics. As you can see, I'm not socially my name is Michael Zelinger. I'm one of the task for this class and you might have already seen me in the first lecture where I made a quick appearance. I'm also going to give the tortillas in the last third of this course. So to give you a little bit about me, I'm a old student here with better Bulman and my research centres on casual inference applied to biomedical disasters, so that could be genomics or that could be hospital data. If any of you is interested in writing a bachelor thesis, a semester paper may be mastathesis about this topic feel for reach out to me. you have my name on models and my email address you can find in the directory I'd Be very happy to talk about it. you do not need to be sure about it, we can just have a chat. So with that said, let's get on with the lecture. There's an exciting topic today I'm going to start by sharing some slides with you and later on during the lecture we'll move to the paper. So bear with me for a few seconds. Well, the projector is starting up. Okay, so let's get started. Today's topic is a very important one. It's about a technique which really forms one of the fundamentals of data science, machine learning, and any sort of modern statistics. It's called cross validation. I know you really want to understand this topic I Want you to understand this and frankly, nobody's gonna leave Professor Mineshousen's class without understanding cross validation. So to set the stage for this, I Want to introduce you to the validation problem in computational statistics. So the problem is the following: You trained a model on available data. You fitted your model, but you know the training data you got could always have been different and some data from the environment. Maybe it's a random process. You do not really know what it is, but you know that somebody else who gets a different batch of data from the same environment they would get slightly different training data and you do not care that your method performs as well. On this training data. you want to to perform well on other data that you have not seen other data from the same environment. So in other words, the validation problem is you want to quantify the performance of your model on data that you have not seen. So how is this even possible? How could you possibly measure the performance on data that you do not know The solution to? This is the following realization is that given that you have a bunch of data, you were in charge. You get to control how much that your model sees. It works in the following way: You can hide data firms model. Let's say you have a training data set which is a bunch of doubtless so X eyes are the features those are typically hide and national vector. It's got more than one dimension for sure. And the why why eyes. Those are the labels for supervised learning. As you've seen before, it's the same set up as we have in regression. 
And so you have this training data and now you choose that you only use some of those data to fit your model. You're not going to use everything, you only use some of it the other part you hide from your model. And then you can use this hidden data to do validation from the point of you of your model. This hidden data is complete by unseen. In other words, we solve our problem of validation."
+ example_title: transcribed audio - lecture
  - text: "Transformer-based models have shown to be very useful for many NLP tasks.\
  \ However, a major limitation of transformers-based models is its O(n^2)O(n 2)\
  \ time & memory complexity (where nn is sequence length). Hence, it's computationally\
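
The new "transcribed audio - lecture" widget example walks through hold-out validation: hide part of the data from the model, fit on the rest, and measure performance on the hidden, unseen part. A minimal sketch of that idea, assuming scikit-learn and a synthetic dataset (everything here is illustrative and not part of the model card itself):

```python
# Sketch of the hold-out validation idea from the lecture transcript above:
# hide some data from the model, fit on the rest, score on the hidden part.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # features: multi-dimensional x_i
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)  # labels y_i

# Hold out 25% of the data; the model never sees it during fitting.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("hold-out R^2:", model.score(X_val, y_val))

# Cross-validation repeats the split so every point is held out exactly once.
print("5-fold CV R^2:", cross_val_score(LinearRegression(), X, y, cv=5).mean())
```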