The following content is

provided under a Creative Commons license. Your support will help

MIT OpenCourseWare continue to offer high-quality

educational resources for free. To make a donation or to

view additional materials from hundreds of MIT courses,

visit MIT OpenCourseWare at ocw.mit.edu. GABRIEL SANCHEZ-MARTINEZ:

Any questions on Homework 1 before we get started? AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ:

OK, fire away. AUDIENCE: I guess,

first, do you think we have like this

minimum cycle time, like a theoretical minimum cycle

time and then what was actually [INAUDIBLE] cycle time? GABRIEL SANCHEZ-MARTINEZ: So

cycle time, just to review– it’s the time that

it takes a bus to– from the time

[AUDIO OUT] for a trip. It goes all the way one way,

has to wait at the other end to recover the schedule,

comes back, waits to recover, and is ready to

begin the next round. So that’s a cycle. AUDIENCE: Since you have

[INAUDIBLE] going on, if you had 4.1 buses,

then you use a cycle time. Then obviously,

you can’t do that? [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ: So

you would need five buses– AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ:

–if that’s what you’ve got. Or you would have to do a

trade-off with reliability if that were to happen. AUDIENCE: I think

most of my questions were on this very last

couple of questions. GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: We were aggregating

a bunch of data for– [INAUDIBLE] you did it

across both directions and then asked,

how does it change when you would like to evaluate

each direction separately in layover time? GABRIEL SANCHEZ-MARTINEZ: This

is the penultimate question, correct? AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ:

So that’s the hardest question

on the assignment. AUDIENCE: OK. GABRIEL SANCHEZ-MARTINEZ:

It is a challenge question because there are different

cases that you have to analyze. That’s maybe the hint, right? There are some cases. And for each case,

there is a probability that that case will occur. AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ: And–

let’s see if this starts– there’s a probability

that it will occur and then a consequence, or

something happens in that case. So you have to look at each case

and then aggregate the cases together, if that make sense. AUDIENCE: Yes. GABRIEL SANCHEZ-MARTINEZ: We’re

taking questions for Assignment 1, which is due on Thursday. Any other questions? AUDIENCE: That’s it. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

It is due at 4:00 so at class time

essentially, yeah. I actually [AUDIO OUT]

if you 4:00. I said 4:05, so you

have five minutes. AUDIENCE: Can you [INAUDIBLE]

what assumptions there are [INAUDIBLE]? GABRIEL SANCHEZ-MARTINEZ:

In what question? AUDIENCE: When you

said it seems to be the reasoning or assumption

about the schedule [INAUDIBLE]?? Which metric do you use? Based on the data,

which [INAUDIBLE]?? GABRIEL SANCHEZ-MARTINEZ: Yeah,

so that’s Question 3, correct? AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ:

So I can’t really explain. I can’t give you the

answer to the question. So what I’m looking for

there is your intuition and your understanding of why

you would pick which statistics from Question 2, where it tells

you calculate all these things. Now I’m saying pick

from those statistics what you would use

for t and for r. And you may want to combine

different statistics for the computation of r. Yeah? AUDIENCE: [INAUDIBLE]

multiple valid responses but– GABRIEL SANCHEZ-MARTINEZ: Yes,

some more valid than others, but some that are

definitely invalid and some that are almost 100%

valid but not 100% valid. So there are several

correct answers, and some that are

very good answers because you can justify

the choice of the statistic conceptually. Yeah. Any other questions

on Homework 1? I can take some more questions

after class, if that’s OK. So we had a snow day if you had

a good time, and/or at least, you could use it to catch up. So the schedule is a

little different now. I’ve posted an update about

that on Stellar (class site). There’s a new syllabus. And we’re going to do

some [AUDIO OUT] different [AUDIO OUT]. You may remember that we have

three introductory classes on topics of [INAUDIBLE]. And then, we had model

characteristics and roles. And then, [AUDIO OUT]. We’re going to

shuffle a little bit. [AUDIO OUT] Microphone working? So because the second assignment

is on data collection, we’re going to cover that today. And we’re going to give

you that homework today, so that you can get started

on the data collection side. Then, we’re going to cover some

of the short-range [INAUDIBLE] of planning concepts. Nema is going to do that– Nema Nassir. You might recall him from

the previous lecture. And then, we’ll finish

with [INAUDIBLE] and costs in March the 2nd, OK? So remember, there’s no

class on Monday the 21st. AUDIENCE: You mean Tuesday? GABRIEL SANCHEZ-MARTINEZ:

Sorry, yes, Tuesday. I think, there’s

no class on Monday. And then, Tuesday

there are classes. But it’s Monday’s schedule. So we don’t have class. Thank you for bringing that up. OK. I’ll leave Homework 2 for when

we finish with the lecture. But I’ll distribute it later. So let’s just get

started on that. So data collection techniques

and program design– that’s the topic for today. Here’s the outline. So we’re going to cover a

summary of current practice quite quickly. Then, we’re going to talk about

data collection program design process, the needs, the data

needs, the techniques for data collection, the sampling. We’re going to get

into the details of how we get sample slices. And we’re going to finish

with special considerations for surveys and

surveying techniques. so where are we? Where is the transit industry

in terms of data collection, and sampling, and these things? Largely, there’s

been a transition from manual to automatic

data collection. As you might imagine, with

the internet of things, and sensors, and the

internet, and wireless, it used to be that

if you wanted to have statistics on your

running times, you had to send people out. We call those people checkers. And those checkers would

have notebooks and record running times, and number

of people boarding, and these things. Nowadays, with the modern

systems, especially the modern systems, we have

several sensors and types of sensors that collect

some of that data for us. So we’re going to

cover both approaches. [INAUDIBLE] data

collection to supplement [INAUDIBLE] data collection. And if you happen to be

consulting for a developing country that is working with a

system that has not yet brought in automatic data

collection technologies, it’s also useful to know

all about the manual design and manual data

collection process. [AUDIO OUT] took this

class and ended up working in large consulting

firms have gone off to help countries put

in new transit systems. And one of the first

things they have to do is back to these slides and see

what the plan is going to be, and how many people you

need, and how much it’s going to cost. So very useful topic. So as I said, there’s

automatic data collection. There’s manual data collection. There’s sometimes a mix of

data collection techniques. Often, what happens is

that we just send people out and collect data. Or we just extract a sample of

automatically collected data. And we don’t really think about

sampling, and the confidence interval, and how sure

are we of that result that we’re going to influence

policy or make decisions that will affect service. How sure are we of those? So statistical validity. Often, there’s an

efficient use of data. And ADCS, which is Automatic

Data Collection Systems– we’ll use that abbreviation

throughout the course- presents a major opportunity

for strengthening data to support decision making. We’ll talk about

how that happens. Let’s first compare manual

and automatic data collection. So what happens with

manual data collection? You hire people, as I said. You hired checkers. So initially, there’s

no setup cost. There’s a low

capital cost to that. But there’s a high marginal

cost because if you want to collect more data,

you have to hire more people. Does that make sense? If you want to bring in an

automatic data collection system, you might have to

retrofit all your buses with AVL sensors. And that’s going to

cost you initially. So that’s a high

capital cost relatively. But low marginal cost– once

you have those systems in place, they keep collecting

data for you. And it’s almost free. You do need some maintenance

on these equipments. But comparing to

manual data collection, you have low marginal cost. Because of that marginal

cost difference, it tends to happen that when

you have manual data collection, you only pay checkers

for small sample sizes– just what you need. Whereas, once you put in

automatic data collection systems, they keep

collecting data. So you get much bigger data. Bless you. OK, in both cases, we can

collect data and analyze it for aggregate analysis

and disaggregate analysis. So you might want

passenger-specific data on things. Or you might want things

like just averages and aggregate things,

total number of passengers using the system. And when you’re doing

manual data collection, you can look at

quantitative things, things you can measure and count. Or you can also observe

things qualitatively. One example that I

saw in a recent paper was considering the

[? therivation ?] by student in some country. And they didn’t ask people

if they were students. They were looking at people’s– more or less, are they young? Are they carrying a backpack? And that would be the

labeling for your student. So that’s something that a

sensor might not do so well. Although now with machine

learning, who knows? But we haven’t seen that so. So you can do

qualitative observations when you’re doing

manual data collection. Manual data collection

tends to be unreliable, especially when people

aren’t very well trained and when you have a group of

different people collecting data. So each person might

have different biases. It’s hard to reproduce the

exact bias across persons. With automatic data

collection, you do the errors. And often, they

are not corrected. But if you do correct them,

and you estimate those biases just for them, you can end

up with a better result. Because of the small

sample sizes in manual data collection, you tend to

have to have limited spatial and temporal coverage of data. So for example, if you’re

interested in ridership in the system, it’s unlikely

that you will cover ridership in holidays for

[INAUDIBLE] system because there are

only a few holidays. And usually, you’re not

mostly interested in holidays. So chances are, you won’t have

data collection for holidays. Whereas once you install

automatic data collection systems, they keep

collecting data. So you get data at midnight

on President’s Day. So they’re always on. They’re always collecting data. Manual data needs to be checked,

cleaned, analyzed, coded, and sometimes put into systems

before they can be analyzed. That could take a while. You need to hire

people to do that. Whereas automatic data

collection systems often send their data to databases

in real-time or very close to real-time. [INAUDIBLE] you can start

analyzing things the next day. So you arrive in the morning to

your desk at a transit agency, and you have performance

metrics for yesterday. So you wouldn’t be able to do

that unless you have people working very hard if

you’re using manual data collection system. When we talk about automatic

data collection systems, there are many. But there are three types that

we refer to very, very often. And so the first one in AFC,

Automatic Fare Collection Systems. This is your fare box or your

fare gates in your smart card, your Charlie Card. You’re in Boston. You tap to enter the bus. And you tap to enter

the subway system. Increasingly, it’s based

on contactless smart cards. And those contactless

smart cards have some sort of

RFID technology with a unique identifier. When you tap that

card to the sensor, the sensor will read

that identifier. And it’ll do things like

fare calculation for you. But that record gets

sent to a database. And it’s there for people

like us to analyze and make good use of it for planning. So it tends to provide entry

information almost always. In some systems, like the

Washington, DC metro or the TFL subway, you tap in

to enter and exit. So you have both origin

and destinations. And if you always

have the systems on, then you have full spatial

and temporal coverage of all of the use of the system

at an individual passenger level. So very disaggregate–

sorry about that. Traditionally, these

systems are not real-time. So it might take a while

for those transactions to make it to the

data warehouse, where they’re available for

planners to analyze it. The calculation of how

much fare in some systems is in real-time. In other systems like

the Charlie Card, the stored value that you

have is stored on your card. So it may take a while if

you tap at a bus for that bus to go to a garage

and get probed– and for the data that has

been stored in that bus to be extracted from that

bus to the central server. There is a move– and we’ll talk more

about this when we get to fare policy and technology– towards using mobile

phone payments and using contactless

bank card payment systems. And those systems often

do the full transaction over the air in real-time. So we’re starting to

look at the possibility of having all this data

in real-time or almost in real-time. But it’s not there yet. AUDIENCE: [INAUDIBLE] can I

ask a question about that? GABRIEL SANCHEZ-MARTINEZ:

Yeah, of course. AUDIENCE: In terms

of smart card, where this balance is

stored on the card– GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: –if one can figure

out how to hack that card– GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: –then

what can [INAUDIBLE] fares through an elaborate

technology that I couldn’t do and most people couldn’t do. But maybe some could. GABRIEL SANCHEZ-MARTINEZ:

Yeah, definitely. So the Charlie Card system

is an example about– actually, MIT students

were the first to hack it. AUDIENCE: I’m not surprised. GABRIEL SANCHEZ-MARTINEZ:

So it’s older technology. It used a low-bit

encryption key. That’s a symmetric

encryption key. And they just brute forced it. They figured what the key was. They happened to use the

same key for every card. So once you broke that key,

you could take any card. And with the right hardware,

you could add however much value you want to that card. And– AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Yeah, yeah, exactly. We don’t think it’s

been a major problem. AUDIENCE: But it happens. GABRIEL SANCHEZ-MARTINEZ:

I haven’t seen MIT students selling special MIT cards. But that would be

criminal, of course. Yeah, so newer systems have

much stronger encryption. And they have different

encryption keys for each card. And certainly, when we’re moving

towards contactless bank cards, we’re talking about a much

more secure encryption. It’s your credit card

that you’re using to tap or your Android or Apple Pay. AUDIENCE: Account

based [INAUDIBLE].. GABRIEL SANCHEZ-MARTINEZ:

Account based– and essentially, what you

have is a token with an ID. And then, the balance is not

even stored on your card. The account server is handling

the balance and those things. So much more difficult to break. Yup. OK, AVL systems, or Automatic

Vehicle Location systems– so these are systems that

track vehicle movement. So for bus, they tend

to be based on GPS. You have GPS on a bus, on the

top of the bus, a little hub. And it collects data every five

seconds or every 10 seconds. And these positions might

get sent either in real-time, or maybe they get stored

on the onboard computer and then are extracted when

the bus reaches the garage. So just GPS– sophisticated

AVL systems for bus also have gyroscopes to do

inertial navigation and dead reckoning, especially when

the GPS precision drops. And that happens especially

with the urban canyon effect. If you have tall buildings,

GPS signal bounces around. The dilution of precision messes

up the position of the bus. Or maybe you’re

entering a tunnel, and you want to

continue to get updates of positions inside the tunnel. So this is a

temporary system that kicks in and interpolates

positions and figures out how the bus is moving. For a train, it’s usually

based on track circuits. So we’re going to talk

more about track circuits. But essentially, a

track knows if a train is occupying that segment or

not occupying that segment. And there are often some sensors

that read with RFID technology the ID number of a car. And sometimes, you have a

sensor in the front of each car and [AUDIO OUT] each car. And so a computer will look

up the sequence of readings and follow track circuits

as they are being occupied and unoccupied– and in that manner, track

trains throughout the system. These systems were put in

place mostly for safety to prevent train crashes. And because of that, you

would need it to know buses or where a train was. They are available in real-time. They were designed

from the beginning to track vehicles in real-time. So that’s what we have. I guess what’s newer

is that now, we’re collecting them and keeping

them in a data warehouse so that we can

analyze running times. AUDIENCE: [INAUDIBLE]

these systems have benefit to the consumer? GABRIEL SANCHEZ-MARTINEZ:

They do. And that’s the newest

thing that has happened– that nobody thought

about consumers when they were put in place. So yeah, we are

talking about tracking, knowing how many minutes

I have to wait for my bus, for example. And those things are pushed

through a public API, so that if I’m a

smartphone app developer, I can go ahead and pull

data from this next bus app and make an app. And so people can download it,

and they know how many minutes they have to wait. Yeah, so definitely. So we have seen a lot of AVL

being pushed in that manner. We have not seen so much AFC

data or APC data being pushed. Obviously, you wouldn’t

want all the details of AFC being pushed. But you might want to know

how crowded is my next bus, or how crowded is my next train. And you might actually

alter your decision whether to wait

for a crowded train or walk a longer time

based on that information. So that’s coming. I think, in the next

few years, that’s going to start happening. So passenger counting– many

different technologies exist. For bus, we tend to have these

optical sensors in the back. You might see them if

you pay attention– broken beam sensors. They look like two little

eyes with two little mirrors on each door. And so when you cross

the beams, if you press one beam first

and then the other, that sensor will know– is a person coming into the bus? Or is a person exiting the bus? And you have that at each door. And it counts those beams

going in and going out. And often, this is

slightly inaccurate. So you might get more boardings

and lightings for a given trip. So at the end of

a trip, whatever remains in terms of imbalance

between boardings and lightings gets zeroed out. And the area is distributed

throughout that trip that was just run. And often, you still have

to do some error correction after that. But it’s a way of counting

people getting on and off. And that’s useful to get

how many people are riding the system and also

the passenger miles– the passengers multiplied

by distance, which is often a required reporting element

in things like the NTB, the National Transit Database. So for rail systems,

we have gates that count how many

times they open and how many times they close. So you might have that

kind of counting in rail. You also have

video-based counting– so camera feeds that

can be hooked up to a system that will

essentially track nodes moving inside that frame. And you can count things

that cross a certain line, for example. And you could do

that to count flows. And then for train, we also

have the weight systems. So this is only in trains. The braking systems in

trains apply braking force in proportion to the

load on each car. So if you have a

very heavy car, you need to apply stronger braking

force than in a car that is almost empty. If you don’t do that, then

you apply a lot more force per weight on the lighter car. That car is going to be the

one pushing the other cars or pulling the other cars

through the coupling. And that will eventually

break the [INAUDIBLE] at a faster rate. So what you want is,

each car to slow down at the same rate by itself

as much as possible. And for that, you need to brake

in proportion to the weight. And therefore, you have

these weight systems. They used to just do that. And more recently,

we hooked them up to a little

storage device that keeps track of the

weight and maybe Wi-Fi, so that each time it reaches

a station or the terminal, it sends the data off. And we might have a rather

somewhat [? unprecise ?] idea of how many people

are in the car just based on an average

weight of a person. And these are traditionally not

available in real real-time. [INAUDIBLE] you have questions? Yeah? AUDIENCE: You could

also just reconcile it with the other system, right? GABRIEL SANCHEZ-MARTINEZ:

Of course, yeah. AUDIENCE: So if you have– [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: –people early

can transport to get on to. GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Yeah, definitely. Yeah. And that’s cutting edge research

that’s happening right now. How do you do data fiction

and merge different systems? They all have errors. And how do you

detect when one is more erroneous than the other? And how do you mix

these data sources to get the most precise,

not just loads, but paths within a network and

things like that. Yeah. So any questions on these three

very important automatic data collection systems? AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ: Yup. AUDIENCE: So if

there [INAUDIBLE] AVL, what kind of reason

can be [INAUDIBLE]?? GABRIEL SANCHEZ-MARTINEZ:

So the question is, why might some of these

technologies produce errors? And in particular,

you’re asking about AVL. So each of these has

a different behavior. And within each of these

categories of technologies, each vendor’s system might have

specific things that happen. With AVL, the most

common thing is end of root problems–

detecting when a trip actually begins and ends. So AVL systems,

you have this GPS coming in every five seconds. Depending on your chip set, you

might get it more frequently than that. But you also actually

sometimes hook it to the doors. So if the door is opening, you

say, well, I must be at a stop. And therefore, let me

find which one is closest. So there are ways to correct it. But when you get to

the end of the route, it’s not clear always–

have you finished your trip? Or rather, are you

starting your trip already? So maybe if the terminal is at

the same place on the trip– the previous trip

ends at the same place that the next trip

begins, there might be a time where the doors

open and close various times. And the trip isn’t

ready to leave yet. And so you really have to

wait to see the bus leaving that terminal and moving. Sometimes, there

are false starts. So maybe another bus comes

along, and it needs that space. So the driver moves the

bus a few meters forward. And the system thinks

my trip has started. And then, when you’re

looking at aggregate data, you’re looking at, say, running

times at the trip level. You see these outliers

with very long times. And if you were to

plot them by stop, you see that the link

between the first stop and the second step is

sometimes very high, 15 minutes. And so you can throw those out. Or you can do some interpolation

or imputation of data. Some systems that care

very much about that will purposely

place the terminal stops sufficiently far

apart to prevent that from happening because

it is a problem. And this data is crucial to

planning service and figuring out how much resource you’re

going to put into each route. So yup. AUDIENCE: For tap cards,

[INAUDIBLE] and metros, some of them we have

to tap out to exit. It is because of

variable [INAUDIBLE].. GABRIEL SANCHEZ-MARTINEZ: Yes. AUDIENCE: But in some systems,

it’s still a flat fare. You still have to tap out. Is the reason behind that

mostly data collection? Or is there anything

[INAUDIBLE] you’re going to still have to

tap out [INAUDIBLE]?? GABRIEL SANCHEZ-MARTINEZ:

So yeah, no examples of it come to mind. You might know one. AUDIENCE: MARTA? GABRIEL SANCHEZ-MARTINEZ:

OK, I haven’t visited. So yeah, data collection

might be a reason to do that. But I’ll have to get back to

you on why MARTA did that. But yeah, most systems that

have controls in and out are for fare policy

reasons and not for data collection reasons. We’re starting to see more

interest in data collection and in investing on

these technologies just for data collection. So maybe– but I’ll have to

check and get back to you. AUDIENCE: You mentioned some

systems separate their depots to not confuse the end

[? from the start point. ?] [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ:

Their terminal stops, yeah. AUDIENCE: What are

some examples of those? GABRIEL SANCHEZ-MARTINEZ: TFL

will do that in London, yeah. Yeah, so they’ll monitor this. And if they see that

this is occurring often, they will separate

the stops a bit. And the reason they do

that is because they have people whose job

it is to impute data when it’s incorrect. So if they don’t do that, and

the system is consistently producing bad data,

then that means they’re going to have to spend

human resources on correcting that data. So at some point,

it’s just easier to move the stop a little bit. It doesn’t have to

be a long distance. AUDIENCE: Got it. GABRIEL SANCHEZ-MARTINEZ:

It does not make the same and make it far enough apart

that the geo fences can be told apart from each other. Alright? AUDIENCE: Really small scale

data of the EZRide who I work for, actually you could

see real-time bus loads [INAUDIBLE]– GABRIEL SANCHEZ-MARTINEZ:

Oh, interesting. AUDIENCE: –which was actually

helpful if you’re dispatching, and you know a bus is

getting through people on it. [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Yeah, for real-time control. [INTERPOSING VOICES] AUDIENCE: But the

terminal at our station had a drop-off point

and a pick-up point. The drop-off point was

before layover [INAUDIBLE] was after for this exact

reason to make sure that it will go through

the drop-off point, reset, until people

get off of it. GABRIEL SANCHEZ-MARTINEZ: Yeah. Yeah, so it happens. [INTERPOSING VOICES] AUDIENCE: Definitely. [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

That sounds about right. OK, if there are

no more questions on the three very important

categories of automated data collection systems,

let’s talk a little bit about the data collection

program design process. So this comes from before

automatic data collection. And nowadays, we think a

little bit less about this. But it’s still important. So if you do need to

collect some data, there’s a structure that you

can follow to do it properly and to make sure that you

collect data efficiently, so that you don’t spend too much

resources on data collection and that you can answer

your policy or your planning questions. So based on your needs and

the properties of your agency, I say here, determine

property characteristics. That’s a North American term. A property is an agency. So if you see that,

that’s an agency. So based on the characteristics

of the service you’re running and your data needs, you can

select some data collection technique. We’ll get into what

some of these are. Then, you can develop

route-by-route sampling plans based on how variable

the data is in each case. And you can determine how

many checkers do I need. A checker is a person who

goes out and collects data. And then from that, the cost– so human resources. It’s a planning exercise. And what we do usually is that

we conduct a baseline phase. So that’s the first time

you go out and collect data. You don’t know much

about what you’re wanting to collect data on. So it might be only

matrices, or loads, the people getting on and off. So you have to go out

and do a bigger effort. And that’s called the

baseline phase effort. Once you’ve done that and you’ve

established some tendencies, you might want to monitor

that to see if it changes. So then, you do a lighter weight

data collection effort, where you go out and less frequently,

using fewer resources, you collect sometimes

the same thing. Or sometimes, you observe

something else that is related or can be correlated with

what you really want. And then based on a

relationship between the two, you can estimate

what you really want. So you can monitor

what you collected. And then, if you

detect that there’s been a trend or a change, and

you need to investigate it further, you might go ahead

and repeat the baseline phase to increase your accuracy. So one of the catches of

this is that to determine sampling plans, to

determine required sample sizes to achieve some

confidence interval, you need to know how

variable your data is. And if you haven’t collected

it yet, you don’t know. So you might have some default

values that you resort to. And we’ll get to that

later in this lecture. But you might also

do a pre-test, where you send some

people out, and you collect some data

to really start to get a sense of

how variable is it, and how big will my

sample requirements be, and how much will it

cost for me to do this. So this is the process

that you might follow. And there are different

data needs by the question that you’re trying to answer. So one way of looking

at that is, are you collecting things that

are for specific routes, or for specific route

segments, or at the stop level? Or are you using more aggregate

system level data collection? Are your questions

more system level? So system-level things

are more about reporting, and they might be tied to

things like federal funding. Whereas route-level things

and stop-level things are more important for planning. So when we talk about route

and route segment level, we’re looking at things like

loads at the peak load points or at some other key points. How many people are in the bus? The running time

is by the segment to do schedule that

has time points or maybe end-to-end to

your operations plan. Schedule adherence– are

these buses running on time? Or are my schedules

not realistic? Total boardings or

revenue, two things that are highly correlated–

so number of passenger trips. Boardings by fare

category– so you might say, well, I want

boardings, but I want to know how many

seniors are using this, and how many students are using

this, and how many people are using monthly passes,

and how many people are using pay-per-ride. So you have different

fare categories. And you might want to

segregate the data by that. You might want passenger

boarding and lighting by stop. So that’s what

APC would give you if you have an automated system. But you might also use a write

checker, who sits on the bus and counts people

boarding in a lighting. Transfer rates between routes–

to see you maybe you’re looking at changing

service so that people don’t have to transfer. Passenger characteristics

and attitudes– this usually requires some degree

of survey, where you ask people things,

passenger travel patterns. At the system level,

we have things like unlinked passenger

trips, passenger miles, linked passenger trips. This had the whole system level. So sometimes, you do route

level or route segment level analysis, and

then, you aggregate to get the system-level things. That’s usually how you proceed. But the requirements in

terms of how many of these you have to sample

might be different. So if you want to achieve a

certain accuracy at the system level, you don’t need

to achieve the accuracy for each of the routes

that are in that system because you might have– so if you want to

say 90% confidence in some system-level

data element, you might only need 80% or

70% of the element level. And once you bring

those altogether, you achieve the

90% that you need. So data inference, I talked

about how sometimes we can infer items if we don’t

observe them directly. So from AFC with AFC is a

low-fare collection system, we have boardings because

people are tapping into the bus or tapping into

the subway system. And if we have APC, we

count people getting on. So we can look at total

number of boardings that way, if that makes sense. That’s pretty direct. Sometimes, you want to correct

for errors in the APC system, or you might have things

like variation affecting that number– like it goes

from AFC to how many people were actually in that bus. How many people

actually boarded? So you might do a little

bit of manual surveys to check what that relationship

is and apply some correction. For passenger miles,

we need to know how many people are at the

bus between each stop here. So AFC gives you boardings

and only boardings. APC gives you ons and offs. If every bus had APC, then you

could calculate passenger miles directly. But often, you have systems

where only a portion of the fleet has APC. So maybe 15% of your fleet

is equipped with APC. And from that, you get

the sample OD matrix. And you can use that

OD matrix to convert from boardings only to the

distribution and the ons and offs at all bus routes. And from that, you can

get passenger miles. Or you might just

use your buses that have APC, if that suffices

for your data collection unit. Same thing with peak

point load– similar idea. The AFC only measures boardings. So it doesn’t give you the

peak point load automatically. But from APC, you could get it. And it you can establish a

relationship between boardings and the peak load

point, then you can use that model to

infer the peak load point from just boardings. So this is a key thing to

be efficient about data collection. Any questions on this idea? Yup. AUDIENCE: So to get

passenger miles, you’re also going

to have a GPS system as well to know the distance? Or are we just basically

[INAUDIBLE] this is the routing [INAUDIBLE]? GABRIEL SANCHEZ-MARTINEZ: Both. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Yeah, both. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

What tends to happen is that the APC, it’ll come in. And it’ll say, at this stop,

this many people boarded. This many people are lighted. So you have other

layers in your database that say where the buses

and what the distance is between stops and

the stop pair level. So you then essentially

know how many people are riding on each link

and how long that link is, and you multiply the two. So yeah, passenger miles. Yeah, more questions. AUDIENCE: Yeah, for these

checks that are going on like the more manual checks– GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: –I know

often, there’s derivation checkers who

are coming into a check. GABRIEL SANCHEZ-MARTINEZ:

That’s right, yeah. AUDIENCE: Do they also use

that data to cross-reference the passenger counts? As in, [? this ?]

person gets on, and they check everyone’s

voice to [INAUDIBLE] DFL. GABRIEL SANCHEZ-MARTINEZ: Yeah. AUDIENCE: They then know

exactly how they go on the bus. GABRIEL SANCHEZ-MARTINEZ: Yes. Yeah. AUDIENCE: Do they use that data? GABRIEL SANCHEZ-MARTINEZ:

Yeah, they can. In the APC, sometimes

there’s reliability problems, especially when

vehicles are very full because

sometimes, people will block the sensor by the door. Actually, people like

to stand by the door all the time, even when

the bus isn’t full. And that kind of affects APC. You might notice

this on the one. If you take the one– so yeah, you sometimes have a

little bit of a manual effort to figure out. Just learn about

your APC system, and what are the errors,

and when do you see them. It often happens that you

have more variation when you have very high loads. And that’s when APC

is least accurate. So it all comes together. Yeah. Questions on the back? I think I saw a question. No? AUDIENCE: Yeah, I

noticed that in Chicago, when the bus would be crowded,

then people get off the bus. They let people off– GABRIEL SANCHEZ-MARTINEZ:

That’s right. AUDIENCE: –and then back on. GABRIEL SANCHEZ-MARTINEZ: Yeah. Yeah. These double things. But somebody might be by

the door just blocking the two little sensors– [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ:

–the two little eyes. And that’s it, no records

of people getting on or off. So if you’re doing a little

data collection, as I said, we use checkers. And actually, your

second assignment, you will be checkers of some kind. The typical checkers which you

won’t be in this assignment are ride checkers

and point checkers. So a ride checker sits

in the vehicle and rides with the vehicle. And the typical thing that these

ride checkers are looking at is, how long did it take

to cover some distance? So what was the running

time for that trip? And also, people

getting on and off– so they act as APC essentially. And they act as AVL. So AVL and APC

together might replace most of the functionality

of a ride checker. Although a ride checker often

can conduct an onboard survey, asking passengers about

where are they going, or their trip purpose, or things

related to social demographics, which are qualitative and cannot

be collected with the sensors. Point checkers stand

outside of the vehicle. They stay at a specific

place, and they can look at headways

between buses– so how long did it take

between each bus to come by, and how loaded were these buses? So if you’re interested

in the peak load point, and you know where the

peak load point is, and you just want

to observe, measure what are the loads

of the peak load point, then you can

just station a point checker at the peak load point. And if that person

is strained, we’ll be able to more or less say how

many people are in the vehicle from looking at the vehicle. With automated data

collection systems– yeah, with a fair system,

we have passenger accounts. We have transaction

data, which is very rich. It will tell you not

only that somebody is entering or exiting,

but also how much they’re paying, sometimes information

about the fare product type, which might help you

infer if this person is a senior, or a student, or a

frequent user, an infrequent user– so many things that are

very useful for planning. And we’ll get to play with some

of these later in the course. And then, there’s Automatic

Passenger Counters, APC. So as more and motor systems

switch to automatic data collection, we still use

some manual data collection, but not in the

traditional sense. Now, we reserve those

resources for things like surveys about social

demographics and other things. And we also carry out

web-based surveys, which would have some biases. But if people

registered their cards, and you have email

accounts, you can maybe send a mass email to everyone

and carry out surveys. The MBTA does that. Maybe some of you

are in the panel of people who are e-mailed

every now and then. Is anybody in that panel? No hands. I’m in that panel. But I know somebody must be. So yeah, they send an email, and

they ask about your last ride. And they say, where

did you start from? What were you doing

this trip for? How long did you have to walk? Are you happy with the system? Was your bus on time? Yeah, things like that– how satisfied are you? It’s a survey with

qualitative questions that you couldn’t

collect automatically. It’s [INAUDIBLE] seeing

things about your experience outside of the bus, which

there are no sensors for. All right, sampling

strategies– a bunch of different ones

and the simplest one is called simple random

sampling– very, very simple. So when you have

sample random sampling, what happens is that

every trip, if you’re looking at surveying trips,

for things like how many people boarded this trip– let’s take that as an example. Then, if you’re using

simple random sampling, every trip has equal likelihood

of being picked and being surveyed. So if you go through

your process, and you determine that

you need to observe 100 trips to get an

average reliably. And you’re going to use

that to plan something, then you need to

look at 100 trips. So if you use simple

random sampling, you take your schedule, and

you randomly pick 100 trips. And that’s your sample. Those are the ones that you

send people out to collect data. Now, there’s a little bit

of a problem with that. It’s not the most

efficient method because if you’re going

to send someone out, and that person is going to be

active, and require some time to get to the site and

some time to return, then once they’re out

there, you want them to collect as much as they can. So that’s not simple

random sampling. That’s cluster sampling. Before we get to that

systematic sampling– so typically, instead of

picking randomly, we say, OK, we need to get

10% of the trips. So let’s just make it

such that we count. And maybe it’s every five

trips, we have to survey it. So now, it’s evenly spaced. And this is useful

for some things. One example is weekday,

picking the weekday that you’re going to survey on. So the technique that is often

used is sample every six days. Why would that be? Yeah. So if you do it every seven,

then you always have a Monday. And that’s going

to get some bias if Mondays happen to

be low ridership days or high ridership days. So if do every sixth

day over a year, you have a good sample

of every week day. So that’s an example

of systematic sampling. But you still have

that issue of it might not be the most efficient. Cluster sampling, sometimes

it’s more efficient once you send out

a person to collect data to do as much as possible. And you survey a cluster. So one example is, if

you’re distributing surveys to passengers, and you need

to distribute 100 surveys. If you do 100 simple

random sample, then those people might be in

different parts of the system. And one might be

the first person you see getting off

at South Station. And then another

one by me might be the first person you see getting

off at the Kendall station. So that’s very inefficient. So a cluster might be

everybody on board a bus, and that will get a

bunch of people together. However, it’s not as efficient

statistically to do that. So you can’t just add

up to 100, and you’re done because there might be some

correlation within the people riding that vehicle

that they will tend to answer in a similar way. So you might need to

increase your sample size when you use this technique. But still, you might have a

more efficient sampling plan. Then, there is the

ratio estimation and conversion factors. We gave examples

of this already. This is in the context

of baseline phase and then monitoring phase. So you start out with

a baseline phase. And in the baseline

phase, you collect the thing you really

want and something that is very easily collected

with lower resources. And you make a model

of the thing you really want as a function

of the thing that is cheap and easy to collect. And then, on the

monitoring phase, you only measure the thing that

is cheap, and easy, and quick. And you then use the model to

estimate what you really want. So converting AFC boarding

to passenger miles, we give an example of that. We’re converting

loads at checkpoints to load somewhere else. So maybe only measure

loads with a point checker at the peak load point. And you have some relationship

to convert those loads to loads at other key transfer

stations as an example. And then, the

stratified sampling– so one of the things that

determines how big of a sample you need is the

variability in the data that you’re collecting. So correlation,

when you’re looking at a whole system with multiple

routes or multiple segments– maybe when you

look at one route, there’s some variability

of running times. But they have a central

tendency as well. And when you’ve

got a second route, you have also some

variability and a different central tendency. So you bunch all

the data together, some of the variability across

data points in our data set are going to be the inherent

variability of each route. And some of it will be

systematic– the differences between both routes. So if you do a

simple random sample, and you don’t separate

the systematic variability from the inherent

variability, then you’re going to get a

wider variability. And you will require

a bigger sample size. Stratified sampling

is an approach where you determine sample sizes

for each of these separately. And it’s more efficient

if you do it well because you eliminate

the need, or you at least reduce the need, to

collect data for the sake of the systematic differences

between different parts of the system. Any questions on these methods? Yes. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Yeah, so let’s maybe pick

another example. Let’s say that you’re looking

at the proportion of passengers in a bus who are students. And you’re

distributing a survey. And they tell you whether

they’re students or not. And you want this

for the whole system or for at least a

group of routes. And it tends to be that some

routes don’t serve universities and don’t serve schools. So they have a lower

proportion of people. And then, some routes that

do go through universities, and they have a higher

proportion of students. So if you just want the

system-wide proportion of people who are students, and

you join all these data points together, there’s going

to be a lot of variability in what proportion that

is across every trip that you survey, correct? So in some sense,

it will indicate that because of

that variability, you’re going to need a

higher sampling size. You’re going to have

to survey more trips to get at your desired

accuracy level and tolerance. But now, if you say no, I’m

going to split routes in two, into two stratas. One is the routes that

serve the universities. And these tend to have

around 50% proportion. And then, there’s the routes

that don’t serve universities. And these tend to have

proportions near 0. So if you’re in your 0, you

might require a lower sample size to cover those. And you can just

very efficiently cover most of your

bus routes that way. And then, focus your

efforts on just the ones that have higher proportion. And you achieved your

system-level tolerance requirements with much fewer,

with by far fewer resources required to collect the data. Does that answer your question? Yeah. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

So what he meant by inherent is that within each bus

route or within each strata, there will be some variability. Even within the trips that

are serving universities, every trip might have

a different proportion. So there’s going to be a little

bit of variability in that. But if you mix that with trips

that are not serving students, then you pull all

that data together. Then, it’s going to look like

the variance of that data set is much higher. All right, so we’ve

tossed these terms around– tolerance,

confidence, level accuracy. So let’s define

them more precisely. Accuracy– when we

talk about accuracy, that has two dimensions. So somebody might say, the

average boardings per trip is 33.1. And then, the

question that follows is, do you mean exactly 33.1? How certain are you of that? And how accurate is that? So when we talk about tolerance,

there’s relative tolerance, and there’s absolute tolerance. Relative tolerance

is expressed in terms of a percent of the amount you

were collecting or a fraction. So you might say mean

boardings per trip is 33.1, plus or minus 10%. And that’s the 10% of 33.1. That’s why it’s

relative tolerance. Then, there’s

absolute tolerance. So mean boarding per trip

is 33.1, plus or minus 3.3. Now, in this case, these

two are equivalent. 3.3 in absolute

terms is 10% of 33.1. But this was expressed

in absolute terms, and the previous one was

expressed in relative terms. So don’t always assume

that if you see a percent, it’s relative because if what

you’re measuring is in itself a percent, unless you’re

using a percent of a percent, then it’s absolute. So here’s an example. Mean percentage of students

is 23%, plus or minus 5%. That’s absolute because

it’s 5%, not 5% of 23%. First, we talked about,

is that exactly 33.1? Or is it something

different from 33.1? Then, the second question

is, how sure are you, how confident are you

that the number you give, plus or minus the tolerance

you give, is the right answer? So now, you say

I’m 95% confident that the mean boardings per

trip is 33.1, plus or minus 10%. So now, you combine

the tolerance with the confidence level. And that’s the full

expression of your accuracy. And that’s what you need when

we look at the data collection. So you have two different

things that you could play with. And what happens typically

is that you choose a high confidence level– 90%, 95 percent are typical. And then, you hold that fixed. And you calculate what

level of accuracy you need. Or rather, you decide

what level of accuracy you need, depending on the

question you want to answer, and the impact it could

have on the system. So if you’re looking to

[INAUDIBLE] something that will have very significant

effects on the service plan or maybe on investment

in the system, then you might need

a higher accuracy. But if you’re collecting

data just for reporting, maybe it doesn’t matter as much. And you don’t need to spend as

much money on data collection. So as an example here, the

National Transit Database– NTD, we call it NTD– for annual boardings and

passenger miles, it says, you should collect

data to achieve an accuracy of 10%, relative

tolerance at 95% confidence level. You need both. So take home message about this. The other thing,

the t distribution– so this is a probability

distribution that is bell-shaped. It kind of looks like

the normal distribution. And it approaches the

normal distribution as the sample size

gets very large. This is the distribution

that arises naturally when you’re estimating the

mean of a population that is normally distributed with

unknown mean and variance and some known sample size. So to the right here,

we have your equations that I’m sure you’ve seen

before for sample mean, sample variance. And I guess, what’s

important to think about is that the

distribution of what you’re collecting–

for example, you might be collecting data on a

number of people boarding route 1. So that might have

some distribution. As you collect

more and more data, so as you survey

more and more trips, the distribution of how

many people board each trip does not necessarily

have to be normal. But it turns out from

the Central Limit Theorem and other laws and properties

of statistics and probability that the distribution

of the estimator– so the distribution of the

mean that you calculate based on that sample that

you collected– is normally distributed as

the sample size increases. So if you have a

lower sample size, instead of using the

normal distribution, use t distribution. Sometimes, we call that

a student, the t student distribution. And this distribution gets wider

as the variability increases and as the sample

size gets smaller. It has a property called

degrees of freedom, which is sample size minus 1. And you can see from this

chart right here when you have degrees

of freedom equals 1, which means you

collected two data points, it’s wider than when

V approaches infinity. And what you have in black here,

the thinnest and least variable of these, is essentially

a normal distribution. And this is the distribution

not of what you collected. It’s not the distribution

of the number of people who boarded route 1. It’s the distribution of

the mean that you estimate. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Exactly, it’s a sampling distribution

of the mean. And if you were to repeat that

experiment with the same number of trips but different

number of trips, you might get a

slightly different mean. So if you were to repeat

that many, many times, the distribution of those means

would be shaped in this manner. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ: Yeah,

well, student t distributed. And as sample size increases to

infinity, normally distributed. Harry. AUDIENCE: So just for V equals

5, I think you [INAUDIBLE].. GABRIEL SANCHEZ-MARTINEZ: 4. AUDIENCE: 4. GABRIEL SANCHEZ-MARTINEZ:

Sorry, 6. 6. AUDIENCE: Approximately

5 [INAUDIBLE].. GABRIEL

SANCHEZ-MARTINEZ: Yes, 6. Yeah. I mispoke. [INAUDIBLE] AUDIENCE: When there’s a sample

variance, sigma x squared equals roughly. Is that not supposed

to be an equals? Is that not the way the

sample variances define? Because I thought it’s the– GABRIEL SANCHEZ-MARTINEZ:

So– –it’s below the variance

of distribution. But that’s roughly [INAUDIBLE]. AUDIENCE: Yeah, I guess

the issue is that you don’t know the true mean. So you’re using an estimate to

calculate the sample variance. And therefore, it’s almost,

almost the sample variance. GABRIEL SANCHEZ-MARTINEZ: Right. But I thought– AUDIENCE: You’re

using an estimator to do the– that’s

what you have to do. [INTERPOSING VOICES] AUDIENCE: He’s

incorporating the fact we’re dividing by n minus 1

rather dividing by [INAUDIBLE].. GABRIEL SANCHEZ-MARTINEZ:

No, so n minus 1, that has to do with the

degrees of freedom issue. And that’s to go from population

variance to sample variance. But the other thing

that happens is that if you’re doing

the population, then you know exactly

what your mean is. It’s exact, right? AUDIENCE: Yeah. GABRIEL SANCHEZ-MARTINEZ:

And then in that case, you would know what the

exact variances is as well. Yeah. So the n minus 1

is just to remove a bias that would arise from

collecting only a sample. AUDIENCE: But here

for example, you can say this is

equals to [INAUDIBLE].. GABRIEL SANCHEZ-MARTINEZ:

Yeah, yeah, yeah, yeah. AUDIENCE: You’re

working with the sample to know it would be an

approximate [INAUDIBLE].. GABRIEL SANCHEZ-MARTINEZ:

Yeah, in practice equal 2. AUDIENCE: As your

sample distribution increases, then obviously,

your sample increases– [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ:

And therefore, this becomes more and more accurate. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Exactly. AUDIENCE: It should be

approaching more [INAUDIBLE].. GABRIEL SANCHEZ-MARTINEZ:

Yeah, so I guess what’s

important to realize is that this is an estimate of

the population variance, which in itself uses another estimate. And I guess, that’s

why that’s there. But it’s a very small detail. I didn’t mean to distract you. AUDIENCE: So for the n, is it

the sum of all the different samples of [INAUDIBLE]

or is it just– [INTERPOSING VOICES] GABRIEL SANCHEZ-MARTINEZ:

So you don’t ever repeat the experiment like this. This is more of a

theoretical explanation to why there is a

distribution to the mean, even though you only have one. You only have one mean, right? Because you’re going

to collect data. And once you finish

collecting data, you’re going to calculate

the mean of all that data. So you only have one mean. If you were hypothetically

to repeat that experiment, and you calculated separate

means for each one, then you would

get a distribution that would look like this. In practice, you would just

increase your sample size and still compute one mean,

which would be more accurate. Yeah. OK, let’s move on. So tolerance and

confidence level– so we have these distributions. These are the distributions

of the statistics, of the mean in this case. They are bell-shaped. As your sample size increases,

the degrees of freedom goes up. And your accuracy goes up. And the variance of that

statistic distribution decreases. So it gets thinner. So here in red, you have a

distribution with a smaller sample, and therefore, less

accuracy or less confidence would look like. And then as you increase

your sample size, you see that it

becomes more peaky. So when we talk about

tolerance, and let’s come back to the concept

of absolute tolerance in particular, we’re

talking about the distance between the center of

that distribution, which is a symmetrical

distribution, and some limit. So we’re saying, if you have

a tolerance of plus/minus 10. Then, you’re going to

measure 10, say 10 boardings, from the center to the right

and from the center to the left. And that’s your

absolute tolerance. So when you calculate

absolute tolerance, you can express that

tolerance as a function of the variance and/or

the standard deviation, rather of your mean. So instead of saying 10,

you could say 2 times the standard deviation of that

distribution using the equation that we just calculated. And that’s very convenient. Why would we do that? Why would I want to

complicate things that way? AUDIENCE: [? Outside ?]

[? of ?] a cumulative GABRIEL SANCHEZ-MARTINEZ:

No, I mean, there’s a mathematical convenience here. What is this a function of? It’s a function of

the standard deviation of the thing you were collecting

and your sample size, right? And what do we want to do? We want to determine

how many things we need to collect, right? So here we go– we have n. And now we can solve for

n, we have the sample size that we require for

a given tolerance. So we’re going to decide

what the tolerance is and calculate sample size, a

minimum required sample size. You can always

collect more data. All right. So again, to review,

this is the same equation I had in the last slide. You have absolutely tolerance. You can express

that as a multiplier times the standard

deviation of the mean. And then you solve for n,

and you get this equation right here. t is your tolerance

and you can– oh, sorry. t is the number

of standard deviations from the mean. d is your tolerance,

which you choose. And this is something

that you know, or collect, or approximate. So these are all given. Where does t come from? Well, we said that

we’re going to use the t distribution, right? So the t distribution

has a table– or it has a certain

shape, rather. And using Excel or

looking up at some table, you can figure out

what t is for two times the standard deviation

from the center. So you can just plug it

in from Excel or from– it’s a property of the

distribution, essentially. Once you pick a confidence

interval, you know t. If you want to go to 95,

it’s a certain value. If you want to go to 90,

it’s a different value. OK. When we look at

relative tolerance, relative tolerance is

just absolute tolerance divided by the mean that

you are collecting, correct? Because instead of saying

plus or minus 10 boardings, we’re saying plus or

minus 5% of the mean. So we just take absolute

tolerance and divide by x bar, the sampling mean,

the sample mean. And we solve for n again. So what we have now, it looks

very similar as to the question right here. But now we have the mean

and the denominator. OK, this quantity,

standard deviation divided by mean, sample

standard deviation divided by sampling mean, is called

the coefficient of variation. And there’s a

convenience to this. And there’s actually

a reason why sometimes relative

tolerance is preferred to absolute tolerance. It’s because of

this, because there’s a mathematically convenient

characteristic of property coming out of this– that you don’t need to know

the standard deviation of what you’re collecting to figure

out your sample size. We’re kind of running

in circles here, right? We’re saying that to

determine sample size, you need to know the

standard deviation. Well, I haven’t collected data. So I don’t know how

variable the data is. So that’s an issue. Now I have to

estimate what that is. It tends to happen that the

coefficient of variation is a more stable property

than the variation in itself, than the variance or the

standard deviation itself. So you’re more

likely to get away with using default values for

the coefficient of variation than you are with assuming a

specific standard deviation. AUDIENCE: It should be noted

that it’s unitless, coefficient of variation. GABRIEL SANCHEZ-MARTINEZ:

Yes, it is unitless. Thank you. OK. So what happens is that relative

tolerances are typically used for averages. So here’s an example– you measured 5720

boardings plus minus 5%. So if you were to get

the absolute equivalent of the absolute

tolerance of that. That would be 5% of 5720. That would be 286 passengers. That’s a weird thing to report. 5% is more

understandable, right? And it kind of makes more sense. So that’s what we want

naturally, anyway. So as I said, the

coefficient variation is typically easier to guess

than the mean and the variance separately. So we use that. Here’s an example using

the t distribution, where the sample

is not large enough to assume a normal distribution. So we say, let’s have a relative

tolerance of plus minus 5%, a confidence level of

95%, and a coefficient of variation of 0.3. So we start out

assuming large sample, and therefore degrees

of freedom is infinity. We can use the

normal distribution. If we look at the

normal distribution, with plus minus 5%, confidence

level 95%, the t is 1.96. So we look that up on a table,

or we use Excel norm dist, or– yeah. t dist for t and

norm dist for normal. We got 1.96. We plug in the

relative tolerance, the 0.3– we get 140. 140 is not quite

infinity, right? So if we look at 140

as a sample size, that would imply that all the

degrees of freedom is 139. Now we go back and

look at the t dist, and we change 1.96 to the

value from the t distribution for that degree of freedoms. And we get 140.73. So you’re sort of seeing

that you were almost right. 140 is very large. In practice, you would

just round up a little bit and get a nice round number, and

you would even play with this once you’re looking at planning

who you’re going to send out and how many hours

you’re going to collect. You want to get at

least 141, but if you’re going to have people in units

of eight hours, for example, or units of four hours, then you

might as well finish the batch for four hours, the last one. Maybe you’ll get

150, 160 from that. Here’s an example

of that equation with different assumptions

of confidence and tolerance. And so we’re using

90% confidence, and we’re assuming a

certain sample size here. So you can see that, as the

tolerance decreases, which means that you require

a greater accuracy for different

coefficients of variation, the sample size can

get really large. So if your data is

not very variable, then you can sample

just a few trips. And you know because

they don’t vary that much what the mean is. But if there’s a lot of

variability across strips, then you need more. So that’s what you see as you

go down the rows on this table. Here we have tolerance. If you only have to be 50%

accurate, plus minus 50%, then you don’t have to

collect that much data. If you want to be

more precise, and you want to say plus minus 5%, then

you need a bigger sample size, right? OK. Proportions– and the

homework, actually, is based on proportions,

so this is important. Consider something, a

group of passengers, to estimate the proportion of

passengers who are students. So from probability,

when you are looking at an event

that can either be 0 or 1, or black or white– in this case, students

or non-students– there’s a certain probability

that that person is a student, right? And what you want to

estimate is that probability or, in other words, what

percent of the things you observe are students. So from the properties of

the Bernoulli distribution, the variance is p

times 1 minus p. So if everybody is a student,

or nobody is a student, either way there’s no

variability, right? So you would have 1 times 1

minus 1, 1 times 0, 0– no variability. Though at the peak variability,

the highest variance of this distribution, is

when 50% of your people are students, so 0.5

times 1 minus 0.5, 0.25. That’s the highest variance, OK? So the tolerance is

typically specified in absolute terms when you’re

estimating proportions, because the proportion

is in itself a percent. So you use absolute tolerance. And you just substitute,

essentially, this variance. You put in the variance of

the Bernoulli distribution, which is p times 1 minus p. And that’s how you get the

sampling equation, sample size requirement equation. Here’s a problem. We don’t know in advance what

the proportion will be, right? And we need that to know how

many people we need to survey to figure out– or how many

trips we need to survey to figure out– sorry, how many

students we need– how many riders

we need to survey to figure out what the average

number of students are. OK, so– AUDIENCE: And it’s also

a [INAUDIBLE] p times 1 minus p [INAUDIBLE] is

a constrained number. GABRIEL SANCHEZ-MARTINEZ:

It is a constrained number, and that’s exactly

where we’re going. So we use something called

absolute equivalent tolerance instead of absolute tolerance. We assume that p is 0.5– that’s the maximum it could be. So let’s go ahead with

a worst case scenario. And then what happens

with p itself? Well, if your percent

is high, then you can tolerate a

bigger number, right? So if it’s 32%, you’re

probably OK with plus minus 5%. If your average were

1.2, plus minus 5% is not that good, right? You need a higher– you need a much stricter,

tighter confidence interval for that. So probably not good to do

plus minus 5% in that case. AUDIENCE: [? Well, do ?]

[? you mean ?] you have a plus minus 5% absolutely percentage? GABRIEL SANCHEZ-MARTINEZ:

Absolute, yeah. AUDIENCE: And you’d be

going negative [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Negative, which is possible but

difficult to interpret. AUDIENCE: Sorry, so this isn’t

actually 32% plus or minus 5% of 32 [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

It is not– yeah, it’s absolute tolerance, not

relative tolerance, right. So what’s convenient about this

is that these two factors work in opposite directions. So as you get bigger, as the

proportion gets closer to 50%, the variance increases. So oh, well, we need

a bigger sample. But your tolerance

increases as well, so you don’t need

as big of a sample. And so it’s convenient. And the practical solution

is assume p is 0.5 and work in terms of absolute

equivalent tolerance. So you pick a tolerance

under the assumption that our proportion is 50%. And here’s what happens. Yeah, if the expected

proportion is 50%, and you say plus minus 5

percent, what you would get is this 5%, if it

turns out that p is 5%. But if it worked more to the

extremes, like 5% or 95%, what you would actually achieve

from having planned the survey, assuming 50%, is 2.2– so much better,

much more acceptable to say 5% plus

minus 2.2%, right? So it works out. And there’s a

convenient equation if you assume a very large

sample, or large enough sample, and you pick 95%, 0.25,

which is the variance times the normal

distribution t squared is 0.96, which is almost 1. So then you get this equation. You take 1, you divide

it by the tolerance that you want, your equivalent

tolerance, and that’s your sample size. So it doesn’t depend on anything

about the data in itself. You just say if I want, on

whatever I’m collecting, whatever proportion

I’m collecting, a 5% absolute

equivalent tolerance, then I need 400

surveys to be answered. Yeah? AUDIENCE: So this

assumes a random– GABRIEL SANCHEZ-MARTINEZ:

Simple random sample. AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Yes, a simple random sample. So you would increase

these numbers if you are using

cluster sampling to account for correlation. You would have to increase

them if you’re giving people a survey, and not all of them

answer the survey, because you need 400 surveys answered. So if only half of the

people answer the survey, then you need to

distribute 800 surveys. AUDIENCE: Do you

recommend calculating also that the standard error after

this so that [INAUDIBLE] make sure? GABRIEL SANCHEZ-MARTINEZ:

Absolutely, yeah. You want to go back and

check with the standard error and when your confidence

interval is and see if you meet it or

if you need to add a few days of data collection. AUDIENCE: Right. GABRIEL SANCHEZ-MARTINEZ: Yeah. OK, so with proportions, you

need a very large sample size to estimate a proportion

if you want accuracy. If you say absolutely

equivalent intolerance of 4%, then you need 600. That’s a big number, so it

just gives you an idea of that. If you get greedy

with the tolerance, you have to pay for the

surveyors to go out. OK. So the process is you determine

the needed sample size just with the discussion of the

equations that we discussed. Then you multiply

the sample sizes. If you’re using stratified

sampling or if you have questions that

have multiple variables, you need to then make sure

that you achieve that sample size for each combination of

things that you’re measuring. So if you’re, for example,

looking at not just boardings, but proportion of

passengers that are car-owning, who are pleased. So you could just independently

measure pleased, independently measure passengers

who own a car. And you might have the

tolerance you need on each one, but if you want the

combination of that, now you need a higher

sample, because you need that number for the

combination of those things. Then there’s a

clustering effect, so a typical thing

if you’re doing the clustering of a whole

vehicle of passengers is to multiply by 4. And then for things like OD

matrices, the rule of thumb is 20 times the number of cells. What does that mean? That if your OD matrix

is quite aggregate, and it’s at the

segment level– so say you divide a root

into two segments, then your OD matrix

has four cells. Four cells times 20, that’s how

many people you have to survey. If you do error

at the stop level, then you have many more stops

and, therefore, many more cells and, therefore, a

much higher sample size. If you have a response

rate that is not 100%, which is always

the case, then you have to expand by 1 minus that

in the reciprocal– sorry, 1 over that in the reciprocal. And then you get a

very large number, and you say I don’t have

the budget for that. And you have to make tradeoffs

and figure out what you can do. And maybe you have to– maybe you can’t collect

this combination and know that accurately, right? So you revise your expectations. OK, with response

rates, you are concerned with getting the

correct answers. You also want to be getting

a high response rate. If you don’t get a high response

rate, there might be a bias. So you have to worry about that. If you have low

response rates, that means you need to

distribute more surveys, and that costs money. And there’s the bias

that I just mentioned, so people who don’t respond may

not be responding for a reason. And then done that

might bias your results. And that might make you

decide something in planning that is not the right decision

based on what actually happens. So we call that the

non-response bias. OK, so what happens? People who don’t respond

might be different or might have responded

differently to the question had they responded. So here’s some examples. If you’re surveying

people who are standing, they are less comfortable. And maybe it’s a crowded bus–

they are less comfortable. Or maybe they’re getting

off one of those stops that is coming up, so they are

less likely to have the time to respond to your survey. People with low

literacy, teenagers, people who don’t

speak the language, are less likely to respond. And they might have

different travel patterns. So if you understand

those things, and you get lower

samples for them, you might be able to do

some sort of correction to those biases. But you have to pay attention. How do you improve

your response rate? Well you can make your

questions shorter. You can do a quick oral survey. That’s what we’re going

to do for this homework. You can try to get information

from automatic sources whenever possible. So if you have an AFC system,

let’s not collect boardings, because we know that. And then of course some

training, and just being kind, and having supervision

helps a lot. OK, here’s some

suggested tolerances for different things. So we’re looking here at

boardings or the peak load. And you see here that

the suggested tolerance is 30%, plus minus 30%, when

you have a route with one to three buses. And then as you have

more and more buses, the tolerance decreases. That means you require

a larger sample. Why is that? Why do you need a

bigger sample if you have a route with more buses? AUDIENCE: You’re less likely

to sample a different bus. GABRIEL SANCHEZ-MARTINEZ: Yes,

and when you have higher– when you have more buses, you

tend to have higher frequency. There’s bunching. OK, so if you then survey

loads, for example, and you only get a few

because of the bunching effect and because there

are more buses, and you’re observing a

smaller percentage of them for a given time

period, say, you’re less likely to have observed

the bus that was really crowded, right? So that means that you need

to decrease your tolerance. And therefore, it’s more

expensive to survey that. OK, good. Trip time– 10% for routes

with less than 20 minutes, 5% with routes of

greater than 20 minutes. Similar concept if you have

greater than 20 minutes– there can be just

more variability, and you really want

to get that right. When you have less

than 20 minutes, your decision on

cycle times and things like this are not going to have

as much impact on the fleet size that you require. As you get bigger running

times, a small percentage change in the mean could

influence how many buses you need to dedicate to

that and the cost of running that service. On-time performance– 10%

absolute equivalent tolerance. These are typical values– don’t

take them as gospel, please. And these are for reporting,

not for anything that’s very critical for operations. Some of them are. Yeah, 30% at least, I would

say, is for reporting. I wouldn’t make any

critical decisions with 30%. On-time performance– we’re

talking here about whether a trip is on time

or not on time– so Bernoulli trials, right? And there’s a

proportion of trips that are on time, and what

we do is that, we essentially say plus– if we say plus

minus 10%, then we’re saying that the sample size

should be 1 over 0.1. Yeah. All right, default

coefficient– these are default values

for coefficient of variation of key data items. Ideally, you have your

own data that you look at, and you don’t resort to this. But if you ever find

yourself in a situation where you need to start

out with something. Here are some based on studies

that previous [AUDIO OUT] They took different routes

and looked at loads and running times for

different time periods and found what the coefficients

of variations were. And here they are on a

table for you to use. In the interest of time, since

I want to discuss the homework, I’m going to stop

here with slide 25. And I’m going to not cover

the whole process, which includes the monitoring phase. And in this slide

here, we have how you establish conversion factor. The conversion factor in

itself has a variance. So there’s some uncertainty

about the relationship that you estimate between

your baseline data item and your auxiliary data item. So you need to consider

that in your sample size. And here are some tables

with some examples of what happens when you require

different– well, when you’re variability of or

your coefficient of variation of your

relationship increases or decreases. OK, let’s look at the homework. I really want to use these

last five minutes for that. So please take one and pass. OK, so the MBTA, there’s

a proposal here in Boston of taking Route 70 and 70A– they run through

Waltham, and they go into around Central Square. And some people are saying those

two routes should be extended to Kendall Square,

because a lot of people are actually going to MIT, or

Kendall Square, or the Kendall Square area– not just Kendall Square Station,

but the whole area around. So if it’s true, A

lot of people could benefit from that extension. And we don’t know. So what are you going to do? You’re going to go

to a specific stop where it is very likely that

the people who would be going to MIT or those areas of Kendall

Square that would benefit from this extension

would alight, and you’re going to

ask people, would you have stayed on your bus

if this bus had continued to MIT and Kendall Square? It’s a simple oral survey, yes

or no question, one question. You’re going to work in

teams of four people. The stop that you’re going

to station yourself in is shown in figure 3. And you’re going to collect

data for the AM peak, from 7:30 to 9:30. You pick the day. The teams are

assigned on Stellar, so please log into Stellar

and see what your team is and coordinate with

them to pick a day. And tell me what that

day is, because– actually, right after

class, I’m going to set up a shared spreadsheet

that you can all access. And just go into that

spreadsheet and pick a day. I’m going to put all the

days that are available, and you can say team

1, team 2, et cetera. Make sure that two teams

don’t go on the same day. We want data from

different days. And you’re going to all

bring that data together in that same

spreadsheet, and there are some questions

for you to analyze the data that you collected,

all of the class collected together. You’re measuring the

percent of people who would have stayed on the bus, right? So it’s a proportion. And one submission per team

in PDF format to Stellar. This is due March

7, but in order to leave you enough

time to do the analysis, the data collection efforts

should be done by February 28. So please submit your data by

the end of Tuesday, February 28 at midnight, say, or sometime

before the beginning of March in the morning,

where a person would be trying to analyze your data. OK, if you have

questions, let me know. And if not, have fun. Remember that assignment

1 is due Thursday. Eric? AUDIENCE: Just the one question:

[? is that ?] [? this is ?] going to miss anyone who is

transferred to the Red Line to then go to Kendall Square. GABRIEL SANCHEZ-MARTINEZ:

And going back to– let’s see. I forget where I had it. Well, I guess what I– there

was a point I made earlier where we can measure that from

automatically collected data, right? AUDIENCE: OK. GABRIEL SANCHEZ-MARTINEZ:

Does that make sense? AUDIENCE: Yeah, people who

[? car up ?] come from 70. GABRIEL

SANCHEZ-MARTINEZ: So if I see you tapping of

the 70 or the 70A, and then I see you

tapping at Central Square, I can infer that you

were using the service to transfer to Central Square. And then we’ll

cover ODX, which is an inference model

for destinations later in this course. But looking at the sequence

of taps, I can infer– we can infer– what the

destination of that bus trip was. We can infer that

it was the stop that was closest to Central. And later that day,

presumably the person who might be going to Kendall

Square Station after work taps to Kendall Square. So I might think, oh, he took

the Red Line from Central to Kendall. So I don’t need to ask those

people where they’re going. And anyway, they might not

care about this extension. So we’re going to stand

on the bus stop that is after Central Square and see

where those people are going and whether they would

have stayed on that bus. AUDIENCE: Is this an

actual [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Some people are proposing it. It is a real proposal. The MBTA is a big organization. So I can’t say that the

MBTA wants to do this or doesn’t want to do this. But some people are interested. And it will get looked into. So it’s useful. AUDIENCE: [? Can ?] [? we ?]

[? share ?] [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ:

Yeah, why not? AUDIENCE: [INAUDIBLE] GABRIEL SANCHEZ-MARTINEZ: Yeah. And I guess one other

thing that I– yeah, so we’re going to

probably make of this like a theme of assignments. So there’s going to be

another assignment on surface planning, operations planning. So we’re going to start looking

at this combination of Route 70 and 70A, and we’re going

to essentially make a thread of this and do

some serious planning on some scenarios where the 70

and the 70A could be merged. And they could maybe be

terminated a little– yeah, we’ll make some

changes to the service plan under some

hypothetical scenarios. And you’ll get a chance to do

an operations plan on these. And then the last homework

will be on policy, so there might be

some policy questions that I have in mind about

what we could do about service outside, on the outer

parts of the 70 and 70A. All right?