First reading in our data:
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
Based on page 3 of https://www.irs.gov/pub/irs-soi/eo-info.pdf, foundation code 00 covers all orgs except 501c3. To filter down to 501c3 orgs, we will exclude that code. In this data it looks like it is coded as 0 instead of 00.
We start with 9997 rows.
found_00_orgs <- org_data %>% filter(BMF_FOUNDATION_CODE == "0") %>% dplyr::select(ORG_NAME_CURRENT)
found_00_orgs %>% head(n = 10) # show a couple of examples of what will be dropped
## Simple feature collection with 10 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -76.70965 ymin: 39.26811 xmax: -76.52996 ymax: 39.3452
## Geodetic CRS: WGS 84
## # A tibble: 10 × 2
## ORG_NAME_CURRENT geometry.x
## <chr> <POINT [°]>
## 1 ANCIENT FREE & ACCEPTED MASONS OF MARYLAND 210 BAL… (-76.61526 39.29176)
## 2 STA OF BALTIMORE-ILA CONTAINER ROYALTY FUND (-76.52996 39.26811)
## 3 AMERICAN CRIMINAL JUSTICE ASSOCIATION LAMBDA ALPHA… (-76.65829 39.3127)
## 4 MARYLAND SLEEP SOCIETY (-76.61849 39.30392)
## 5 AMERICAN FEDERATION OF STATE COUNTY & MUNICIPAL EM… (-76.6195 39.27516)
## 6 NATIONAL ASSOCIATION FOR THE ADVANCEMENT OF COLORE… (-76.70965 39.3452)
## 7 NATIONAL ASSOCIATION FOR THE ADVANCEMENT OF COLORE… (-76.70965 39.3452)
## 8 KNIGHTS OF COLUMBUS (-76.61726 39.32261)
## 9 CATON HEALTH CORPORATION (-76.67106 39.27302)
## 10 MYRA GRAND CHAPTER ORDER OF THE EASTERN STARS (-76.62622 39.30425)
org_data <- org_data %>% subset(is.na(BMF_FOUNDATION_CODE) | BMF_FOUNDATION_CODE != "0") # keeping NA values
After filtering out these orgs, we end with 8150 rows, removing 1847 orgs with foundation code 00.
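The `is.na()` clause matters because a comparison against a missing value evaluates to `NA`, and both `subset()` and `dplyr::filter()` drop rows where the condition is `NA`. Here is a minimal sketch on a toy data frame (the values are made up for illustration and are not from `org_data`):

```r
library(dplyr)

# Toy data (hypothetical values) standing in for org_data
toy <- data.frame(
  ORG_NAME_CURRENT    = c("A", "B", "C", "D"),
  BMF_FOUNDATION_CODE = c("0", "15", NA, "4")
)

# A bare comparison silently drops the NA row, because NA != "0" is NA
dropped_na <- toy %>% filter(BMF_FOUNDATION_CODE != "0")

# Adding is.na() keeps rows with a missing foundation code
kept_na <- toy %>%
  filter(is.na(BMF_FOUNDATION_CODE) | BMF_FOUNDATION_CODE != "0")

nrow(dropped_na) # 2 rows: B and D
nrow(kept_na)    # 3 rows: B, C and D
```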
Post_offices <- org_data %>% filter(str_detect(F990_ORG_ADDR_STREET, "PO BOX |POST OFFICE"))
Post_offices %>% dplyr::select(F990_ORG_ADDR_STREET)
## Simple feature collection with 422 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -76.69207 ymin: 39.24325 xmax: -76.54166 ymax: 39.36672
## Geodetic CRS: WGS 84
## # A tibble: 422 × 2
## F990_ORG_ADDR_STREET geometry.x
## <chr> <POINT [°]>
## 1 PO BOX 66182 (-76.58814 39.36046)
## 2 PO BOX 27297 (-76.67225 39.3118)
## 3 PO BOX 11714 (-76.5559 39.33483)
## 4 PO BOX 67612 (-76.68345 39.3461)
## 5 PO BOX 66462 (-76.58843 39.36335)
## 6 PO BOX 2035 (-76.60412 39.29132)
## 7 PO BOX 1652 (-76.60454 39.29244)
## 8 PO BOX 79502 (-76.56761 39.29284)
## 9 PO BOX 39584 (-76.60998 39.36408)
## 10 PO BOX 2727 (-76.60911 39.24325)
## # ℹ 412 more rows
org_data <- org_data %>% filter(!str_detect(F990_ORG_ADDR_STREET, "PO BOX |POST OFFICE") | is.na(F990_ORG_ADDR_STREET)) # keeping NA values
After filtering out PO Boxes we end up with 7707 rows, removing 422 orgs with a post office box address.
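The pattern above only matches the exact spellings `PO BOX ` and `POST OFFICE`. As a rough sketch, here is how it behaves on a few made-up street strings, next to a more tolerant pattern. The tolerant pattern and the example addresses are assumptions for illustration, not something taken from the data or used in this analysis:

```r
library(stringr)

# Hypothetical street values, not taken from org_data
streets <- c("PO BOX 66182", "P.O. BOX 12", "P O BOX 7",
             "POST OFFICE BOX 9", "100 N CHARLES ST", NA)

# Pattern used in the analysis
original_pattern <- "PO BOX |POST OFFICE"

# A more tolerant alternative (an assumption): optional periods and spaces
tolerant_pattern <- regex(
  "\\bP\\.?\\s*O\\.?\\s*BOX\\b|\\bPOST OFFICE\\b",
  ignore_case = TRUE
)

hits_original <- str_detect(streets, original_pattern)
hits_tolerant <- str_detect(streets, tolerant_pattern)

sum(hits_original, na.rm = TRUE) # 2 matches
sum(hits_tolerant, na.rm = TRUE) # 4 matches

# Keep non-matching rows and rows with a missing street, as in the filter above
kept <- streets[is.na(streets) | !hits_tolerant]
```

Whether variants like `P.O. BOX` actually occur in `F990_ORG_ADDR_STREET` would need to be checked against the data before changing the filter.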
Next, we look at orgs with foundation code 04 or 17 (coded here as 4 and 17). We want to set these aside: let’s keep them in their own object but remove them from the main data.
no_oper_orgs <- org_data %>% filter(BMF_FOUNDATION_CODE %in% c("4", "17")) %>% dplyr::select(ORG_NAME_CURRENT) # show examples of what will be dropped
no_oper_orgs %>% head(n = 10)
## Simple feature collection with 10 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -76.70683 ymin: 39.28089 xmax: -76.57429 ymax: 39.35976
## Geodetic CRS: WGS 84
## # A tibble: 10 × 2
## ORG_NAME_CURRENT geometry.x
## <chr> <POINT [°]>
## 1 RESPECT OUTREACH CENTER INC (-76.69427 39.33412)
## 2 JA ART WORK LTD (-76.61015 39.34546)
## 3 HUMANIZE (-76.60831 39.32435)
## 4 CREW ENSEMBLE INC (-76.61448 39.2978)
## 5 BROWN ADVISORY CHARITABLE FOUNDATION INC (-76.59497 39.28089)
## 6 SISTERS OF FELLOWSHIP (-76.57898 39.35976)
## 7 WE DO OUR BEST AT RECOVERY INC 12 STEPS (-76.57429 39.32038)
## 8 KING MEMORIAL CHILD CARE FAMILY ACADEMY INC (-76.70683 39.31009)
## 9 SHILOH RESOURCE COMMUNITY AND DEVELOPMENT CENTER (-76.68396 39.31561)
## 10 VICE SQUAD MINISTRIES INC (-76.66784 39.34583)
org_data <- org_data %>% subset(is.na(BMF_FOUNDATION_CODE) | !BMF_FOUNDATION_CODE %in% c("4", "17")) # keeping NA values
After filtering out these orgs we end up with 6843 rows, removing 864 orgs with foundation code 04 or 17.
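One detail worth noting: unlike `!=`, the `%in%` operator returns `FALSE` rather than `NA` for a missing value, so rows with a missing foundation code are kept here even without the `is.na()` clause (which is harmless). A small sketch of the set-aside split on toy values (hypothetical, not from `org_data`):

```r
# Toy foundation codes (hypothetical values), including a missing one
toy <- data.frame(
  ORG_NAME_CURRENT    = c("A", "B", "C", "D"),
  BMF_FOUNDATION_CODE = c("4", "17", "15", NA)
)

# %in% gives FALSE (not NA) for the missing code
is_set_aside <- toy$BMF_FOUNDATION_CODE %in% c("4", "17")

set_aside <- toy[is_set_aside, ]  # kept on their own
main_data <- toy[!is_set_aside, ] # everything else, including the NA row

# The two pieces should always add back up to the original
nrow(set_aside) + nrow(main_data) == nrow(toy)
```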
Filings are due May 15 each year. Since the data is from March 2024, 2023 is the most recent year with complete data.
Orgs also only need to file every 3 years. This is a sliding window that falls at different times for different orgs, but we could require that the last filing be within the last 3 complete years.
According to the dictionary:
If we assume a sliding window in which orgs need to have filed in the last 3 complete years, then we can include orgs that last filed in 2021, 2022, or 2023 (as well as the incomplete 2024).
active_orgs <- org_data %>% filter(ORG_YEAR_LAST %in% c(2024, 2023, 2022, 2021))
non_active_orgs <- org_data %>% filter(!ORG_YEAR_LAST %in% c(2024, 2023, 2022, 2021))
Before this filter we have 6843 rows. It removes 3016 orgs whose last filing year is not 2021 through 2024.
After keeping only orgs that last filed in the last 3 complete years (or in the incomplete 2024), we end up with 3827 rows.
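Rather than typing the years by hand, the window could be derived from the last complete year. This is only a sketch under the assumptions stated in the text (2023 is the last complete year, 2024 is incomplete but included). The variable names `last_complete_year`, `snapshot_year` and `window_length` are made up for this example, and the toy years are not from the data:

```r
# Assumptions taken from the text above
last_complete_year <- 2023L # data is from March 2024
snapshot_year      <- 2024L # incomplete year, still included
window_length      <- 3L    # last 3 complete years

# 2021, 2022, 2023, plus the incomplete 2024
active_years <- seq(last_complete_year - window_length + 1L, snapshot_year)

# Toy last-filing years (hypothetical values), including a missing one
toy <- data.frame(ORG_YEAR_LAST = c(2024, 2022, 2020, 2015, NA))

is_active  <- toy$ORG_YEAR_LAST %in% active_years
active     <- toy[is_active, , drop = FALSE]
non_active <- toy[!is_active, , drop = FALSE] # includes the NA row
```

Note that an org with a missing `ORG_YEAR_LAST` lands in the non-active group under this logic, which matches how the `%in%` filters above treat it.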
Now let’s remove orgs from industrial neighborhoods with Population of 0 or NA.
Let’s first check what those neighborhoods are.
## Simple feature collection with 2 features and 108 fields
## Active geometry column: geometry.x
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -76.62243 ymin: 39.25142 xmax: -76.615 ymax: 39.25437
## Geodetic CRS: WGS 84
## # A tibble: 2 × 110
## EIN2 EIN NTEE_IRS NTEE_NCCS NTEEV2 NCCS_LEVEL_1 NCCS_LEVEL_2 NCCS_LEVEL_3
## * <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 EIN-5… 5.20e8 E220 E220 <NA> 501C3 CHARI… O HE
## 2 EIN-5… 5.21e8 N67 N67 HMS-N… 501C3 CHARI… O HS
## # ℹ 102 more variables: F990_TOTAL_REVENUE_RECENT <dbl>,
## # F990_TOTAL_INCOME_RECENT <dbl>, F990_TOTAL_ASSETS_RECENT <dbl>,
## # F990_ORG_ADDR_CITY <chr>, F990_ORG_ADDR_STATE <chr>,
## # F990_ORG_ADDR_ZIP <chr>, F990_ORG_ADDR_STREET <chr>,
## # CENSUS_CBSA_FIPS <dbl>, CENSUS_CBSA_NAME <chr>, CENSUS_BLOCK_FIPS <dbl>,
## # CENSUS_URBAN_AREA <chr>, CENSUS_STATE_ABBR <chr>, CENSUS_COUNTY_NAME <chr>,
## # ORG_ADDR_FULL <chr>, ORG_ADDR_MATCH <chr>, GEOCODER_SCORE <dbl>, …
popzero <- active_orgs %>% filter(Population == 0) %>% as.data.frame()
popzero %>% dplyr::select(Name)
##                             Name
## 1 Herring Run Park
## 2 Carroll Park
## 3 Gwynns Falls/Leakin Park
## 4 Spring Garden Industrial Area
## 5 Carroll Park
## 6 Carroll Park
## 7 Spring Garden Industrial Area
## 8 Spring Garden Industrial Area
## 9 Spring Garden Industrial Area
## 10 Spring Garden Industrial Area
## 11 Spring Garden Industrial Area
## 12 Spring Garden Industrial Area
OK these make sense. Let’s remove them.
## [1] 3827 110
active_orgs <- active_orgs %>% filter(Population != 0) %>% filter(!is.na(Population))
dim(active_orgs)
## [1] 3813 110
## [1] 12 110
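As a quick sanity check on this step, here is a sketch showing that `filter(Population != 0)` already drops rows with a missing `Population`, so the second `filter()` is a safeguard rather than a separate removal. The toy rows are hypothetical. The final line assumes that the 2 features shown in the earlier check are the rows with NA `Population`, which would account for the drop from 3827 to 3813 together with the 12 zero-population rows:

```r
library(dplyr)

# Toy neighborhoods (hypothetical values)
toy <- data.frame(
  Name       = c("Some Park", "Some Industrial Area", "Some Block", "Unknown"),
  Population = c(0, 0, 1200, NA)
)

# Two-step version, as in the analysis
two_step <- toy %>% filter(Population != 0) %>% filter(!is.na(Population))

# Single comparison: the NA row is already dropped
one_step <- toy %>% filter(Population != 0)

nrow(two_step) # 1
nrow(one_step) # 1

# Row bookkeeping under the assumption stated above: 12 zero rows + 2 NA rows
rows_after <- 3827 - 12 - 2 # 3813
```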