python – scraping multiple URLs with bs4

I am trying to collect patent filings from the USPTO website with BeautifulSoup.

df['link']
urls = df['link'].to_numpy()
urls
for i in urls:
    page = requests.get(i)
    ## storing the content of the page in a variable
    txt = page.text
    ## creating BeautifulSoup object
    soup = bs4.BeautifulSoup(txt, 'html.parser')
    soup

However, it only prints the content of one of the URLs, not all 5 links. I need the text of all 5 scraped pages.

Any suggestions appreciated. Cheers

LINKS I NEED TO SCRAPE

array(['http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
       'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=2&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
       'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=3&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
       'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=4&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
       'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=5&f=G&l=50&co1=AND&d=PTXT&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n'],
      dtype=object)
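
A minimal sketch of one way to keep the text of every page rather than only the last one (the loop above overwrites txt and soup on each pass, so only the final page survives). The hard-coded urls list below just keeps the sketch self-contained; it would normally come from df['link'].to_numpy() as in the question:

import bs4
import requests

urls = [
    'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1'
    '&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT'
    '&s1=g06n.CPCL.&OS=CPCL/g06n&RS=CPCL/g06n',
    # ...the other four links from the array above
]

page_texts = []  # one entry per URL, instead of overwriting txt/soup each pass
for url in urls:
    page = requests.get(url)
    soup = bs4.BeautifulSoup(page.text, 'html.parser')
    page_texts.append(soup.get_text())  # plain text of each page

for i, text in enumerate(page_texts):
    print(f'--- page {i + 1} ---')
    print(text[:200])  # first 200 characters as a quick check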

"new target URLs possible from current accounts" repeated thousands of times

I receive the same message thousands of times. I have deleted / blocked the domain / URL in the global options as well as in the specific project (after deactivating all projects except one to isolate the problem).
The message that repeats over and over is:
15:46:00: (-) 1/1 PR-0 too low – http://www.gomaze-play.de/index.php?page=Register&action=register
15:46:00: (+) 001 new target URLs possible from current accounts.
The URL is already listed in
project > options > skip sites with the following words in URL / domain

htaccess – Redirect all URLs except one URL, which must be redirected to a different URL

You just need to implement your more specific redirect first, before your "generic" redirect directive for everything else.

For example:

RewriteCond %{HTTP_HOST} example.net$ [NC]
RewriteRule ^en$ https://example.com/abc [L,R=301]

RewriteCond %{HTTP_HOST} example.net$ [NC]
RewriteRule ^ https://example.com%{REQUEST_URI} [L,R=301]

As an aside, in your existing condition:

RewriteCond %{HTTP_HOST} (\w*)example\.net$ [NC]

the (\w*) prefix is superfluous here. The pattern effectively matches the domain example.net anyway – so it still "works", but it is perhaps more than you need?

If you only want to match example.net or www.example.net (as indicated in your question), then modify this condition to read:

RewriteCond %{HTTP_HOST} ^(www\.)?example\.net$ [NC]

Or, to also capture incorrectly entered subdomains if you have a wildcard subdomain:

RewriteCond %{HTTP_HOST} ^(\w+\.)?example\.net$ [NC]

google analytics – SEO risk factor analysis: moving blog URLs from a subdomain to a subdirectory

We are in the process of moving our WordPress blog URLs from a subdomain to a subdirectory. We are not physically moving the blog; we are using a reverse proxy and 301 redirects to do this. What I mean is:

Our main website is an e-commerce marketplace, which is a YMYL site. It runs on a Windows server. For technical reasons, we cannot physically move our WordPress blog onto the main website, so we will use the following configuration.

Here is the technical configuration:

  1. Configure a reverse proxy at https://www.example.com/blog/ pointing to https://blog.example.com/
  2. Use a 301 redirect from https://blog.example.com/ to https://www.example.com/blog/, with an exception so that traffic coming through the primary WWW domain is not redirected, which avoids an infinite redirect loop.
  3. Change all absolute URLs to relative URLs on the blog
  4. Change the sitemap URL from https://blog.example.com/sitemap.xml to https://www.example.com/blog/sitemap.xml and update all the URLs mentioned in the sitemap.

SEO risk assessment

We have separate GA tracking IDs and separate Google Search Console properties for the main website and the blog. We will not merge them; they will stay separate as they are.

With this in mind, here is how I assess the SEO risk factors:

  1. Right now, when we receive traffic from the main website to the blog (or vice versa), it is counted as referral traffic and new cookies are set for Google Analytics. What will happen when everything is on the same domain?
  2. What settings should I change in the blog's Google Search Console property?
    (A). Should I request a "change of address" in the blog's Search Console property?
    (B). Should I resubmit the sitemap?
  3. Should I resubmit the blog sitemap from the https://www.example.com/ property in Google Search Console?
  4. The main website is an e-commerce marketplace, which is a YMYL site, and the blog is all about content. Does this have an impact on SEO?
  5. Will this dilute SEO link juice or have an impact on the ranking of the main website?
    (A). The average session time of the main website is around 10 minutes and the bounce rate is around 30%.
    (B). The average blog session time is 33 seconds and the bounce rate is over 92%.

I have referred to some Google case studies and guidelines, but there is nothing specific related to this case. Need the advice of an expert.

plugins – How to update all redirecting site URLs to their destination URLs at once

I transferred my Blogger managed website to WordPress.

As you know, Blogger's default permalink format includes the date before the slug and .html at the end, which, to my knowledge, negatively affects SEO. After exporting my site to WordPress, I changed the permalink format to simply mysite.com/slug. I edited the .htaccess to apply 301 redirects from all existing internal URLs to the newly applied permalinks.

As for SEO, it worked; it doesn't seem to have affected my ranking. However, I think internal URLs would serve my purpose better if they linked directly to the destination URLs rather than going through a redirect.

I am using Broken Link Checker (https://wordpress.org/plugins/broken-link-checker/) to check my broken links. It correctly identifies the URLs that redirect, even though they redirect to valid URLs. The plugin has an option called "Fix redirects" that can be applied as a bulk action. I have tried this, but I'm not sure what it does.

Is there a way for me to update all of my previous URLs to redirected / destination URLs at the same time?

htaccess – How do I redirect all URLs that end with "-2" to the same link without "-2"?

I have a comparison site based on woocommerce and I have the following problem.

I have 1 product from 2 suppliers:

Supplier 1: https://example.com/lista/ceas-accurist-signature-7220-classic/

Supplier 2: https://example.com/lista/ceas-accurist-signature-7220-classic-2/

Through .htaccess I want to 301 redirect all links that end in -2/ to the original URL without -2/.

Any idea how I can build the .htaccess rule based on this suffix -2/?

Google has successfully indexed my sitemap, but displays 0 discovered URLs. How can I fix it?

About two weeks ago, I submitted my sitemap in the webmaster tools for my site https://example.com. It was indexed successfully, but no URLs were discovered; it still shows 0. Three weeks have passed, but nothing has changed, even though I have manually indexed each post.

How will Google treat my site if 90% of the URLs redirect and internal links still point to the old URLs?

The sites are reorganized all the time, some for the better, some for the worse.

Use a permanent 301 redirect. This way, Google will understand that you want it to update its index to use the new URLs, not the old ones. Assuming your restructuring directs users to better content, you should see better long-term performance. In the short term, you might see a drop as Google re-indexes and reassesses your site and examines user responses, bounces, and Google's own periodic tests on SERP positions.

python – Scraping data from multiple URLs into a single data frame

I have a class that: 1) goes to a URL, 2) scrapes a link and a date (filing_date) from that page, 3) navigates to the link, and 4) scrapes the table on that page into a data frame.

I also want the filing_date from step #2 added to the data frame, but it is not written correctly into the data frame, probably because of the way I pass data between functions within the class. So rather than each row getting its respective filing date, like this:

                     nameOfIssuer                cik Filing Date
0    Agilent Technologies, Inc. (A)  ...  0000846222  2020-01-10
1                 Adient PLC (ADNT)  ...  0000846222  2020-01-10
..                             ...   ...         ...         ...
662            Whirlpool Corp (WHR)  ...  0000846222  2010-07-08

it only applies the latest scraped date from the previous page to all rows:

                     nameOfIssuer                cik Filing Date
0    Agilent Technologies, Inc. (A)  ...  0000846222  2010-07-08
1                 Adient PLC (ADNT)  ...  0000846222  2010-07-08
..                             ...   ...         ...         ...
662            Whirlpool Corp (WHR)  ...  0000846222  2010-07-08

I tried to store the dates in an empty list and then add them to the output data frame, but because the length of the list does not match the length of the data frame, I get ValueError: Length of values does not match length of index.

Can someone suggest what the best approach would be (for example, create another function to handle only filing_date, or maybe return a data frame instead)?

import pandas as pd
from urllib.parse import urljoin
from bs4 import BeautifulSoup, SoupStrainer
import requests

class Scraper:
    BASE_URL = "https://www.sec.gov"
    FORMS_URL_TEMPLATE = "/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type=13F"

    def __init__(self):
        self.session = requests.Session()

    def get_holdings(self, cik):
        """
        Main function that first finds the most recent 13F form and then passes
        it to scrapeForm to get the holdings for a particular institutional investor.
        """
        # get the form urls
        forms_url = urljoin(self.BASE_URL, self.FORMS_URL_TEMPLATE.format(cik=cik))
        parse_only = SoupStrainer('a', {"id": "documentsbutton"})
        soup = BeautifulSoup(self.session.get(forms_url).content, 'lxml', parse_only=parse_only)
        urls = soup.find_all('a', href=True)

        # get form document URLs
        form_urls = []
        for url in urls:
            url = url.get("href")
            url = urljoin(self.BASE_URL, str(url))

            headers = {'User-Agent': 'Mozilla/5.0'}
            page = requests.get(url, headers=headers)
            soup = BeautifulSoup(page.content, 'html.parser')

            # Get filing date and "period date"
            dates = soup.find("div", {"class": "formContent"})
            filing_date = dates.find_all("div", {"class": "formGrouping"})[0]
            filing_date = filing_date.find_all("div", {"class": "info"})[0]
            filing_date = filing_date.text

            # get form table URLs
            parse_only = SoupStrainer('tr', {"class": 'blueRow'})
            soup = BeautifulSoup(self.session.get(url).content,'lxml', parse_only=parse_only)
            form_url = soup.find_all('tr', {"class": 'blueRow'})[-1].find('a')['href']
            if ".txt" in form_url:
                pass
            else:
                form_url = urljoin(self.BASE_URL, form_url)
                # print(form_url)
                form_urls.append(form_url)

        return self.scrape_document(form_urls, cik, filing_date)

    def scrape_document(self, urls, cik, filing_date):
        """This function scrapes holdings from particular document URL"""

        cols = ['nameOfIssuer', 'titleOfClass', 'cusip', 'value', 'sshPrnamt',
                'sshPrnamtType', 'putCall', 'investmentDiscretion',
                'otherManager', 'Sole', 'Shared', 'None']

        data = []

        for url in urls:
            soup = BeautifulSoup(self.session.get(url).content, 'lxml')

            for info_table in soup.find_all(['ns1:infotable', 'infotable']):
                row = []
                for col in cols:
                    d = info_table.find([col.lower(), 'ns1:' + col.lower()])
                    row.append(d.text.strip() if d else 'NaN')
                data.append(row)

            df = pd.DataFrame(data, columns=cols)
            df['cik'] = cik
            df['Filing Date'] = filing_date

        return df

holdings = Scraper()
holdings = holdings.get_holdings("0000846222")
print(holdings)
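
A minimal sketch of one possible approach, assuming the goal is simply to keep each filing date paired with the form URL it was scraped from: collect (url, filing_date) tuples in get_holdings and tag each document's rows with its own date before concatenating. The scrape_one helper and the example URLs below are hypothetical placeholders, not part of the original class:

import pandas as pd

def scrape_one(url):
    # hypothetical stand-in for the per-document parsing done in scrape_document()
    return pd.DataFrame({'nameOfIssuer': ['Example Corp'], 'cusip': ['000000000']})

def scrape_documents(urls_and_dates, cik):
    frames = []
    for url, filing_date in urls_and_dates:
        df = scrape_one(url)
        df['cik'] = cik                   # the same CIK for every document
        df['Filing Date'] = filing_date   # the date scraped alongside this URL
        frames.append(df)
    # one combined frame; every row carries the date of the form it came from
    return pd.concat(frames, ignore_index=True)

pairs = [('https://www.sec.gov/example-form-1', '2020-01-10'),   # placeholder URLs
         ('https://www.sec.gov/example-form-2', '2010-07-08')]
print(scrape_documents(pairs, '0000846222'))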